Intro
When crawling the web nowadays, most pages are SPAs built with JS frameworks and libraries that render content dynamically. This means the easiest way to crawl them is with some kind of headless browser.
There are several options that I know of for doing this:
Selenium
Playwright
Puppeteer
…more that I am probably unaware of
For the sake of simplicity I have chosen not to look at tools like Cypress, since the focus here is automation rather than testing.
I will focus mostly on Puppeteer.
How it works
Puppeteer communicates with Chromium using the Chrome DevTools Protocol (CDP) over a websocket. In theory this is possible not only in Node.js but in any programming language, but in practice the most comprehensive implementation is the one the fine people working on Puppeteer have built. What that means is that you have access to many of the features that Chrome exposes (cookies, storage, the DOM, screenshots, etc.).
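To make the wire format concrete, here is a minimal sketch of what a CDP message looks like on that websocket: plain JSON with an id (to correlate the response), a method name, and params. The `cdpCommand` helper is my own illustration, not part of Puppeteer's API.

```javascript
// Each CDP command is a JSON message with an id, a method like
// 'Page.navigate', and an optional params object. Responses echo the id.
function cdpCommand(id, method, params = {}) {
  return JSON.stringify({ id, method, params });
}

// Roughly what a client sends over the websocket for a navigation:
const msg = cdpCommand(1, 'Page.navigate', { url: 'https://example.com' });
console.log(msg);
// → {"id":1,"method":"Page.navigate","params":{"url":"https://example.com"}}
```

Puppeteer wraps this message plumbing in a typed API so you never have to build these payloads by hand.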
The controversial case for web scraping
Web scraping is a bit of a controversial topic, and many websites tend to clamp down on automated browsing.
There is a wide range of methods to figure out whether a visitor is real or one of our machine overlords, ranging from checks on browser capabilities and cookies to CAPTCHAs and even more advanced behavioral analysis.
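As a toy illustration (this check is my own simplification; real detection combines many more signals), a site's script might inspect a couple of navigator properties. A stock headless Chrome, for instance, reports navigator.webdriver as true, which is exactly the kind of flag the stealth plugin discussed below patches over.

```javascript
// A deliberately simplified client-side bot check. Real sites combine far
// more signals (canvas fingerprints, plugin lists, timing, behavior, ...).
function looksAutomated(nav) {
  // Automated browsers typically expose webdriver = true
  if (nav.webdriver === true) return true;
  // A browser with no configured languages is suspicious
  if (!nav.languages || nav.languages.length === 0) return true;
  return false;
}

// A stock headless profile vs. a typical real one:
console.log(looksAutomated({ webdriver: true, languages: ['en-US'] }));  // true
console.log(looksAutomated({ webdriver: false, languages: ['en-US'] })); // false
```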
Warning: past this point proceed at your own risk
A way to get an overview of your current browser capabilities (the ones some websites might look at, and block you on if you don't play nice) can be found at bot.sannysoft.com.
The following snippet shows how to check Puppeteer's default profile.
// puppeteer usage as normal
import puppeteer from 'puppeteer'

puppeteer.launch({ headless: true }).then(async browser => {
  console.log('Running tests..')
  const page = await browser.newPage()
  await page.goto('https://bot.sannysoft.com')
  // make sure the ./screenshots directory exists before running
  await page.screenshot({ path: './screenshots/testresult.png', fullPage: true })
  await browser.close()
  console.log(`All done, check the screenshot. ✨`)
})
Now, if you want the site to believe you are playing nice, you need a way to get these checks passing, and that requires a few more modules.
pnpm i puppeteer-extra puppeteer-extra-plugin-stealth
And then do the same but using the stealth plugin.
// add stealth plugin and use defaults (all evasion techniques)
// note: import puppeteer from puppeteer-extra, which wraps the regular API
import puppeteer from 'puppeteer-extra'
import StealthPlugin from 'puppeteer-extra-plugin-stealth'

puppeteer.use(StealthPlugin())

// puppeteer usage as normal
puppeteer.launch({ headless: true }).then(async browser => {
  console.log('Running tests..')
  const page = await browser.newPage()
  await page.goto('https://bot.sannysoft.com')
  await page.screenshot({ path: './screenshots/testresult.png', fullPage: true })
  await browser.close()
  console.log(`All done, check the screenshot. ✨`)
})
Conclusions
When crawling, you should behave as a human would.
There is no way to fully pretend… but it is fun to try.
Be polite and don't do this at mega scale, so that you don't crash servers.
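One low-effort way to stay polite is spacing out your requests. A minimal sketch (createThrottle is a hypothetical helper of mine, not a Puppeteer API) that enforces a minimum delay between page visits:

```javascript
// Returns an async function that resolves only after at least minDelayMs
// has passed since the previous call, so requests never fire back to back.
function createThrottle(minDelayMs) {
  let last = 0;
  return async function wait() {
    const elapsed = Date.now() - last;
    if (elapsed < minDelayMs) {
      await new Promise(resolve => setTimeout(resolve, minDelayMs - elapsed));
    }
    last = Date.now();
  };
}

// Usage: await throttle() before each page.goto(...)
const throttle = createThrottle(1000);
```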