User browser vs. Puppeteer


Intro

When crawling the web nowadays, most pages are SPAs that use various JS frameworks and libraries to render content dynamically. This means the easy way to crawl them is with some kind of headless browser.

There are several options that I know for doing this:

Selenium
Playwright
Puppeteer
…more that I am probably unaware of

For the sake of simplicity I have chosen not to look at tools like Cypress, since the focus here is not testing but automation.

I will focus mostly on Puppeteer.

How it works

Puppeteer communicates with Chromium over the Chrome DevTools Protocol (CDP) via a WebSocket. In theory this is possible not only from Node.js but from any programming language; in practice, the most comprehensive implementation is the one the fine people working on Puppeteer have built. In concrete terms, that means you have access to many of the features that are accessible from Chrome (cookies, storage, the DOM, screenshots etc…).
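
As a minimal sketch of what that looks like in practice, the snippet below opens a raw CDP session alongside the regular Puppeteer API and reads the browser's cookies directly over the protocol. The target URL and the specific command are only an illustration.

import puppeteer from 'puppeteer'

// a minimal sketch: talk to Chromium over a raw CDP session
puppeteer.launch({ headless: true }).then(async browser => {
  const page = await browser.newPage()
  await page.goto('https://example.com')

  // every page has an underlying CDP session you can use directly
  const client = await page.target().createCDPSession()

  // send a raw protocol command, e.g. read all cookies known to the browser
  const { cookies } = await client.send('Network.getAllCookies')
  console.log(cookies)

  await browser.close()
})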

The controversial case for web scraping

Web scraping is a bit of a controversial topic, and many websites tend to clamp down on automated browsing.

There is a wide range of methods to figure out whether a visitor is real or one of our machine overlords, ranging from checks on browser capabilities and cookies, through CAPTCHAs, to more advanced behavioral analysis.
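
To give a concrete idea of the simplest end of that spectrum, here is a rough sketch of the kind of capability check a site might run in the visitor's browser. Real detection scripts are far more elaborate; these particular heuristics are just well-known examples.

// a naive browser-side sketch of "is this visitor automated?"
function looksAutomated() {
  // headless Chrome historically exposed itself via navigator.webdriver
  if (navigator.webdriver) return true
  // a regular Chrome install normally defines window.chrome
  if (/Chrome/.test(navigator.userAgent) && typeof window.chrome === 'undefined') return true
  // headless browsers often report no plugins and no languages
  if (navigator.plugins.length === 0) return true
  if (!navigator.languages || navigator.languages.length === 0) return true
  return false
}

console.log(looksAutomated() ? 'probably a bot' : 'probably a human')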

Warning: past this point proceed at your own risk

A way to get an overview of your current browser capabilities (the ones some websites might look at, and block you over if you don’t play nice) can be found here.

The following snippet shows how to check Puppeteer’s default profile.

import puppeteer from 'puppeteer'

// puppeteer usage as normal
puppeteer.launch({ headless: true }).then(async browser => {
  console.log('Running tests..')
  const page = await browser.newPage()
  await page.goto('https://bot.sannysoft.com')
  await page.screenshot({ path: './screenshots/testresult.png', fullPage: true })
  await browser.close()
  console.log(`All done, check the screenshot. ✨`)
})

Now, if you want the site to believe you are playing nice, you need a way to get these checks passing, and for that you need a few more modules.

# with pnpm you can install the required packages as follows
pnpm i puppeteer-extra puppeteer-extra-plugin-stealth

And then do the same but using the stealth plugin.

import puppeteer from 'puppeteer-extra'

// add stealth plugin and use defaults (all evasion techniques)
import StealthPlugin from 'puppeteer-extra-plugin-stealth'
puppeteer.use(StealthPlugin())

// puppeteer usage as normal
puppeteer.launch({ headless: true }).then(async browser => {
  console.log('Running tests..')
  const page = await browser.newPage()
  await page.goto('https://bot.sannysoft.com')
  await page.screenshot({ path: './screenshots/testresult.png', fullPage: true })
  await browser.close()
  console.log(`All done, check the screenshot. ✨`)
})
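
If you only want some of the evasion techniques, the plugin instance exposes the enabled ones as a set you can prune before registering it. The sketch below assumes the plugin’s enabledEvasions API as described in its documentation; treat the specific evasion name as an example.

import puppeteer from 'puppeteer-extra'
import StealthPlugin from 'puppeteer-extra-plugin-stealth'

// the plugin ships with all evasions enabled; drop individual ones if needed
const stealth = StealthPlugin()
stealth.enabledEvasions.delete('iframe.contentWindow')
puppeteer.use(stealth)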

Conclusions

When crawling you should behave as a human would (a tiny sketch of this follows below)
There is no way to fully pretend…but it is fun to try.
Be polite and don’t do this at mega scale, so that you don’t crash servers
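
As a tiny illustration of the first point, a hypothetical pause helper like the one below can space out actions so a crawl does not hammer a site at machine speed. The helper name and the timings are my own choices, not anything Puppeteer provides.

// a hypothetical helper to space out actions like a (slow) human would
const pause = (min, max) =>
  new Promise(resolve => setTimeout(resolve, min + Math.random() * (max - min)))

// example usage inside a crawl:
// await page.goto('https://example.com')
// await pause(1000, 3000) // wait 1-3 seconds before the next action
// await page.click('a.next')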
