Intro
When crawling the web nowadays, most pages are SPAs built with JS frameworks and libraries that render content dynamically. This means the easiest way to crawl them is with some kind of headless browser.
There are several options that I know of for doing this:
Selenium
Playwright
Puppeteer
…more that I am probably unaware of
For the sake of simplicity I have chosen not to look at tools like Cypress, since the focus here is automation rather than testing.
I will focus mostly on Puppeteer.
How it works
Puppeteer communicates with Chromium using the Chrome DevTools Protocol (CDP) over a websocket. In theory this is possible not only in Node.js but in any programming language, but in practice the most comprehensive implementation is the one the fine people working on Puppeteer have built. What that means is that you have access to many of the features that Chrome exposes (cookies, storage, the DOM, screenshots, etc.).
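To make the wire format concrete, here is a minimal sketch of what a CDP message looks like on that websocket: plain JSON with an id (to correlate the response), a method name, and params. The `cdpCommand` helper is my own illustration, not part of Puppeteer's API.

```javascript
// Each CDP command is a JSON message with an id, a method like
// 'Page.navigate', and an optional params object. Responses echo the id.
function cdpCommand(id, method, params = {}) {
  return JSON.stringify({ id, method, params });
}

// Roughly what a client sends over the websocket for a navigation:
const msg = cdpCommand(1, 'Page.navigate', { url: 'https://example.com' });
console.log(msg);
// → {"id":1,"method":"Page.navigate","params":{"url":"https://example.com"}}
```

Puppeteer wraps this message plumbing in a typed API so you never have to build these payloads by hand.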
The controversial case for web scraping
Web scraping is a bit of a controversial topic, and many websites tend to clamp down on automated browsing.
There is a wide range of methods to figure out whether a visitor is real or one of our machine overlords, ranging from checks on browser capabilities and cookies to CAPTCHAs and even more advanced behavioral analysis.
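As a toy illustration (this check is my own simplification; real detection combines many more signals), a site's script might inspect a couple of navigator properties. A stock headless Chrome, for instance, reports navigator.webdriver as true, which is exactly the kind of flag the stealth plugin discussed below patches over.

```javascript
// A deliberately simplified client-side bot check. Real sites combine far
// more signals (canvas fingerprints, plugin lists, timing, behavior, ...).
function looksAutomated(nav) {
  // Automated browsers typically expose webdriver = true
  if (nav.webdriver === true) return true;
  // A browser with no configured languages is suspicious
  if (!nav.languages || nav.languages.length === 0) return true;
  return false;
}

// A stock headless profile vs. a typical real one:
console.log(looksAutomated({ webdriver: true, languages: ['en-US'] }));  // true
console.log(looksAutomated({ webdriver: false, languages: ['en-US'] })); // false
```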
Warning: past this point proceed at your own risk
A way to get an overview of your current browser capabilities (the ones some websites might look at, and block you on if you don't play nice) can be found at bot.sannysoft.com.
The following snippet shows how to check Puppeteer's default profile.
// puppeteer usage as normal
import puppeteer from 'puppeteer'

puppeteer.launch({ headless: true }).then(async browser => {
  console.log('Running tests..')
  const page = await browser.newPage()
  await page.goto('https://bot.sannysoft.com')
  // make sure the ./screenshots directory exists before running
  await page.screenshot({ path: './screenshots/testresult.png', fullPage: true })
  await browser.close()
  console.log(`All done, check the screenshot. ✨`)
})
Now, if you want the site to believe you are playing nice, you need a way to get these checks passing, and that requires a few more modules.
pnpm i puppeteer-extra puppeteer-extra-plugin-stealth
And then do the same but using the stealth plugin.
// add stealth plugin and use defaults (all evasion techniques)
// note: import puppeteer from puppeteer-extra, which wraps the regular API
import puppeteer from 'puppeteer-extra'
import StealthPlugin from 'puppeteer-extra-plugin-stealth'

puppeteer.use(StealthPlugin())

// puppeteer usage as normal
puppeteer.launch({ headless: true }).then(async browser => {
  console.log('Running tests..')
  const page = await browser.newPage()
  await page.goto('https://bot.sannysoft.com')
  await page.screenshot({ path: './screenshots/testresult.png', fullPage: true })
  await browser.close()
  console.log(`All done, check the screenshot. ✨`)
})
Conclusions
When crawling, you should behave as a human would.
There is no way to fully pretend… but it is fun to try.
Be polite and don't do this at mega scale, so that you don't crash servers.
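One low-effort way to stay polite is spacing out your requests. A minimal sketch (createThrottle is a hypothetical helper of mine, not a Puppeteer API) that enforces a minimum delay between page visits:

```javascript
// Returns an async function that resolves only after at least minDelayMs
// has passed since the previous call, so requests never fire back to back.
function createThrottle(minDelayMs) {
  let last = 0;
  return async function wait() {
    const elapsed = Date.now() - last;
    if (elapsed < minDelayMs) {
      await new Promise(resolve => setTimeout(resolve, minDelayMs - elapsed));
    }
    last = Date.now();
  };
}

// Usage: await throttle() before each page.goto(...)
const throttle = createThrottle(1000);
```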