Beginner: I'm new to scraping and being blocked - berstend/puppeteer-extra GitHub Wiki
Problem: Your scraper is being blocked
This wiki aims to be a beginner-friendly entry point to understanding why this can happen and how to mitigate it.
Note: This document is only relevant if you are running into issues. If your custom shell script looping over curl runs fine, that's great.
Most common issues
You're using a non-browser-based scraper (curl, requests, scrapy, etc.)
- The days when this was sufficient are long gone 😄
- It's easy for a site to use JS to gather or calculate some data and require it in their backend (sent in the form of cookies, headers, or POST data)
- In addition, most sites are built with dynamic JS nowadays, so static HTML scraping won't get you far
- Solution: Switch to a scraping framework which uses a real browser (puppeteer, playwright)
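To make the point about JS-derived data concrete, here is a minimal sketch of such a check, simulated entirely in Node.js. The challenge scheme, the `js_token` cookie name, and all function names are invented for this illustration; real sites use far more elaborate schemes.

```javascript
const { createHash, randomBytes } = require('node:crypto');

// Server side: issue a random challenge that is embedded in the page.
// (Hypothetical scheme, not taken from any real site.)
function issueChallenge() {
  return randomBytes(8).toString('hex');
}

// Stand-in for the script the site ships to the browser: it derives a token
// from the challenge. A client that never executes JS never produces it.
function computeToken(challenge) {
  return createHash('sha256').update(challenge).digest('hex').slice(0, 16);
}

// Server side: reject any request whose cookies lack the expected token.
function handleRequest(challenge, cookies) {
  return cookies.js_token === computeToken(challenge) ? 200 : 403;
}

const challenge = issueChallenge();

// A real browser ran the page script and sends the token back.
console.log(handleRequest(challenge, { js_token: computeToken(challenge) })); // 200

// curl or requests fetched only the HTML and has no token to send.
console.log(handleRequest(challenge, {})); // 403
```

A non-browser client could in principle reimplement the token logic by hand, but it would have to do so again every time the site changes its script, which is why driving a real browser tends to be the more practical route.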
You're using Selenium
- Selenium is the grandfather of browser-based scraping frameworks and leaks its presence in too many ways
- This also applies to anything that is not a real browser: Scrapy's Splash, PhantomJS, Electron, CasperJS, etc.
- Solution: Don't use Selenium, use puppeteer or playwright
You're using puppeteer without stealth
- By default, a site can detect that puppeteer is in use (in both headless and headful mode)
- Solution: Use puppeteer-extra-plugin-stealth
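A minimal sketch of what that solution might look like, assuming the `puppeteer`, `puppeteer-extra` and `puppeteer-extra-plugin-stealth` packages are installed from npm. The function names and the target URL are placeholders, and the packages are required lazily only so that the file loads even when they are missing.

```javascript
let stealthPuppeteer; // cached so the plugin is registered only once

// Returns puppeteer-extra with the stealth plugin registered.
function getStealthPuppeteer() {
  if (!stealthPuppeteer) {
    // Assumption: these npm packages are installed alongside `puppeteer`.
    const puppeteer = require('puppeteer-extra');
    const StealthPlugin = require('puppeteer-extra-plugin-stealth');
    // Register the plugin before launching a browser.
    puppeteer.use(StealthPlugin());
    stealthPuppeteer = puppeteer;
  }
  return stealthPuppeteer;
}

// Hypothetical helper: open a page with stealth enabled and return its title.
async function fetchTitleWithStealth(url) {
  const browser = await getStealthPuppeteer().launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });
    return await page.title();
  } finally {
    // Always close the browser, even if navigation fails.
    await browser.close();
  }
}

// Usage (placeholder URL):
// fetchTitleWithStealth('https://example.com').then(console.log);
```

The stealth plugin reduces the number of detectable differences from a regular browser; it does not guarantee that a site will not block you.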
You're using nonsensical data
- Don't try to emulate another browser engine or device type (e.g. mobile) when using a desktop browser
- Don't use data that doesn't make sense (e.g. macOS platform with an Nvidia RTX 3080 GPU)
- Don't pretend to be the latest Chrome version (e.g. User-Agent) if you're not
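One way to catch such mistakes before a site does is to sanity-check the values you plan to present. The sketch below is only an illustration: the profile shape (`userAgent`, `platform`, `gpuRenderer`) and the rules are assumptions chosen to mirror the bullets above, not an exhaustive list of what detection scripts compare.

```javascript
// Hypothetical profile shape: { userAgent, platform, gpuRenderer }.
// `platform` mirrors navigator.platform; `gpuRenderer` mirrors the WebGL renderer string.
function findInconsistencies(profile) {
  const { userAgent, platform, gpuRenderer } = profile;
  const problems = [];

  const uaSaysMac = /Macintosh/.test(userAgent);
  const uaSaysWindows = /Windows NT/.test(userAgent);
  const uaSaysMobile = /Android|iPhone|Mobile/.test(userAgent);
  const desktopPlatform = platform === 'MacIntel' || platform === 'Win32';

  // The platform should agree with the OS named in the User-Agent.
  if (uaSaysMac && platform !== 'MacIntel') {
    problems.push('User-Agent says macOS but platform is ' + platform);
  }
  if (uaSaysWindows && platform !== 'Win32') {
    problems.push('User-Agent says Windows but platform is ' + platform);
  }
  // A desktop browser should not claim to be a mobile device.
  if (uaSaysMobile && desktopPlatform) {
    problems.push('Mobile User-Agent on a desktop platform');
  }
  // Macs do not ship with Nvidia RTX cards.
  if (platform === 'MacIntel' && /NVIDIA.*RTX/i.test(gpuRenderer)) {
    problems.push('macOS platform with an Nvidia RTX GPU');
  }
  return problems;
}

const suspicious = findInconsistencies({
  userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
  platform: 'MacIntel',
  gpuRenderer: 'ANGLE (NVIDIA, NVIDIA GeForce RTX 3080)',
});
console.log(suspicious); // [ 'macOS platform with an Nvidia RTX GPU' ]
```

The simplest way to pass checks like these is usually not to spoof anything at all and let the real browser report its real values.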
Your IP address is bad
- Don't use free proxies from the internet, they are easily detected as such
- Don't use Tor: all exit nodes are public, and the network is meant for people in need
- Don't use your home internet too often or you might experience rate-limiting or bans
- Don't use datacenter IPs or proxies, they can be detected as not being "residential"
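If you do have access to a proxy with a good IP reputation, the sketch below shows one way to route Puppeteer through it. `--proxy-server` is a Chromium command-line switch and `page.authenticate()` is part of Puppeteer's API; the helper name `buildProxyConfig` and the proxy URL are placeholders invented for this example.

```javascript
// Split a proxy URL into Chromium launch args and separate credentials.
// Credentials are kept out of --proxy-server and are meant to be supplied
// through page.authenticate() once a page is open.
function buildProxyConfig(proxyUrl) {
  const { protocol, host, username, password } = new URL(proxyUrl);
  return {
    launchOptions: { args: [`--proxy-server=${protocol}//${host}`] },
    credentials: username
      ? {
          username: decodeURIComponent(username),
          password: decodeURIComponent(password),
        }
      : null,
  };
}

// Placeholder proxy address and credentials.
const proxyConfig = buildProxyConfig('http://user:pass@proxy.example.com:8080');
console.log(proxyConfig.launchOptions.args[0]); // --proxy-server=http://proxy.example.com:8080

// Usage with Puppeteer (assumes `puppeteer` is a launched-capable import):
// const browser = await puppeteer.launch(proxyConfig.launchOptions);
// const page = await browser.newPage();
// if (proxyConfig.credentials) await page.authenticate(proxyConfig.credentials);
```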
How bot detection works
(TODO: Add more content here)