Crawler

Required packages

  1. scrapy
  2. BeautifulSoup
  3. [html2text](https://github.com/aaronsw/html2text)
  4. htmlparser
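
A minimal install sketch, assuming Python and pip are already available on the instance; the PyPI package names (notably beautifulsoup4) are assumptions, since the list above only gives library names:

```sh
# Assumed PyPI package names for the libraries listed above
pip install scrapy beautifulsoup4 html2text

# htmlparser: on Python 3 the html.parser module ships with the standard
# library, so no separate install may be needed; on Python 2 it was HTMLParser
```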

AWS setup

  1. Create an IAM role and grant it EC2 and S3 permissions
  2. Launch a new EC2 instance and attach the role created above as its IAM role
  3. SSH into the EC2 instance and install the required packages listed above
  4. Create an S3 bucket to store the crawled pages (a CLI sketch follows this list)
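
The role, instance, and bucket can all be created from the AWS console; the sketch below covers only the bucket (step 4) and a quick check that the instance picked up the role's credentials, assuming the AWS CLI is installed. The bucket name and region are placeholders:

```sh
# Create the S3 bucket that will store crawled pages
# (bucket name and region are placeholders)
aws s3 mb s3://my-crawl-bucket --region us-east-1

# From the EC2 instance: confirm the IAM role's credentials are in use,
# so the crawler can write to S3 without explicit access keys
aws sts get-caller-identity
```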

Crawling

  1. Clone the repo onto the EC2 instance and cd into the crawler-src/crawler folder
  2. Add the configuration details to a JSON file (config.json)
  3. Run `nohup scrapy crawl gen-crawler -s JOBDIR=jobs/ -a config=<config file> > jobs/std.out 2> jobs/std.err < /dev/null &` to start the crawl in the background (see the notes after this list)
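
The JOBDIR setting enables Scrapy's persistence support, so an interrupted crawl can be resumed from the saved state. Below is a sketch of monitoring and pausing/resuming the background crawl, assuming the config file is config.json as in step 2:

```sh
# Follow crawl progress via the log files written by the command above
tail -f jobs/std.out jobs/std.err

# Pause gracefully: a single SIGINT lets Scrapy finish in-flight requests
# and persist its state to the JOBDIR
kill -INT "$(pgrep -f 'scrapy crawl gen-crawler')"

# Resume later from the saved state by re-running the same command
nohup scrapy crawl gen-crawler -s JOBDIR=jobs/ -a config=config.json \
    > jobs/std.out 2> jobs/std.err < /dev/null &
```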