Crawler

Required packages

  1. scrapy
  2. BeautifulSoup
  3. [html2text](https://github.com/aaronsw/html2text)
  4. htmlparser
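
A minimal install sketch, assuming Python and pip are already available on the instance; the PyPI package names (notably beautifulsoup4) are assumptions, since the list above only gives library names:

```sh
# Assumed PyPI package names for the libraries listed above
pip install scrapy beautifulsoup4 html2text

# htmlparser: on Python 3 the html.parser module ships with the standard
# library, so no separate install may be needed; on Python 2 it was HTMLParser
```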

AWS setup

  1. Create an IAM role and grant it EC2 and S3 permissions
  2. Launch a new EC2 instance and attach the role created above as its IAM role
  3. SSH into the EC2 instance and install the required packages listed above
  4. Create an S3 bucket to store the crawled pages (a CLI sketch follows this list)
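
The role, instance, and bucket can all be created from the AWS console; the sketch below covers only the bucket (step 4) and a quick check that the instance picked up the role's credentials, assuming the AWS CLI is installed. The bucket name and region are placeholders:

```sh
# Create the S3 bucket that will store crawled pages
# (bucket name and region are placeholders)
aws s3 mb s3://my-crawl-bucket --region us-east-1

# From the EC2 instance: confirm the IAM role's credentials are in use,
# so the crawler can write to S3 without explicit access keys
aws sts get-caller-identity
```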

Crawling

  1. Clone the repo onto the EC2 instance and cd into the crawler-src/crawler folder
  2. Add the configuration details to a JSON file (config.json)
  3. Run `nohup scrapy crawl gen-crawler -s JOBDIR=jobs/ -a config=<config file> > jobs/std.out 2> jobs/std.err < /dev/null &` to start the crawl in the background (see the notes after this list)
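
The JOBDIR setting enables Scrapy's persistence support, so an interrupted crawl can be resumed from the saved state. Below is a sketch of monitoring and pausing/resuming the background crawl, assuming the config file is config.json as in step 2:

```sh
# Follow crawl progress via the log files written by the command above
tail -f jobs/std.out jobs/std.err

# Pause gracefully: a single SIGINT lets Scrapy finish in-flight requests
# and persist its state to the JOBDIR
kill -INT "$(pgrep -f 'scrapy crawl gen-crawler')"

# Resume later from the saved state by re-running the same command
nohup scrapy crawl gen-crawler -s JOBDIR=jobs/ -a config=config.json \
    > jobs/std.out 2> jobs/std.err < /dev/null &
```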