Scraping - robertervin/amazon-project GitHub Wiki

Notes:

If you are not running this on a Debian-based machine, set the PROJECT_PATH variable in amazon-project/amazon-scraper/amazon-scraper/settings.py to the directory where your project is stored.
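As a rough illustration, the override in settings.py might look like the sketch below. The default value of PROJECT_PATH and the way the project reads it are not shown on this page, so the home-directory location and the amazon-project directory name are assumptions.

```python
import os

# Hypothetical override: point PROJECT_PATH at the directory where the
# repository was cloned. Both the location (home directory) and the
# directory name are assumptions; adjust them to match your machine.
PROJECT_PATH = os.path.join(os.path.expanduser("~"), "amazon-project")
```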

Testing

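To test whether the parser works correctly, navigate to amazon-project/amazon_scraper and run scrapy crawl amazon.com. To check that the model dictionaries contain the correct data, print them and pause the spider with raw_input() (input() on Python 3) so you can inspect each one before the crawl continues. When debugging exceptions, traceback.print_exc() is useful: it prints the full traceback of the exception currently being handled.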
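The debugging approach described above could be wrapped in a small helper like the one below. This is a sketch, not code from the project: inspect_item and build_item are hypothetical names, and build_item stands in for whatever spider code fills a model dictionary. It is written for Python 3, where input() replaces raw_input().

```python
import traceback


def inspect_item(build_item, pause=input):
    """Build one item, print it, and wait for Enter before continuing.

    `build_item` is a hypothetical zero-argument callable standing in for
    the spider code that fills a model dictionary.
    """
    try:
        item = build_item()
    except Exception:
        # Print the traceback of the exception being handled, then carry on
        # so one bad page does not stop the whole test run.
        traceback.print_exc()
        return None
    print(item)
    pause("Press Enter to continue...")  # raw_input(...) on Python 2
    return item
```

Passing a different pause callable (for example, one that returns immediately) lets the same helper run without blocking.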

If you want to test a particular search page, paste the page's URL into the start_urls list and comment out all other URLs in that list.
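For example, a start_urls list narrowed to a single page might look like this. The URLs are placeholders chosen for illustration, not the project's real list.

```python
# Placeholder URLs for illustration only; substitute the search page you
# actually want to test.
start_urls = [
    "https://www.amazon.com/s?k=placeholder-one",  # the page under test
    # "https://www.amazon.com/s?k=placeholder-two",
    # "https://www.amazon.com/s?k=placeholder-three",
]
```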

Production

Open amazon-project/query_titles/views.py. In the scrape() function at the bottom of the file, uncomment the commented-out lines and indent all of the spider configuration so that it sits inside setup_crawler(). This splits the URLs into chunks of 4 and spins up 4 amazon.com spiders.
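The splitting step can be sketched as follows. The page does not say whether "chunks of 4" means four groups or groups of four URLs; this sketch assumes four groups, one per spider. split_into_groups is a hypothetical helper, not the code in views.py, and the URLs are placeholders.

```python
def split_into_groups(urls, n_groups=4):
    """Split `urls` into `n_groups` lists whose sizes differ by at most one."""
    base, extra = divmod(len(urls), n_groups)
    groups, start = [], 0
    for i in range(n_groups):
        # The first `extra` groups take one additional URL each.
        size = base + (1 if i < extra else 0)
        groups.append(urls[start:start + size])
        start += size
    return groups


urls = ["url-%d" % i for i in range(10)]  # placeholder URLs
for group in split_into_groups(urls):
    # In the project, each group would be handed to its own spider.
    print(len(group), group)
```

Keeping the group sizes within one of each other means no single spider ends up with a much longer queue than the rest.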