Scraping - robertervin/amazon-project GitHub Wiki
Notes:
If you are not running this on a Debian-based machine, change the PROJECT_PATH variable in amazon-project/amazon-scraper/amazon-scraper/settings.py to where your project is stored.
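One way to avoid hard-coding the path is to derive it from the location of settings.py. The following is only a sketch: the helper name project_path_from and the number of directory levels to climb are assumptions, not something taken from the project's settings file.

```python
import os


def project_path_from(settings_file, levels_up=2):
    """Return the directory `levels_up` levels above the folder holding settings_file.

    Hypothetical helper: adjust levels_up to match where PROJECT_PATH
    is expected to point in your checkout.
    """
    path = os.path.dirname(os.path.abspath(settings_file))
    for _ in range(levels_up):
        path = os.path.dirname(path)
    return path


# In settings.py this might be used as:
# PROJECT_PATH = project_path_from(__file__)
```

Setting PROJECT_PATH to a literal string works just as well; the helper only saves editing the file on each machine.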
Testing
To test whether the parser works correctly, navigate to amazon-project/amazon_scraper and run scrapy crawl amazon.com. You can use raw_input() to pause the crawl and check that the model dictionaries contain the correct data. traceback.print_exc() is also useful for debugging exceptions.
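The debugging pattern described above could look roughly like this. build_item is a hypothetical stand-in for whatever code fills a model dictionary, and the field names are invented for illustration.

```python
import traceback


def build_item(raw):
    """Hypothetical stand-in for the code that fills a model dictionary."""
    return {"title": raw["title"].strip(), "price": float(raw["price"])}


def safe_build_item(raw):
    """Build the item, printing the full traceback instead of aborting the crawl."""
    try:
        item = build_item(raw)
    except Exception:
        traceback.print_exc()  # writes the stack trace to stderr
        return None
    # Uncomment to pause and inspect each item by hand
    # (raw_input() on Python 2, input() on Python 3):
    # raw_input("scraped %r, press Enter to continue" % (item,))
    return item
```

Remove the pause before running a full crawl, since it blocks until Enter is pressed for every item.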
If you want to test a particular search page, paste the page's URL into the start_urls list and comment out all other URLs in that list.
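For instance, the list might temporarily look like this. The URLs are placeholders, not search pages from the project, and in the spider the list is an attribute of the spider class rather than a module-level name.

```python
# Placeholder URLs for illustration only.
start_urls = [
    "https://www.amazon.com/s?k=PLACEHOLDER-PAGE-TO-TEST",
    # "https://www.amazon.com/s?k=PLACEHOLDER-OTHER-1",
    # "https://www.amazon.com/s?k=PLACEHOLDER-OTHER-2",
]
```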
Production
Navigate to amazon-project/query_titles/views.py and, in the scrape() function at the bottom, uncomment the commented-out lines and indent all spider configuration so that it sits inside setup_crawler(). This will split the URLs into chunks of 4 and spin up 4 amazon.com spiders.
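The splitting step could be sketched as follows. This reads "chunks of 4" as four groups, one per spider, which is an assumption; split_into_chunks is a hypothetical helper and not the code in views.py.

```python
def split_into_chunks(urls, n_chunks=4):
    """Split urls into n_chunks lists of near-equal size, preserving order."""
    size, extra = divmod(len(urls), n_chunks)
    chunks = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional URL each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(urls[start:end])
        start = end
    return chunks
```

Each resulting list would then serve as the start URLs for one spider. With fewer than four URLs, the trailing chunks are empty and the corresponding spiders would have nothing to crawl.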