Creating Environment for Scraping - TBDM/ju GitHub Wiki
Updated: Mar. 26th, 2017
Fetch Workflow
tbdmfetcher.py
Fetches new items that appear on https://ju.taobao.com/tg/forecast.htm and writes the corresponding task info into Redis (the MQ) and MongoDB (the DB). The fetcher uses urllib.request to request static webpages.
After each run, the fetcher sleeps for 300 seconds.
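The fetch-then-sleep cycle described above could look roughly like the sketch below. This is not the actual tbdmfetcher.py: the names `fetch_static_page`, `run_fetcher`, `handle_page` and `max_rounds` are hypothetical, and the step that writes tasks into Redis and MongoDB is left to the caller-supplied `handle_page` callback.

```python
import time
import urllib.request

FORECAST_URL = "https://ju.taobao.com/tg/forecast.htm"
FETCH_INTERVAL_SECS = 300  # pause between runs, as stated above


def fetch_static_page(url, timeout=30):
    """Download a static page with urllib.request and return the raw bytes."""
    # The User-Agent header is an assumption; the original does not say which headers it sends.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def run_fetcher(handle_page, fetch=fetch_static_page, sleep=time.sleep, max_rounds=None):
    """Fetch the forecast page, pass it to handle_page, sleep, and repeat.

    handle_page is a hypothetical callback that would extract new items and
    store their task info. max_rounds=None loops forever.
    """
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        try:
            handle_page(fetch(FORECAST_URL))
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            print("fetch failed:", exc)
        sleep(FETCH_INTERVAL_SECS)
        rounds += 1
    return rounds
```

Passing `fetch` and `sleep` as parameters is only a convenience for trying the loop without network access or real delays.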
tbdmscraper.py
The worker that scrapes dynamic pages, implemented without a class wrapper.
tbdmPipeline.py
The storyboard.
Task List Formatting
- String Format on Redis:
  score(nextReqTime)
  juID/itemID/score/status/urlType/fail
- JSON Format on Python Runtime:
  { 'juID' : str(task[0]), 'itemID' : str(task[1]), 'score' : int(task[2]), 'status' : int(task[3]), 'urlType' : int(task[4]), 'fail' : int(task[5]) }
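As a minimal sketch of how the two formats relate, the helpers below convert between the slash-separated Redis string and the Python dict. The function names `parse_task` and `format_task` are hypothetical. The sketch assumes the six fields are always present and separated by "/" exactly as written above.

```python
TASK_FIELDS = ("juID", "itemID", "score", "status", "urlType", "fail")


def parse_task(member):
    """Turn 'juID/itemID/score/status/urlType/fail' into the runtime dict."""
    if isinstance(member, bytes):  # Redis clients commonly return bytes
        member = member.decode("utf-8")
    task = member.split("/")
    if len(task) != len(TASK_FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(TASK_FIELDS), len(task)))
    return {
        "juID": str(task[0]),
        "itemID": str(task[1]),
        "score": int(task[2]),
        "status": int(task[3]),
        "urlType": int(task[4]),
        "fail": int(task[5]),
    }


def format_task(task):
    """Serialize the runtime dict back into the slash-separated string."""
    return "/".join(str(task[name]) for name in TASK_FIELDS)
```

The sample values in a call such as `parse_task("10001/20002/1490500000/0/1/0")` are made up for illustration. Passing the result through `format_task` returns the original string.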
Dependencies
CORE: (*optional item)
Python 3.6.0 with pip (See Common Build Problems for build requirements)
pip - slacker (0.9.30)
pip - pymongo (3.4.0)
pip - redis (2.10.5)
pip - selenium (3.3.1)
pip - pyvirtualdisplay (0.2.1)
Xvfb
Firefox
MongoDB Instance
Redis Instance
geckodriver (move the binary to /usr/bin)
*PhantomJS [deprecated]
*pip - requests (2.12.4)
Parsing:
*pip - beautifulsoup4 (4.5.3)
*pip - lxml (3.7.3)
Operating:
Slack (Android/iOS/PC/Mac/Linux/Web)
Studio 3T (PC/Mac/Linux)
Medis (Mac, Archlinux) / RDM (Linux/Windows/Mac)
*FTP Tools
*SSH Tools