Creating Environment for Scraping - TBDM/ju GitHub Wiki

Updated: Mar. 26th, 2017

Fetch Workflow

tbdmfetcher.py

Fetches new items that appear on https://ju.taobao.com/tg/forecast.htm and initializes the corresponding task info in Redis (the MQ) and MongoDB (the DB). The fetcher uses urllib.request to request static web pages.

After each run, the fetcher suspends for 300 seconds.
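The fetch-then-sleep cycle described above could look roughly like the sketch below. It is not the actual tbdmfetcher.py code: the names fetch_page, run_fetcher, handle_page and max_rounds are hypothetical, and parsing plus the Redis/MongoDB writes are left to the caller-supplied handle_page callback.

```python
import time
import urllib.request

FORECAST_URL = "https://ju.taobao.com/tg/forecast.htm"
FETCH_INTERVAL_SECS = 300  # pause between runs, as stated above


def fetch_page(url, timeout=30):
    """Download a static page with urllib.request and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def run_fetcher(handle_page, fetch=fetch_page, sleep=time.sleep, max_rounds=None):
    """Fetch the forecast page, pass it to handle_page, then sleep; repeat.

    handle_page is a hypothetical callback that would parse the page and
    create the tasks. max_rounds=None loops forever; a number stops after
    that many rounds (handy for dry runs).
    """
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        try:
            handle_page(fetch(FORECAST_URL))
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            print("fetch failed:", exc)
        rounds += 1
        sleep(FETCH_INTERVAL_SECS)
    return rounds
```

Passing fetch and sleep as parameters is only a convenience for dry runs; calling run_fetcher(my_handler) with the defaults requests the real page every 300 seconds.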

tbdmscraper.py

The worker that scrapes dynamic pages. It is written without a class wrapper.

tbdmPipeline.py

The storyboard.

Task List Formatting

String Format on Redis:

score(nextReqTime) juID/itemID/score/status/urlType/fail

JSON Format on Python Runtime:

{
  'juID' : str(task[0]),
  'itemID' : str(task[1]),
  'score' : int(task[2]),
  'status' : int(task[3]),
  'urlType' : int(task[4]),
  'fail' : int(task[5])
}
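A minimal sketch of converting between the two formats above, assuming the Redis member is the slash-separated string with exactly six fields in the order shown. The helper names parse_task and format_task are hypothetical and are not taken from the repository.

```python
TASK_FIELDS = ("juID", "itemID", "score", "status", "urlType", "fail")


def parse_task(member):
    """Turn 'juID/itemID/score/status/urlType/fail' into the runtime dict."""
    if isinstance(member, bytes):  # redis-py returns bytes unless decoding is enabled
        member = member.decode("utf-8")
    task = member.split("/")
    if len(task) != len(TASK_FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(TASK_FIELDS), len(task)))
    return {
        "juID": str(task[0]),
        "itemID": str(task[1]),
        "score": int(task[2]),
        "status": int(task[3]),
        "urlType": int(task[4]),
        "fail": int(task[5]),
    }


def format_task(task):
    """Inverse of parse_task: build the slash-separated Redis string."""
    return "/".join(str(task[name]) for name in TASK_FIELDS)
```

For example, parse_task("123/456/1490500000/0/1/0") gives a dict whose 'score' is the integer 1490500000, and format_task on that dict returns the original string.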

Dependencies

CORE: (*optional item)

pyenv

Python 3.6.0 with pip (See Common Build Problems for build requirements)

pip - slacker (0.9.30)

pip - pymongo (3.4.0)

pip - redis (2.10.5)

pip - selenium (3.3.1)

pip - pyvirtualdisplay (0.2.1)

Xvfb

Firefox

MongoDB Instance

Redis Instance

geckodriver (mv to /usr/bin)

*PhantomJS [deprecated]

*pip - requests (2.12.4)

Parsing:

*pip - beautifulsoup4 (4.5.3)

*pip - lxml (3.7.3)

Operating:

Slack (Android/iOS/PC/Mac/Linux/Web)

Studio 3T (PC/Mac/Linux)

Medis (Mac, Arch Linux) / RDM (Linux/Windows/Mac)

*FTP Tools

*SSH Tools
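Once the pip packages above are installed, a quick check like the following sketch can confirm that they are importable in the active pyenv environment. The function name missing_modules and the two tuples are hypothetical; the import names are the usual ones for these packages (beautifulsoup4 is imported as bs4). It does not check versions or the non-Python dependencies (Xvfb, Firefox, geckodriver, MongoDB, Redis).

```python
import importlib.util

# Import names for the pip packages listed above.
CORE_MODULES = ("slacker", "pymongo", "redis", "selenium", "pyvirtualdisplay")
OPTIONAL_MODULES = ("requests", "bs4", "lxml")  # the items marked with *


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("missing core modules:", missing_modules(CORE_MODULES))
    print("missing optional modules:", missing_modules(OPTIONAL_MODULES))
```

An empty list for the core modules means every required pip package was found.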