Wrapper
- [Python 3](https://www.python.org/downloads/release/python-342/)
- [boto3](http://boto3.readthedocs.org/en/latest/guide/quickstart.html)
- [BeautifulSoup4](https://pypi.python.org/pypi/beautifulsoup4/4.4.0)
- [html2text](https://github.com/aaronsw/html2text)
- [htmlparser](https://github.com/akhilram/htmlparser)
- See the crawler setup [here](https://github.com/npmajisha/bag-the-job/wiki/Crawler#aws-setup)
- In addition, create an S3 bucket for the extracted content
- Clone the repo into the EC2 instance and `cd` into `wrapper-src`
- Create a folder for log files (e.g. `logs`)
- Add configuration details to a JSON file (e.g. `config.json`)
- Run `nohup python3 wrapper.py -c <config file> -lf <logfile> > jobs/std.out 2> jobs/std.err < /dev/null &` to start the process in the background
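Put together, the steps above could look like the following session on the EC2 instance. The folder names (`logs/`, `jobs/`) and `config.json` are examples, not fixed requirements:

```shell
# Example session for the setup steps above; folder names and the
# config file name are illustrative -- adjust to your own layout.
git clone https://github.com/npmajisha/bag-the-job.git
cd bag-the-job/wrapper-src

# Folders for log files and process output
mkdir -p logs jobs

# Start the wrapper in the background, detached from the terminal
nohup python3 wrapper.py -c config.json -lf logs/wrapper.log \
    > jobs/std.out 2> jobs/std.err < /dev/null &
```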
- To extract a particular attribute from an HTML tag, add a filter of the following form to the config file:
```json
"content_filter_params": [
    {
        "tag": "<tag name>",
        "attribute": "<attribute name>",
        "value": "<attribute value>",
        "type": "<attribute to extract>",
        "target": "<target field>"
    }
]
```
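As a rough illustration of how such a filter entry could be applied, the sketch below uses BeautifulSoup4 to pull either the tag text or a named attribute out of matching tags. The `apply_content_filter` function and the sample HTML are hypothetical, not the wrapper's actual code:

```python
# Hypothetical sketch of applying content_filter_params entries with
# BeautifulSoup4; the function name and sample data are illustrative.
from bs4 import BeautifulSoup

def apply_content_filter(html, filter_params):
    """Extract the requested attribute (or text) from matching tags."""
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for f in filter_params:
        # Match tags whose given attribute has the given value,
        # e.g. <div class="job-title">...</div>
        for tag in soup.find_all(f["tag"], attrs={f["attribute"]: f["value"]}):
            if f["type"] == "text":
                record[f["target"]] = tag.get_text(strip=True)
            else:
                record[f["target"]] = tag.get(f["type"])
    return record

html = ('<div class="job-title">Data Engineer</div>'
        '<a class="apply" href="/jobs/42">Apply</a>')
filters = [
    {"tag": "div", "attribute": "class", "value": "job-title",
     "type": "text", "target": "title"},
    {"tag": "a", "attribute": "class", "value": "apply",
     "type": "href", "target": "apply_link"},
]
print(apply_content_filter(html, filters))
```

Here `"type": "text"` takes the tag's inner text, while any other value (e.g. `href`) is treated as an attribute name to read from the matched tag.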