Wrapper
- [Python 3](https://www.python.org/downloads/release/python-342/)
- [boto3](http://boto3.readthedocs.org/en/latest/guide/quickstart.html)
- [BeautifulSoup4](https://pypi.python.org/pypi/beautifulsoup4/4.4.0)
- [html2text](https://github.com/aaronsw/html2text)
- [htmlparser](https://github.com/akhilram/htmlparser)
- See the crawler setup [here](https://github.com/npmajisha/bag-the-job/wiki/Crawler#aws-setup)
- In addition, create an S3 bucket for the extracted content
- Clone the repo into the EC2 instance and `cd` into `wrapper-src`
- Create a folder for log files (e.g. `logs`)
- Add configuration details to a JSON file (e.g. `config.json`)
- Run `nohup python3 wrapper.py -c <config file> -lf <logfile> > jobs/std.out 2> jobs/std.err < /dev/null &` to start the process in the background
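Put together, the steps above could look like the following session on the EC2 instance. The folder names (`logs/`, `jobs/`) and `config.json` are examples, not fixed requirements:

```shell
# Example session for the setup steps above; folder names and the
# config file name are illustrative -- adjust to your own layout.
git clone https://github.com/npmajisha/bag-the-job.git
cd bag-the-job/wrapper-src

# Folders for log files and process output
mkdir -p logs jobs

# Start the wrapper in the background, detached from the terminal
nohup python3 wrapper.py -c config.json -lf logs/wrapper.log \
    > jobs/std.out 2> jobs/std.err < /dev/null &
```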
- To extract a particular attribute from an HTML tag, add a filter of the following form to the config file:
```json
"content_filter_params": [
    {
        "tag": "<tag name>",
        "attribute": "<attribute name>",
        "value": "<attribute value>",
        "type": "<attribute to extract>",
        "target": "<target field>"
    }
]
```
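As a rough illustration of how such a filter entry could be applied, the sketch below uses BeautifulSoup4 to pull either the tag text or a named attribute out of matching tags. The `apply_content_filter` function and the sample HTML are hypothetical, not the wrapper's actual code:

```python
# Hypothetical sketch of applying content_filter_params entries with
# BeautifulSoup4; the function name and sample data are illustrative.
from bs4 import BeautifulSoup

def apply_content_filter(html, filter_params):
    """Extract the requested attribute (or text) from matching tags."""
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for f in filter_params:
        # Match tags whose given attribute has the given value,
        # e.g. <div class="job-title">...</div>
        for tag in soup.find_all(f["tag"], attrs={f["attribute"]: f["value"]}):
            if f["type"] == "text":
                record[f["target"]] = tag.get_text(strip=True)
            else:
                record[f["target"]] = tag.get(f["type"])
    return record

html = ('<div class="job-title">Data Engineer</div>'
        '<a class="apply" href="/jobs/42">Apply</a>')
filters = [
    {"tag": "div", "attribute": "class", "value": "job-title",
     "type": "text", "target": "title"},
    {"tag": "a", "attribute": "class", "value": "apply",
     "type": "href", "target": "apply_link"},
]
print(apply_content_filter(html, filters))
```

Here `"type": "text"` takes the tag's inner text, while any other value (e.g. `href`) is treated as an attribute name to read from the matched tag.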