Webscraping - strohne/Facepager GitHub Wiki
You can download webpages with the Generic module by adding a URL as a seed node and setting the base path to <Object ID>.
If you set the response format to "text", the HTML source code of each downloaded page is stored in the text property.
Use CSS selectors, XPath and regular expressions to scrape data using the pipe & modifier syntax inside of keys. Values can be passed to further functions using the pipe operator | followed by one of the modifiers css:, xpath: or re:.
The preset for scraping tables from Wikipedia is a good starting point.
The Extract data function in the Data View provides a preview for developing your keys.
For example, text|css:div.article will first select the text key and then pass the value to an XML parser in order to select all elements matching the CSS selector div.article (= all div elements containing "article" in the class attribute).
A similar selection can be made using XPath: text|xpath://div[@class='article'] will first select the text key and then pass the value to an XML parser in order to select all elements matching the given XPath expression (= all div elements whose class attribute is exactly "article").
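The CSS and XPath variants differ in one detail: a CSS class selector matches any element whose class list contains the token, while @class='article' compares the whole attribute string. The following is a minimal sketch of that difference in Python. It assumes the standard library's xml.etree.ElementTree as a stand-in parser (not necessarily the parser Facepager uses) and a made-up, well-formed snippet of markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not taken from a real page).
SAMPLE = (
    "<html><body>"
    "<div class='article'>First</div>"
    "<div class='article featured'>Second</div>"
    "<div class='sidebar'>Third</div>"
    "</body></html>"
)

root = ET.fromstring(SAMPLE)

# Exact attribute comparison, analogous to xpath://div[@class='article'].
exact = [div.text for div in root.findall(".//div[@class='article']")]

# Class-token comparison, analogous to css:div.article.
by_token = [
    div.text
    for div in root.iter("div")
    if "article" in div.get("class", "").split()
]

print(exact)     # only the div whose class attribute is exactly "article"
print(by_token)  # every div that lists "article" among its classes
```

In XPath 1.0, the usual idiom for matching a class token is contains(concat(' ', normalize-space(@class), ' '), ' article ').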
To search for text, use regular expressions. For example, text|re:[0-9]+ will extract all sequences of digits.
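To see what the pattern [0-9]+ picks up, here is a small sketch using Python's re module. It assumes the re: modifier returns every match, similar to re.findall, and the sample string is invented for illustration.

```python
import re

# Hypothetical page text, standing in for the value of the text key.
text = "Posted 2021-03-15 by user42, 7 comments"

# [0-9]+ matches each maximal run of digits, so a date is split into
# its parts and digits embedded in words are matched as well.
numbers = re.findall(r"[0-9]+", text)

print(numbers)
```

Because the pattern matches any run of digits, a more specific expression is usually needed when only certain numbers are wanted.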
You can even chain functions. For example, text|css:div.article|xpath://text() will first select the text key, then select div elements with class "article", and finally extract all text in the elements.
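Chaining can be pictured as a pipeline in which each step receives the list of values produced by the previous one. The sketch below illustrates the idea in Python and is not Facepager's implementation: the function names are hypothetical, the sample markup is invented, and xml.etree.ElementTree requires well-formed input, which real HTML often is not.

```python
import xml.etree.ElementTree as ET
from functools import reduce

# Hypothetical, well-formed sample markup.
SAMPLE = (
    "<html><body>"
    "<div class='article'>Hello <b>world</b>!</div>"
    "<div class='ad'>Buy now</div>"
    "</body></html>"
)

def select_articles(documents):
    """Step 1: parse each string and keep div elements with class 'article'."""
    return [
        div
        for doc in documents
        for div in ET.fromstring(doc).findall(".//div[@class='article']")
    ]

def all_text(elements):
    """Step 2: collect the text nodes below each selected element."""
    return [chunk for element in elements for chunk in element.itertext()]

def run_chain(value, steps):
    """Feed the initial value through every step, left to right."""
    return reduce(lambda values, step: step(values), steps, [value])

result = run_chain(SAMPLE, [select_articles, all_text])
print(result)
```

Each step returns a list, so one step can fan out into several values (here, one div yields several text nodes) without the later steps needing to know.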
If the extracted HTML or XML elements contain JSON, you can convert it from text to JSON with the json: modifier. Add a key if you are only interested in specific values of the object.
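A typical case is JSON embedded in a script element. The sketch below shows the equivalent steps in Python, with the standard library's parsers as stand-ins and an invented sample document: select the element, parse its text as JSON, then pick a single value.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup with embedded JSON.
SAMPLE = (
    "<html><head>"
    '<script type="application/ld+json">'
    '{"headline": "Example", "author": {"name": "Ada"}}'
    "</script>"
    "</head><body></body></html>"
)

root = ET.fromstring(SAMPLE)

# Select the element and take its text content (still a plain string).
raw = root.find(".//script").text

# Convert the string to a JSON object, as the json: modifier is described to do.
data = json.loads(raw)

# Picking one value corresponds to adding a key after the modifier.
author_name = data["author"]["name"]
print(author_name)
```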
By default, the result is saved in the text key. To rename the key, prefix your expression with newkey=. For example, link=text|xpath://@href will save all links (the values of all href attributes) to the link key.
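To make the structure of such expressions explicit, the following sketch splits one into its optional new key, its source key and its list of modifiers. The function parse_expression is hypothetical and written only for illustration. It is not Facepager's parser, and it does not handle a regular expression that itself contains a pipe character.

```python
def parse_expression(expression):
    """Split 'newkey=key|modifier:argument|...' into its parts."""
    first, *rest = expression.split("|")

    # Only the first segment may carry the 'newkey=' prefix. Later segments
    # can legitimately contain '=' (for example inside an XPath predicate).
    target, separator, source = first.partition("=")
    if not separator:
        target, source = None, first  # no rename requested

    modifiers = []
    for segment in rest:
        # Split at the first colon only, since arguments may contain colons.
        name, _, argument = segment.partition(":")
        modifiers.append((name, argument))

    return {"target": target, "source": source, "modifiers": modifiers}

print(parse_expression("link=text|xpath://@href"))
print(parse_expression("text|css:div.article|xpath://text()"))
```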
The pipe & modifier syntax can be used inside of keys at different places:
- You can extract elements and values when downloading data using the fields Key to extract and Key for Object ID in the Generic Module. New child nodes are created from the extracted data.
- In order to show HTML data in the Data View and export it as a CSV file, you can scrape data in the column setup.
- You can use the same keys in placeholders.
- The function Extract data right above the Detail View lets you extract data after downloading. This function follows the same logic as the downloading step. New child nodes are created from the extracted data.
See the list of supported modifiers for further post-processing options.