# Getting Started with Scraping Links from Webpages (strohne/Facepager GitHub Wiki)
This Getting Started introduces you to the basics of webscraping with the help of Facepager. Webscraping can be used to extract information from the HTML source code of webpages.
With the Generic Module in Facepager you can download the HTML source code from a webpage and simultaneously extract all the links from that page. These links are automatically added as new nodes and can then be used as starting points for further downloads. Following the links is a technique called webcrawling. By crawling the web you can, for example, explore the Blogosphere.
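As a mental model of this level-by-level crawling (not Facepager's actual implementation), the process is a breadth-first traversal of the link tree. In the sketch below a small dictionary of made-up URLs stands in for the web, and the loop body stands in for the download-and-extract step:

```python
from collections import deque

# Toy link graph standing in for the web (hypothetical URLs).
WEB = {
    "https://example.org/a": ["https://example.org/b", "https://example.org/c"],
    "https://example.org/b": ["https://example.org/c"],
    "https://example.org/c": [],
}

def crawl(seed, max_level):
    """Breadth-first crawl: fetch each page, queue its links one level deeper."""
    seen = {seed}
    queue = deque([(seed, 1)])  # (url, level), like nodes in Facepager's tree
    tree = []
    while queue:
        url, level = queue.popleft()
        tree.append((level, url))
        if level >= max_level:
            continue  # stop at the requested level
        for link in WEB.get(url, []):  # stand-in for downloading + extracting links
            if link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    return tree
```

Calling `crawl("https://example.org/a", 2)` returns the seed on level 1 plus its links as level-2 children, which mirrors how fetched links become child nodes in Facepager's node tree.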
- **Create a database:** Click **New Database** in the Menu Bar to create a blank database.
- **Add seed node:** Add the URL of a webpage as a first node. This seed node will serve as the starting point for the scraping. Click **Add Nodes** in the Menu Bar and, for example, enter "https://en.wikipedia.org/wiki/Web_scraping".
- **Download source code and extract links:** A first query is used to download the source code and extract all the links from the webpage. Click on **Presets** in the Menu Bar and apply the preset "Extract links from webpage". This preset works with the Generic Module. Note that the setting for the response format is "links". This setting instructs Facepager to extract the links from the webpage. Adjust the download folder in the **Query Setup**. The HTML source code of the webpage will be stored in a file in the selected folder. Fetch the data by selecting the node and clicking **Fetch data**.
- **Inspect links:** Inspect the data by manually expanding your node or by using the button **Expand nodes** in the Menu Bar. You should see all the links from that webpage added as child nodes. To inspect the links, select a child node and look at the Data View to the right. To inspect the HTML source code of the webpage, go to your download folder and open the file with a text editor such as Notepad++.
- **Follow the links:** In a second query, you can download the webpages that are referenced by the URLs in the child nodes. Select the seed node and change the **Node level** in the General Settings section to 2. Then click **Fetch data**. This will automatically apply your request to all the child nodes on the second level of the tree. Inspect the links by expanding the nodes. Note: depending on the number of child nodes on that level, the download may take some time. You can speed it up by increasing the number of **Parallel Threads** in the Settings section. You can repeat the link extraction for as many levels as you wish; just increase the **Node level** step by step.
- **Setup columns:** To show data in the Nodes View, adapt the Column Setup. First, click **Clear Column Setup** underneath the "Custom Table Columns" area. Second, add keys found in the Data View into the text field. Use **Add Column** for a specific key or just use **Add All Columns**. After manually adding columns, click **Apply Column Setup**.
- **Export data:** To export the links, expand the nodes and select all the nodes you want to export (or their parent nodes). Click **Export Data** to get a CSV file. Notice the options in the export mode field of the export dialog. You can open CSV files with Excel or any statistics software you like.
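Under the hood, the link extraction in step three amounts to parsing the downloaded HTML for `href` attributes and resolving relative paths against the page URL. A minimal sketch with Python's standard library (not Facepager's actual code; the sample HTML is made up):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from every <a href="..."> tag in an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))

html = '<p><a href="/wiki/HTML">HTML</a> and <a href="https://example.org/">ext</a></p>'
parser = LinkExtractor("https://en.wikipedia.org/wiki/Web_scraping")
parser.feed(html)
# parser.links now holds the absolute URLs found on the page
```

Each extracted URL is what Facepager stores as a new child node, ready for the next crawling level.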
## What is next?
In order to analyze the data, you can use different kinds of statistics software:
- If you want to filter or sort the links, open the exported CSV file with Excel or R. You can add selected links as new nodes in Facepager if you want to follow only specific links from a webpage.
- If you want to extract information from the gathered source code, a basic understanding of HTML is useful. For an introduction to HTML, have a look at the HTML Tutorial from W3Schools. With Facepager you can extract data using CSS selectors, XPath and regular expressions.
- You can use Gephi to visualize a network of linked websites. Have a look at the Getting Started with YouTube Networks.
- A simple way to find and extract data from HTML is using a text editor such as Notepad++ and regular expressions. For more control, use a programming language such as Python and the library Beautiful Soup. Beautiful Soup was created for extracting data from HTML and XML files.
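The regular-expression approach mentioned above can be sketched in a few lines of Python. This pattern only matches double-quoted `href` values in a made-up HTML snippet; it is good enough for quick inspection, but a real parser such as Beautiful Soup is more robust against single quotes, extra whitespace and malformed markup:

```python
import re

html = '<a href="/wiki/Data_scraping">one</a> <a class="ext" href="https://example.org/">two</a>'

# Pull out every double-quoted href value. Fragile by design: a proper HTML
# parser handles single quotes, whitespace and edge cases that this misses.
links = re.findall(r'href="([^"]+)"', html)
```

The same pattern pasted into Notepad++'s regex search will highlight the links directly in a saved source file.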
To learn more about Facepager, have a look at the Basic Concepts.
Credits go to ChantalGrtnr!