Getting Started with Scraping Links from Webpages

This Getting Started guide introduces you to the basics of web scraping with Facepager. Web scraping can be used to extract information from the HTML source code of webpages.

With the Generic Module in Facepager you can download the HTML source code of a webpage and simultaneously extract all the links from that page. These links are automatically added as new nodes and can then be used as starting points for further downloads. Following the links from page to page is a technique called web crawling. By crawling the web you can, for example, explore the blogosphere.
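
If you are curious what happens behind the scenes, the sketch below shows the same idea in plain Python: download a page and collect the absolute URLs of all links on it. This is only an illustration, assuming the third-party packages requests and beautifulsoup4 are installed; the Generic Module does all of this for you through the GUI, and the sketch is not how Facepager is implemented internally.

```python
# Illustration of link extraction, assuming requests and beautifulsoup4.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def extract_links(url):
    """Download a page and return the absolute URLs of all links on it."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # <a href="..."> elements carry the links; urljoin resolves relative URLs.
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]


if __name__ == "__main__":
    for link in extract_links("https://en.wikipedia.org/wiki/Web_scraping"):
        print(link)
```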

  1. Create a database: Click New Database in the Menu Bar to create a blank database.

  2. Add seed node: Add the URL of a webpage as a first node. This seed node will serve as the starting point for the scraping. Click Add Nodes in the Menu Bar and, for example, enter "https://en.wikipedia.org/wiki/Web_scraping".

  3. Download source code and extract links: A first query is used to download the source code and extract all the links from the webpage. Click on Presets in the Menu Bar and apply the preset "Extract links from webpage". This preset works with the Generic Module. Note that the setting for the response format is "links". This setting instructs Facepager to extract the links from the webpage. Adjust the download folder in the Query Setup. The HTML source code of the webpage will be stored in a file in the selected folder. Fetch the data by selecting the node and clicking Fetch data.

  4. Inspect links: Inspect the data by manually expanding your node or by using the button Expand nodes in the Menu Bar. You should see all the links from that webpage added as child nodes. To inspect the links, select a child node and look at the Data View to the right. To inspect the HTML source code of the webpage, go to your download folder and open the file with a text editor such as Notepad++.

  5. Follow the links: In a second query, you can download the webpages that are referenced by the URLs in the child nodes. Select the seed node and change the Node level in the General Settings section to 2. Then click Fetch data. This will automatically apply your request to all the child nodes on the second level of the tree. Inspect the links by expanding the nodes. Note: Depending on the number of child nodes on that level, the download may take some time. You can speed it up by increasing the number of Parallel Threads in the Settings section. You can repeat the link extraction for as many levels as you wish; just increase the Node level step by step. See the Python sketch after this list for an illustration of a two-level crawl.

  6. Set up columns: To show data in the Nodes View, adapt the column setup. First, click Clear Column Setup underneath the "Custom Table Columns" area. Then add keys found in the Data View to the text field: use Add Column for a specific key or Add All Columns to add every key at once. After adding columns, click Apply Column Setup.

  7. Export data: To export the links, expand the nodes and select all the nodes you want to export (or their parent nodes). Click Export Data to get a CSV file. Notice the options in the export mode field of the export dialog. You can open CSV files with Excel or any statistics software you like.
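
As referenced in step 5, the following Python sketch illustrates the idea behind a two-level crawl with parallel downloads. It assumes the extract_links() helper from the sketch above and uses a thread pool in the role of Facepager's Parallel Threads setting; it is a sketch of the technique, not Facepager's actual implementation.

```python
# Two-level crawl sketch; assumes extract_links() from the sketch above.
from concurrent.futures import ThreadPoolExecutor

seed = "https://en.wikipedia.org/wiki/Web_scraping"

# Level 1: the links found on the seed page (the child nodes in Facepager).
level_1 = extract_links(seed)


def safe_extract(url):
    """Extract links from one page, ignoring pages that fail to load."""
    try:
        return url, extract_links(url)
    except Exception:
        return url, []


# Level 2: fetch the child pages in parallel; the thread pool plays the role
# of the Parallel Threads setting. The slice keeps the example small.
with ThreadPoolExecutor(max_workers=4) as pool:
    level_2 = dict(pool.map(safe_extract, level_1[:20]))

for parent, children in level_2.items():
    print(parent, "->", len(children), "links")
```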

What is next?
In order to analyze the data, you can continue with different kinds of software:

  • If you want to filter or sort the links, open the exported CSV file with Excel or R; a Python variant is sketched after this list. You can add selected links as new nodes in Facepager if you want to follow only specific links from a webpage.
  • If you want to extract information from the gathered source code, a basic understanding of HTML is useful. For an introduction to HTML, have a look at the HTML Tutorial from W3schools. With Facepager you can extract data using CSS selectors, XPath and regular expressions; a Python sketch using CSS selectors and regular expressions follows after this list.
  • You can use Gephi to visualize a network of linked websites. Have a look at the Getting Started with YouTube Networks.
  • A simple way to find and extract data from HTML is to use a text editor such as Notepad++ together with regular expressions. For more control, use a programming language such as Python and the library Beautiful Soup, which was created for extracting data from HTML and XML files.
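
As a Python alternative to Excel or R for filtering the exported links, here is a minimal sketch using pandas. The file name links.csv, the semicolon separator and the column name url are assumptions; adjust them to your actual export and column setup.

```python
# Filter and sort exported links with pandas; file name, separator and
# column name are assumptions that depend on your export and column setup.
import pandas as pd

links = pd.read_csv("links.csv", sep=";")

# Keep only links that point to Wikipedia and sort them alphabetically.
wikipedia = links[links["url"].str.contains("wikipedia.org", na=False)]
wikipedia = wikipedia.sort_values("url")

# Save the selection, e.g. to add the URLs back into Facepager as new nodes.
wikipedia.to_csv("selected_links.csv", index=False, sep=";")
```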
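
And here is a minimal sketch of extracting data from one of the downloaded HTML files with Beautiful Soup and a regular expression. The file name and the selectors are assumptions; adapt them to the structure of the pages you scraped.

```python
# Extract data from a stored HTML file; file name and selectors are examples.
import re

from bs4 import BeautifulSoup

# Open one of the HTML files from your download folder.
with open("downloaded_page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# CSS selector: the text of the first-level heading, if there is one.
heading = soup.select_one("h1")
title = heading.get_text(strip=True) if heading else ""

# CSS selector: all absolute links in the document.
external_links = [a["href"] for a in soup.select("a[href^='http']")]

# Regular expression: all four-digit years mentioned in the page text.
years = re.findall(r"\b(?:19|20)\d{2}\b", soup.get_text())

print(title)
print(len(external_links), "absolute links,", len(years), "year mentions")
```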

To learn more about Facepager, have a look at the Basic Concepts.

Credits go to ChantalGrtnr!
