URLs and Placeholders - strohne/Facepager GitHub Wiki

Facepager helps you to extract data from the web by interacting with APIs, scraping webpages, or downloading and uploading files. In all three cases, URLs are used to access the data.

The underlying principle for data extraction in Facepager, detailed below, is as follows:

  • A resource on the web containing the data is addressed by a URL. Facepager assists in assembling the URL.
  • Parts of the URL can be replaced with placeholders. This simplifies assembling URLs for a large number of resources.
  • The placeholders are filled in Facepager with data from the query setup and the nodes using extraction keys.
  • To upload files instead of extracting data, a special kind of placeholder is used to access the content of a file.

URLs

URL is the short form of "Uniform Resource Locator". It refers to the unique address of a resource on the web – like an HTML webpage, an image or a video. Every URL consists of individual components that follow a certain structure:

In Facepager, the URLs are assembled from data in the Query Setup:

Name | Function
Protocol | The protocol determines how data is transferred over the Internet. HTTP is short for "Hypertext Transfer Protocol" and, together with its secure version HTTPS, is the standard protocol of the web. Browsers use it to load the content of a webpage. Another protocol is FTP (File Transfer Protocol), which is used to transfer files, e.g. from a computer to a webserver.
Domain | Domains are used to refer to the webserver hosting the website. A webserver can also be identified by an IP address, though using a domain name is easier than typing a sequence of numbers. Each domain consists of several parts. In this example, "www" is the subdomain, "youtube.com" is the domain and ".com" is the top-level domain.
Base path | In Facepager, you can find the field "Base path" in the Query Setup. The base path is assembled from the protocol and the domain. It can also contain parts of the path, e.g. when you work with a specific version of an API. Data is usually requested with different queries from the same webserver, so you can put the common part of the URLs into the Base path field. For example, https://www.youtube.com is the base path if you interact with YouTube, and https://www.googleapis.com/youtube/v3 is the base path for interacting with version 3 of the YouTube API.
Path to Resource | The path refers to a unique resource, e.g. a file or folder, on the webserver. The path comes after the domain and is separated from it by a "/". Often, paths mirror the underlying structure of websites. When using APIs, paths are sometimes called endpoints because they are the entry points to the service.
Parameters | Additional information is often transported via parameters. The parameter list starts with a "?", and each parameter follows the structure "key=value". Multiple key-value pairs are separated by "&". In the example above, the URL refers to a YouTube video: the value "dbTREHtu1O0" (the ID of a video) is assigned to the key "v" (which stands for the resource video).
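The components in the table above can be inspected with Python's standard library. The following is a minimal sketch; the full URL is an assumption reconstructed from the YouTube example (the "/watch" path in particular is not stated in the text), and the variable names are only illustrative.

```python
from urllib.parse import urlsplit, parse_qs

# Assumed full form of the YouTube example URL (the "/watch" path is an assumption).
url = "https://www.youtube.com/watch?v=dbTREHtu1O0"

parts = urlsplit(url)
protocol = parts.scheme          # the protocol, here "https"
domain = parts.netloc            # the domain including the subdomain
path = parts.path                # the path to the resource
params = parse_qs(parts.query)   # parameters as a dict of key -> list of values

# The base path in the sense used above: protocol plus domain.
base_path = f"{protocol}://{domain}"

print(protocol, domain, path, params, base_path)
```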

Did you know? When interacting with webservers, for example to extract data, additional information is required besides the URL. This can be authentication data or details about the browser. This additional information is contained in the so-called headers. Facepager automatically compiles the headers and sends them with the request. Learn more about headers and how to find them in a browser on the page about the Generic Module.

The structure of URLs is (more or less) uniform, so the individual components can be exchanged. In Facepager, these components can be replaced with placeholders.

Placeholders

Instead of writing out the full URL for every request, you can use placeholders. These are very helpful because they allow you to use the data of your nodes, so you don’t need to formulate a URL manually for every single node. In the Query Setup, any text in angle brackets (like <Object ID>) is handled as a placeholder. You may use placeholders in the Base path field, in the Resource field and in the parameter values.

For example, the URL https://twitter.com/TheAcademy/following links to all the accounts followed by TheAcademy on Twitter. Similarly, the URL https://twitter.com/HBO/following links to all the accounts followed by HBO. To get the list of followed accounts for several accounts, the part of the path that refers to the Twitter account can be replaced by a placeholder: https://twitter.com/<Object ID>/following. When you add TheAcademy and HBO as seed nodes, Facepager automatically assembles the URLs for both accounts, as if you had typed them in manually. The placeholder <Object ID> always refers to the first column in the data view.
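The substitution described in the Twitter example can be illustrated in a few lines of Python. This is only a sketch of the idea using plain string replacement, not Facepager's own code; the names template and seed_nodes are hypothetical.

```python
# URL pattern with the <Object ID> placeholder in the path.
template = "https://twitter.com/<Object ID>/following"

# Seed nodes; their Object IDs are the values in the first column.
seed_nodes = ["TheAcademy", "HBO"]

# One URL per node, with the placeholder replaced by the node's Object ID.
urls = [template.replace("<Object ID>", node) for node in seed_nodes]

for url in urls:
    print(url)
```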

There are three types of placeholders:

  1. The placeholder <Object ID> is always replaced by the Object ID of the node Facepager is currently fetching data for. The Object ID is given in the first column of the Nodes View.
  2. Placeholders may contain extraction keys addressing specific values in the data of a node. For example, the placeholder <cover.source> is replaced by the corresponding value from the data view.
  3. Furthermore, placeholders used in the Base path may be defined in the parameters. The name of the placeholder is given on the left side, the value it should be replaced with on the right side. For example, if the Base path is set to <page>/feed you should define the placeholder <page>, so that it is replaced with the Object ID. See the example below.
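The three placeholder types in the list above can be sketched as a small resolver. This is a hedged illustration, not Facepager's implementation: the node layout (a dict with "objectid" and "data"), the function names, the order in which the three types are checked, and the convention of passing placeholder definitions as a dict without angle brackets are all assumptions made for this example.

```python
import re

def extract(data, key):
    """Follow a dotted extraction key such as 'cover.source' through nested dicts."""
    value = data
    for part in key.split("."):
        value = value[part]
    return value

def fill_placeholders(template, node, definitions=None):
    """Replace every <...> in template using the three placeholder types."""
    definitions = definitions or {}

    def resolve(name):
        if name == "Object ID":
            # Type 1: the Object ID of the current node.
            return str(node["objectid"])
        if name in definitions:
            # Type 3: a placeholder defined in the parameters; its value
            # may itself contain a placeholder such as <Object ID>.
            return fill_placeholders(definitions[name], node)
        # Type 2: an extraction key addressing a value in the node data.
        return str(extract(node["data"], name))

    return re.sub(r"<([^<>]+)>", lambda m: resolve(m.group(1)), template)

# Hypothetical node with made-up data.
node = {
    "objectid": "BobMarley",
    "data": {"cover": {"source": "https://example.com/cover.jpg"}},
}

print(fill_placeholders("<Object ID>/feed", node))
print(fill_placeholders("<cover.source>", node))
print(fill_placeholders("<page>/feed", node, {"page": "<Object ID>"}))
```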

Power user hint: Angle brackets (< and >) have a special meaning in Facepager. If you literally need these brackets, escape them with a backslash (\<). If you need the backslash as well, escape it with another backslash (\\<).
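One way to picture the escaping rule is a single pass that treats a backslash plus the following character as a literal and everything else in angle brackets as a placeholder. The sketch below is an assumption about how such a rule could work, not Facepager's actual parser. In particular, it assumes that a doubled backslash yields one literal backslash and leaves a following angle bracket active. The function name fill_escaped is hypothetical.

```python
import re

# Either a backslash followed by any character, or a placeholder in angle brackets.
TOKEN = re.compile(r"\\(.)|<([^<>\\]+)>", re.DOTALL)

def fill_escaped(template, values):
    """Fill placeholders while honoring backslash escapes."""
    def replace(match):
        if match.group(1) is not None:
            # Escaped character: drop the backslash, keep the character literally.
            return match.group(1)
        # Regular placeholder: look up its value.
        return values[match.group(2)]
    return TOKEN.sub(replace, template)

values = {"Object ID": "BobMarley"}

# Escaped brackets stay literal, the unescaped placeholder is replaced.
literal = fill_escaped(r"\<b\> for <Object ID>", values)

# A doubled backslash becomes one backslash; the placeholder is still replaced.
with_backslash = fill_escaped(r"\\<Object ID>", values)

print(literal)
print(with_backslash)
```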

Imagine you just added a new node "BobMarley" and you want to fetch the last 10 posts before the current date from the corresponding Facebook page. There are different equivalent ways to achieve this. One option would be to just type the URL with all needed parameters into the Resource field:

Please note: These screenshots were taken on 25 May 2018. If you try this out yourself, adapt the date.
The parameters since and until do not always work reliably. They are used here to demonstrate the meaning of placeholders. Furthermore, the APIs are changing constantly, so this approach may be outdated by the time you read this text.

While this works, it will always give you posts from the BobMarley page, even if you add other nodes to the Nodes View. The very same URL would be assembled with the following options, because the placeholder <Object ID> is replaced with the Object ID of the node under consideration:

<Object ID> is replaced by BobMarley

In contrast, the suggestions in the dropdown list of the Resource field contain different placeholders such as <page> or <user>. So if you select <page>/feed as the resource, you have to choose a value that replaces the placeholder <page>. Usually this will be <Object ID>:

<Object ID> is replaced by BobMarley. <page> is replaced by <Object ID>. So <page> is replaced by BobMarley.

The final URL generated to fetch the data for the three example settings is always "https://graph.facebook.com/v2.12/BobMarley/feed?until=2018-05-25&limit=10".
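The equivalence of the three settings can be checked with a short sketch. The settings themselves are inferred from the text, and the function assemble_url with its handling of parameters is a hypothetical simplification, not Facepager's code. It assumes that parameter names written in angle brackets define placeholders and that all other parameters end up in the query string.

```python
from urllib.parse import urlencode

def assemble_url(base_path, resource, params, object_id):
    """Combine base path, resource and parameters into one URL."""
    # Parameter names in angle brackets define placeholders; the rest are query parameters.
    definitions = {k: v for k, v in params.items() if k.startswith("<") and k.endswith(">")}
    query = {k: v for k, v in params.items() if k not in definitions}

    # Replace user-defined placeholders first, then <Object ID>.
    path = resource
    for name, value in definitions.items():
        path = path.replace(name, value)
    path = path.replace("<Object ID>", object_id)
    query = {k: v.replace("<Object ID>", object_id) for k, v in query.items()}

    url = f"{base_path}/{path}"
    if query:
        url += "?" + urlencode(query)
    return url

base = "https://graph.facebook.com/v2.12"
common = {"until": "2018-05-25", "limit": "10"}

# 1. Everything typed into the Resource field.
url_1 = assemble_url(base, "BobMarley/feed?until=2018-05-25&limit=10", {}, "BobMarley")
# 2. <Object ID> in the resource, the rest as parameters.
url_2 = assemble_url(base, "<Object ID>/feed", common, "BobMarley")
# 3. <page> in the resource, defined as <Object ID> in the parameters.
url_3 = assemble_url(base, "<page>/feed", {"<page>": "<Object ID>", **common}, "BobMarley")

print(url_1 == url_2 == url_3)
```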
