URLs, Placeholders, Nodes and Keys - strohne/Facepager GitHub Wiki
Facepager helps you to extract data from the web by interacting with APIs, webscraping or downloading (and uploading) media files. In all three cases, URLs are used to access the data.
The underlying principle for data extraction in Facepager, detailed below, is as follows:
- A resource on the web containing the data is addressed by a URL. Facepager assists in assembling the URL.
- Parts of the URL can be replaced with placeholders. This simplifies assembling URLs for a large number of resources.
- The placeholders are filled in Facepager with data from the query setup, nodes and keys.
- For uploading files instead of extracting data, a special kind of placeholder is used to access the content of a file.
URL is the short form of "Uniform Resource Locator". It refers to the unique address of a resource on the web – like an HTML webpage, an image or a video. Every URL consists of individual components that follow a certain structure:
In Facepager, the URLs are assembled from data in the Query Setup:
Name | Function |
---|---|
Protocol | The protocol determines how the data of webpages is transferred via the Internet. HTTP is short for "Hypertext Transfer Protocol" and, together with its secure version HTTPS, is the standard protocol across the Internet. It is used to display the content of a webpage in the browser. Another protocol is FTP, which stands for File Transfer Protocol. FTP is used to transfer files, e.g. from your computer to the web. |
Domain | Domains are used to refer to the webserver hosting the website. A webserver can also be identified by an IP address, though using a domain name is easier than typing a sequence of numbers. Each domain consists of several parts. In this example, "www" is the subdomain, "youtube.com" is the domain and ".com" is the top-level domain. |
Base path | In Facepager, you can find the field "basepath" in the Query Setup. The basepath is assembled from the protocol and the domain. It can also contain parts of the path, e.g. when you work with a specific version of an API. Data is usually requested using different queries but from the same webserver. You can put the common part of the URLs into the basepath field. For example, https://www.youtube.com is the basepath, if you interact with YouTube or https://www.googleapis.com/youtube/v3 is the basepath for interacting with Version 3 of the YouTube API. |
Path to Resource | The path refers to a unique resource, e.g. a file or folder, on the webserver. The path comes after the domain and is appended with a "/". Often, paths mirror the underlying structure of websites. When using APIs, paths are sometimes called endpoints: an endpoint is the entry point to a service. |
Parameters | Additional information is often transported via parameters. Parameters are indicated with a "?" and follow the structure "key=value". A list of parameters is separated with a "&" between each key-value-pair. In the example above the URL refers to a YouTube video, whereby the value "dbTREHtu1O0" (the ID of a video) is assigned to the key "v" (which stands for the resource video). |
Did you know? When interacting with webservers, for example to extract data, additional information is required besides the URL. This can be authentication data or details of the browser. This additional information is contained in the so-called headers. Facepager automatically compiles the headers and sends them with the request. Learn more about the headers and how to find them in a browser on the page about the Generic Module.
The structure of URLs is (more or less) uniform, so the individual components can be exchanged. In Facepager, these components can be replaced with placeholders.
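The components from the table can be pulled apart with Python's standard library. This is only an illustrative sketch: the full URL is an assumption, reconstructed from the domain and the video ID mentioned in the table, with the usual `/watch` path of YouTube video links.

```python
from urllib.parse import urlsplit, parse_qs

# Assumed example URL (reconstructed, not taken verbatim from this page).
url = "https://www.youtube.com/watch?v=dbTREHtu1O0"

parts = urlsplit(url)
protocol = parts.scheme             # the protocol, here HTTPS
domain = parts.netloc               # subdomain + domain + top-level domain
path = parts.path                   # path to the resource
parameters = parse_qs(parts.query)  # key-value pairs after the "?"

print(protocol, domain, path, parameters)
```

Because every URL follows this structure, each of the four parts can be swapped independently, which is what placeholders make use of.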
Instead of writing out full URLs every time, you can use placeholders. They are very helpful because they let you use the data of your nodes, so you don't need to manually formulate a URL for every single node. In the Query Setup, every text in angle brackets (like `<Object ID>`) is handled as a placeholder. You may use placeholders in the Base path field, in the Resource field and in the parameter values.
For example, the URL `https://twitter.com/TheAcademy/following` links to all the accounts followed by TheAcademy on Twitter. Similarly, the URL `https://twitter.com/HBO/following` links to all the accounts followed by HBO. To get this list for several accounts, the path part of the URL referring to the Twitter account can be replaced by a placeholder: `https://twitter.com/<Object ID>/following`. When you add TheAcademy and HBO as seed nodes, Facepager automatically assembles the URLs for both accounts, as if you had typed them in manually. The placeholder `<Object ID>` always refers to the first column in the Nodes View.
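The substitution described above can be sketched in a few lines of Python. This is not Facepager's own code; the function name `fill_object_id` is hypothetical and only illustrates the idea of one URL template being filled once per seed node.

```python
def fill_object_id(template: str, object_id: str) -> str:
    """Replace every <Object ID> placeholder with the given Object ID."""
    return template.replace("<Object ID>", object_id)


template = "https://twitter.com/<Object ID>/following"
seed_nodes = ["TheAcademy", "HBO"]

# One assembled URL per seed node.
urls = [fill_object_id(template, node) for node in seed_nodes]
for url in urls:
    print(url)
```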
There are three types of placeholders:
- The placeholder `<Object ID>` is always replaced by the Object ID of the node Facepager is currently fetching data for. The Object ID is given in the first column of the Nodes View.
- Placeholders may contain keys addressing specific values in the data of a node. For example, the placeholder `<cover.source>` is replaced by the corresponding value. You can look up the value in the Data View.
- Furthermore, placeholders used in the Base path may be defined in the parameters. The name of the placeholder is given on the left side, the value it should be replaced with on the right side. For example, if the Base path is set to `<page>/feed`, you should define the placeholder `<page>` so that it is replaced with the Object ID. See the example below.
Power user hint: Angle brackets (`<` and `>`) have a special meaning in Facepager. If you literally need these brackets, escape them with a backslash (`\<`). If you need the backslash as well, escape it with another backslash (`\\<`).
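To illustrate how escaped brackets can be told apart from placeholders, here is a minimal sketch. It is not Facepager's actual parser, and it only covers the single-backslash case (`\<` and `\>`); the function name `fill` and the example template are made up for illustration.

```python
import re

# A placeholder starts with "<" and ends with ">", neither preceded by a backslash.
PLACEHOLDER = re.compile(r"(?<!\\)<([^<>]+?)(?<!\\)>")


def fill(template: str, values: dict) -> str:
    # Substitute known placeholders; unknown ones are left untouched.
    filled = PLACEHOLDER.sub(lambda m: values.get(m.group(1), m.group(0)), template)
    # Escaped brackets become literal brackets in the result.
    return filled.replace("\\<", "<").replace("\\>", ">")


result = fill(r"<Object ID>/search?q=\<b\>", {"Object ID": "BobMarley"})
print(result)
```

The negative lookbehind `(?<!\\)` is what keeps `\<b\>` from being read as a placeholder named `b`.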
Imagine you just added a new node "BobMarley" and you want to fetch the last 10 posts before the current date from the corresponding Facebook page. There are different equivalent ways to achieve this. One option would be to just type the URL with all needed parameters into the Resource field:
Please note: These screenshots were taken on 25 May 2018. If you try this out yourself, adapt the date. The parameters `since` and `until` do not always work reliably. They are used here to demonstrate the meaning of placeholders. Furthermore, the APIs are changing constantly, so this approach may be outdated by the time you read this text.
While this works, it will always give you posts from the BobMarley page, even if you add other nodes to the Nodes View. The very same URL is assembled with the following options, because the placeholder `<Object ID>` is replaced with the Object ID of the node under consideration:
`<Object ID>` is replaced by BobMarley.
In contrast, the suggestions in the dropdown list of the Resource field contain different placeholders such as `<page>` or `<user>`. So if you select `<page>/feed` as the resource, you have to choose a value that replaces the placeholder `<page>`. Usually this will be `<Object ID>`:
`<Object ID>` is replaced by BobMarley. `<page>` is replaced by `<Object ID>`. So `<page>` is replaced by BobMarley.
The final URL generated to fetch the data for the three example settings is always "https://graph.facebook.com/v2.12/BobMarley/feed?until=2018-05-25&limit=10".
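The three equivalent settings can be mimicked with a small sketch. The exact field contents of the three setups are assumptions (the screenshots are not reproduced here), and `build_url` is a hypothetical helper, not Facepager's implementation. Parameters whose name is written in angle brackets are treated as placeholder definitions; all others end up in the query string.

```python
from urllib.parse import urlencode


def build_url(base_path: str, resource: str, parameters: dict, object_id: str) -> str:
    definitions = {}  # placeholder name -> replacement value
    query = {}        # ordinary URL parameters
    for name, value in parameters.items():
        value = value.replace("<Object ID>", object_id)
        if name.startswith("<") and name.endswith(">"):
            definitions[name] = value
        else:
            query[name] = value

    url = f"{base_path}/{resource}".replace("<Object ID>", object_id)
    for placeholder, value in definitions.items():
        url = url.replace(placeholder, value)
    return f"{url}?{urlencode(query)}" if query else url


base = "https://graph.facebook.com/v2.12"
dates = {"until": "2018-05-25", "limit": "10"}

# 1. Everything typed into the Resource field.
url_1 = build_url(base, "BobMarley/feed?until=2018-05-25&limit=10", {}, "BobMarley")
# 2. <Object ID> used directly in the resource.
url_2 = build_url(base, "<Object ID>/feed", dates, "BobMarley")
# 3. <page> in the resource, defined as <Object ID> in the parameters.
url_3 = build_url(base, "<page>/feed", {"<page>": "<Object ID>", **dates}, "BobMarley")

print(url_1, url_2, url_3, sep="\n")
```

Only the second and third variants adapt automatically when a node other than BobMarley is processed.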
Nodes are the objects of the data collection. This can be any object returned by an API, such as Facebook posts, tweets or YouTube comments. When doing webscraping, nodes can be links or media files. Usually one object corresponds to one row. Here the terms row, object and node all refer to the same concept. We apologize for being somewhat unclear.
In Facepager, there are four different kinds of nodes. You can see the type of a node in the `Object Type` column:
Object type | Explanation |
---|---|
seed | Seed nodes were manually added by you using the Add nodes button. |
data | The data returned from the API or from webscraping is sliced into single data nodes. For example, a data node is created for each tweet. |
offcut | The remaining part of the returned data, after cutting out data nodes, is put into an offcut node. Here you find, for example, data about the pagination. If you fetch multiple pages using the Maximum pages setting, one offcut node is created for each requested page. |
unpacked | You can slice your data later using the Extract data function. The created nodes have the object type "unpacked". |
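How a response is cut into data nodes and an offcut node can be pictured roughly as follows. This is a simplified sketch under assumptions: the function `slice_response`, its `list_key` argument, the dictionary layout of the nodes and the example response are all hypothetical.

```python
def slice_response(response: dict, list_key: str):
    """Split one response into data nodes and a single offcut node."""
    items = response.get(list_key, [])
    # Each item of the list becomes its own data node.
    data_nodes = [{"objecttype": "data", "response": item} for item in items]
    # Everything that remains (e.g. pagination info) goes into the offcut node.
    rest = {key: value for key, value in response.items() if key != list_key}
    offcut_node = {"objecttype": "offcut", "response": rest}
    return data_nodes, offcut_node


# Made-up response with two items and pagination data.
response = {
    "data": [{"id": "1"}, {"id": "2"}],
    "paging": {"next": "https://example.com/page2"},
}
data_nodes, offcut_node = slice_response(response, "data")
print(len(data_nodes), offcut_node["response"])
```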
For example, if you want to collect the list of Twitter followers of the accounts "TheAcademy", "HBO" and "goldenglobes", these accounts are your nodes. You can add starting nodes (also called seed nodes) by clicking `Add Nodes` in the Menu Bar.
Objects may contain other objects (they may be nested). For example, a Twitter account such as TheAcademy has followers. When you fetch the followers for this account, one node for every follower is automatically inserted under the node "TheAcademy". In this example, the IDs of the accounts are contained in the Object ID column. These new nodes are the "child" nodes of your previously added "parent" nodes. Depending on your operating system, you'll notice an arrow or plus sign beside the objects. Clicking it unfolds the object and shows the subordinate "child" objects. Of course, objects may have multiple levels of relationships. Manually added seed nodes are on the first node level, while the child nodes are on the second or deeper levels.
You can easily fetch data for multiple nodes on deeper levels without selecting every single node. First, click on the ancestor or parent node, no matter on which level. Second, increase the `Node level` in the general settings section. If you, for example, want to fetch the followers of TheAcademy's, HBO's and goldenglobes' followers, you aim at the child nodes on node level 2. Select their common parent node, set the node level to "2" and fetch the data. As a result, new nodes (the followers' followers) are automatically inserted as new rows on node level 3.
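The effect of the node level setting can be sketched as a walk through the node tree. The sketch assumes that the level is counted from the selected node (level 1 is the selected node itself); the tree structure, the follower IDs and the function `nodes_at_level` are hypothetical.

```python
def nodes_at_level(selected: dict, level: int) -> list:
    """Collect the nodes that sit `level` levels deep, counting the selected node as level 1."""
    if level == 1:
        return [selected]
    found = []
    for child in selected.get("children", []):
        found.extend(nodes_at_level(child, level - 1))
    return found


# Made-up tree: one seed node with two followers, one of which has a follower of its own.
tree = {
    "id": "TheAcademy",
    "children": [
        {"id": "follower_a", "children": [{"id": "follower_a1"}]},
        {"id": "follower_b"},
    ],
}

level_2 = [node["id"] for node in nodes_at_level(tree, 2)]
level_3 = [node["id"] for node in nodes_at_level(tree, 3)]
print(level_2, level_3)
```

Selecting the seed node with node level 2 thus targets all of its children at once, without clicking each one.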
Another useful concept of Facepager is addressing node data by keys. With keys you pull out data from the nodes. You can use this concept in placeholders or to define the columns of the Nodes View. The starting point is the data shown in the Data View. These data are formatted as JSON, which follows a quite simple logic of key-value pairs. To get the value on the right side, you use the key on the left. Data may be arranged as a nested hierarchy. Nested key-value pairs are addressed by chaining the keys separated by a dot, e.g. `comments.data`.
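Chained keys and their use in placeholders can be illustrated with a short sketch. The node data, the functions `get_by_key` and `fill_from_keys`, and the example URL are invented for illustration and do not reflect Facepager's internals.

```python
import json
import re


def get_by_key(data: dict, key: str):
    """Follow a dotted key such as "cover.source" through nested key-value pairs."""
    value = data
    for part in key.split("."):
        value = value[part]
    return value


def fill_from_keys(template: str, data: dict) -> str:
    """Replace each <key> placeholder with the value found under that key."""
    return re.sub(r"<([^<>]+)>", lambda m: str(get_by_key(data, m.group(1))), template)


# Made-up node data as it might appear in the Data View.
node_data = json.loads('{"id": "123", "cover": {"source": "https://example.com/cover.jpg"}}')

print(get_by_key(node_data, "cover.source"))
print(fill_from_keys("<cover.source>", node_data))
```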
To quickly get a specific key, you can select it in the Data View and click `Add Column`. This will add the corresponding key to the Custom Table Columns (Column Setup) field right below the Data View. You can click `Add All Columns` to add keys for all values at once. Nested data will be output as a JSON string containing all the data. For example, the key `comments.data` gives you all items pasted together:
To get a single value, you can use a key addressing deeper values. For example, the key `comments.data.0.message` will give you the message content of the first comment only. For addressing multiple values, you may use the asterisk operator `*`. Replace a key with the asterisk to address all keys on the same level. All values will be concatenated with semicolons. While `comments.data.0.message` will only address the first comment, `comments.data.*.message` will give you the messages of all comments, separated by semicolons.
This works the same way for other fields, e.g. `comments.data.*.created_time`.
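The behaviour of indexes and the asterisk operator can be approximated with the following sketch. It is an illustration only: the function `extract` and the comment data are made up, and Facepager's real output formatting may differ in detail.

```python
import json


def extract(data, key: str) -> str:
    """Resolve a dotted key with list indexes and the * wildcard."""
    values = [data]
    for part in key.split("."):
        next_values = []
        for value in values:
            if part == "*":
                # Wildcard: address every entry on this level.
                children = value if isinstance(value, list) else list(value.values())
                next_values.extend(children)
            elif isinstance(value, list):
                next_values.append(value[int(part)])  # numeric index into a list
            else:
                next_values.append(value[part])       # ordinary key
        values = next_values
    # Plain strings are joined with semicolons; nested data is output as JSON.
    return ";".join(v if isinstance(v, str) else json.dumps(v) for v in values)


# Made-up node data with two comments.
node = {"comments": {"data": [
    {"message": "Great song", "created_time": "2018-05-24"},
    {"message": "One love", "created_time": "2018-05-25"},
]}}

first = extract(node, "comments.data.0.message")
all_messages = extract(node, "comments.data.*.message")
nested = extract(node, "comments.data")
print(first, all_messages, nested, sep="\n")
```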
Remember that only columns defined in the Column Setup are exported by Facepager. So keys relate to columns in the resulting Excel sheet, while values are the row values in a single column.
A special kind of placeholder is used for working with file contents: `<Object ID|file>`. By using the pipe operator `|` in conjunction with the `file` modifier, the value of the placeholder is interpreted as a file name. You can use other keys instead of `Object ID`. The placeholder is then replaced by the contents of the file. The file name is relative to the folder specified in the input field below the payload field (only visible in the Generic and Files modules when the method is set to POST).
To upload the file, insert the placeholder into the payload field. If you need to upload files with base64 encoding, feed the contents to the `base64` modifier using another pipe: `<Object ID|file|base64>`.
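The chain of modifiers can be thought of as a small pipeline in which each step transforms the result of the previous one. The sketch below is an assumption about the general idea, not Facepager's code; the function `resolve`, the file name and the file contents are hypothetical.

```python
import base64
import tempfile
from pathlib import Path


def resolve(placeholder: str, node: dict, folder: Path) -> bytes:
    """Resolve a placeholder such as "Object ID|file|base64" (written without angle brackets)."""
    key, *modifiers = placeholder.split("|")
    value = str(node[key]).encode("utf-8")
    for modifier in modifiers:
        if modifier == "file":
            # Interpret the current value as a file name relative to the folder.
            value = (folder / value.decode("utf-8")).read_bytes()
        elif modifier == "base64":
            value = base64.b64encode(value)
        else:
            raise ValueError(f"Unknown modifier: {modifier}")
    return value


# Demonstration with a temporary folder and a made-up file.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "photo.txt").write_bytes(b"hello")
    node = {"Object ID": "photo.txt"}
    raw = resolve("Object ID|file", node, folder)
    encoded = resolve("Object ID|file|base64", node, folder)

print(raw, encoded)
```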
See the presets coming with Facepager for an example or read the Getting Started with Google Cloud Platform.