How To use Insomnia - openeduhub/oeh-search-etl GitHub Wiki
If the source you want to crawl metadata from is providing an API, your first step should be familiarizing yourself with the structure of the (meta-)data that the API offers. This how-to guide will show you basic steps on how to use the Insomnia REST Client to interact with a WordPress REST API and iterate through all learning objects that can be found at a source - in this case: material.rpi-virtuell.de.
By being able to use a REST Client, you should have a more comfortable experience while testing and debugging. Ultimately this should shorten the time needed to lay the foundations of an API-crawler by yourself.
First things first: For some tasks a browser might just be enough to help you get started with your scrapy crawler. If you want to skip this part, you can jump straight to our quick introduction to Insomnia.
Using a browser to interact with an API
While modern browsers like Firefox offer the capability to display the returned JSON object of a GET-Request as a human-readable overview instead of raw text, these browser-included developer-tools are fast, yet still kind of limited in their usability.
Interacting with a "wp-json" API - in this example the WordPress API of rpi-virtuell.de, which can be found at https://material.rpi-virtuell.de/wp-json/ - will typically look similar to this:
As you can see in the highlighted area (1) of the address bar, we're currently looking at the root of the API: it offers a number of namespaces (more on that can be found in the WordPress API documentation: Extension Discovery), but since we already know that the materials we're looking for sit behind the mymaterial/v1 namespace, accessing that URL will be our next step:
We see that the only entry in routes points us towards https://material.rpi-virtuell.de/wp-json/mymaterial/v1/material, which is also where we'll find the JSON objects we're actually looking for. With default parameters, our request returns a list of 10 JSON objects, accessible by index [0] to [9]:
Expanding one of the JSON-objects by clicking on the triangle next to its number will look something like this:
Bingo! We found the metadata we want to extract with our crawler, laid out in neat key-value pairs. This immediately leads us to the next question: how many items does the API offer in total, and how many pages do we have to iterate through?
This is where the WordPress API Handbook comes in handy: we can use the pagination parameters ?page= and ?per_page= to modify our GET requests to suit our needs. Depending on the chosen per_page value, the total number of pages we have to iterate through until we've extracted all items varies. The WordPress documentation helps us with this particular problem:
To determine how many pages of data are available, the API returns two header fields with every paginated response:

X-WP-Total: the total number of records in the collection
X-WP-TotalPages: the total number of pages encompassing all available records
Taking a look at the response Headers (1), we'll be able to spot these helpful fields (2):
If a single page holds 10 JSON objects while the total number of pages amounts to 1088, we can expect about 10,880 JSON objects once we've iterated through all available pages. Since the WordPress API allows us to request up to 100 items per page, we can cut down the number of pages we need to iterate through by a factor of ten.
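The pagination arithmetic above translates directly into Python. The following sketch uses only the standard library; the header names come from the WordPress documentation quoted above, while the helper names and the per_page default are our own:

```python
import json
import math
import urllib.parse
import urllib.request

API_URL = "https://material.rpi-virtuell.de/wp-json/mymaterial/v1/material"

def pages_needed(total_items, per_page):
    """How many paginated requests it takes to fetch every item."""
    return math.ceil(total_items / per_page)

def fetch_page(page, per_page=100):
    """Fetch one page of materials; WordPress caps per_page at 100."""
    query = urllib.parse.urlencode({"page": page, "per_page": per_page})
    with urllib.request.urlopen(f"{API_URL}?{query}") as response:
        # The pagination headers quoted from the WordPress documentation:
        total_items = int(response.headers["X-WP-Total"])
        total_pages = int(response.headers["X-WP-TotalPages"])
        return json.load(response), total_items, total_pages

print(pages_needed(10_880, 10))   # 1088 requests at the default page size
print(pages_needed(10_880, 100))  # 109 requests at the maximum page size
```

Requesting 100 items per page is the same trade-off described above: roughly ten times fewer requests for the same data.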
Now this would be the starting point for building our own crawler for this particular source. However, testing and debugging the crawler's functionality with nothing but a browser, e.g. sending and saving GET requests by copy-and-pasting different parameters into the address bar, results in a lot of unnecessary busywork that is also prone to errors.
That is where dedicated clients for REST APIs come into play. Two of the most popular clients for API development and testing are:
Postman
Insomnia
You are not limited to these two tools, though - there are more available for this use case (see: alternatives to Postman, alternatives to Insomnia via AlternativeTo.net).
Insomnia (REST client): A quick introduction
For the sake of UI simplicity, we chose Insomnia as our REST client for this task. The desktop client (available for Linux, Mac and Windows) is open-source software with a freemium payment model. After installing Insomnia, you'll be greeted by an almost empty dashboard, similar to this:
If you haven't used Insomnia before, your dashboard will contain a Design Document called Insomnia, which is there by default. Since we're not designing an API and will only be using Insomnia to make our debugging life a bit easier, we need a Request Collection instead. First, select the Create button (1) and create a new Request Collection (2).

Give your Request Collection a meaningful name, ideally the name of the source your crawler will be working with, and click Create. If you look closely, you'll find a + button next to the Filter input field inside your currently empty Request Collection.
This will open up the New Request dialog:
Giving your Request a name is not necessary, but make sure that you've selected the correct Method from the dropdown on the right-hand side. In our example, we'll be working with a GET request.
Enter the API URL, in our example https://material.rpi-virtuell.de/wp-json/mymaterial/v1/material, into the input field (1) and click Send. The returned object will be shown in the Preview panel on the right-hand side of the Insomnia user interface. The following image shows the same GET request displayed in Firefox and Insomnia for comparison:
In the middle pane you'll spot a Query tab - the time-saving feature we'll be using to make our life easier while building a crawler that works with an API.
Modifying (and saving) query-parameters with Insomnia
Once you've made your first query by pressing the Send button in the middle pane, you might want to adjust or modify some query parameters. Click on the Query tab to bring the query interface to the foreground:
The Query interface is divided into two columns: on the left-hand side you can adjust the parameters of your GET request, and on the right-hand side you can set values for your chosen parameters. Using the [ ] checkboxes, you can quickly toggle a specific parameter on and off to see the impact it has on the returned object.
This will be immensely useful whenever you need to validate the metadata gathered by your crawler and want to quickly compare the source material with whatever your scrapy.Spider extracted from a website. On the right pane you'll also find the Header tab, which will come in handy if your crawler works with response headers.
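The X-WP-TotalPages value visible in that Header tab is exactly what a crawler needs in order to enumerate every page of the API. A minimal sketch of that enumeration (the helper name and the per_page default of 100 are our own choices, not part of the API):

```python
from urllib.parse import urlencode

API_URL = "https://material.rpi-virtuell.de/wp-json/mymaterial/v1/material"

def page_urls(total_pages, per_page=100):
    """Yield one request URL per page - the same list of requests a
    scrapy.Spider would schedule after reading X-WP-TotalPages."""
    for page in range(1, total_pages + 1):
        yield f"{API_URL}?{urlencode({'page': page, 'per_page': per_page})}"

urls = list(page_urls(3))
print(urls[0])
# https://material.rpi-virtuell.de/wp-json/mymaterial/v1/material?page=1&per_page=100
```

Each generated URL can be pasted straight into Insomnia's address bar to compare what the crawler would receive with what you see in the client.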
Using JSONPath and Response filters for better overviews
While the last chapter focused on the query interface, there's a really useful feature almost hidden in the bottom right corner of the Insomnia user interface: filtering the response with JSONPath expressions, which makes working with API responses a breeze.
In our example API at https://material.rpi-virtuell.de/wp-json/mymaterial/v1/material/, a GET request returns a list of 10 JSON objects (accessible by index [0] to [9]). If we only wanted to take a look at the fifth item in the list, we could filter the response with $[4] and examine that JSON object more closely.
Now if we wanted an overview of all the keywords returned on the current page, $[*].material_schlagworte would show us the name and term_id of each element, including their values.
Say we needed an overview of only the keywords: we could use the JSONPath expression $[*].material_schlagworte[*].name to display exactly that. This mighty tool comes in extremely handy for debugging your crawler while analyzing APIs and their (sometimes unexpected) responses.
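To double-check what a JSONPath filter will select, it can help to spell out the equivalent plain Python. The sample data below is a made-up, trimmed-down stand-in for the API response - only the material_schlagworte structure mirrors the real objects, the titles and keywords are invented:

```python
# Hypothetical stand-in for one page of the API response; the real
# response holds 10 objects per page with many more fields.
page = [
    {"material_titel": "Example A",
     "material_schlagworte": [{"term_id": 1, "name": "Advent"},
                              {"term_id": 2, "name": "Bibel"}]},
    {"material_titel": "Example B",
     "material_schlagworte": [{"term_id": 3, "name": "Gebet"}]},
]

# $[1] -> the second item in the list
second_item = page[1]

# $[*].material_schlagworte -> every keyword object (name and term_id)
keyword_objects = [obj for item in page
                   for obj in item["material_schlagworte"]]

# $[*].material_schlagworte[*].name -> only the keyword names
keyword_names = [kw["name"] for item in page
                 for kw in item["material_schlagworte"]]
print(keyword_names)  # ['Advent', 'Bibel', 'Gebet']
```

The same flattening logic reappears later in the crawler's parse methods, so testing the expression in Insomnia first saves a debugging round-trip.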