Getting Started with KnowledgeGraph - strohne/Facepager GitHub Wiki

[This file is currently work in progress]

This Getting Started demonstrates how to fetch data from the Culture Knowledge Graph, how to collect metadata from the German National Library, and how to combine and visualise the data in a basic network graph.

The Culture Knowledge Graph connects research data produced within the NFDI4Culture research landscape, improving the findability, accessibility, interoperability, and reusability of cultural heritage data. It brings together data from the Culture Research Information Graph and the Research Data Graph (for more information see this presentation). Data is provided as Linked Open Data and in part accessible via the NFDI4Culture Metadata RESTful API, but mainly via its SPARQL endpoint. The latter can be queried using the NFDIcore ontology as well as the NFDI4Culture ontology (cto) (which increasingly replaces the Culture Graph Interchange Format (CGIF)) in combination with further community standard ontologies.

From a content perspective, we will look at the letters of Ferdinand Gregorovius published by the German Historical Institute in Rome. The letters unveil Gregorovius' connections to his scientific peers in 19th century Europe and thus allow for a unique peek into the history of European science. The goal here will be to visualise Gregorovius' network by year. Of course, the preserved corpus serves only as an exemplary dataset to showcase one potential workflow that includes querying the Culture Knowledge Graph from within Facepager and linking the results with information from further databases.

To follow along, you will need to install the following software:

  • Facepager (Parts 1 and 2)
  • R and RStudio (Part 3)
  • Gephi (Part 4)

Depending on your previous experience with the software, learning about Facepager's basic concepts will certainly make it easier to follow.

DISCLAIMER: The Culture Knowledge Graph is a work in progress. Currently, only a handful of datasets are accessible through its SPARQL endpoint using the NFDI4Culture Ontology. For an up-to-date overview of the data feeds already integrated, please see the Culture Knowledge Graph's public online dashboard. A user-friendly SPARQL Endpoint Explorer is planned as well. Until it is published, a query containing only the URL of a resource will return a table describing all available metadata (see this example). The user policy and guidelines of the Culture Knowledge Graph are also still being worked on. For the time being, we advise you to be mindful of potential query limits.

Part 1: Preparing the SPARQL query

To understand exactly what data we will fetch later on, it helps to take a look at the SPARQL query that allows us to explore Ferdinand Gregorovius' connections to his scientific peers in 19th century Europe. The information we are looking for is stored in the Culture Knowledge Graph provided by NFDI4Culture. You can run the following query yourself at NFDI4Culture's dedicated public SPARQL endpoint. However, you will at least want to read through Part 2 of this Getting Started to better understand the goal of this query. If you haven't yet learned the basics of SPARQL, it might be best to check out our Getting Started with SPARQL before trying to wrap your head around the weird-looking syntax below. If you are not interested in SPARQL and only want to understand more about how to apply presets in Facepager, you can (but shouldn't ;) ) skip to Part 2.

# Start by defining all prefixes needed to formulate the query
PREFIX cto: <https://nfdi4culture.de/ontology#>
PREFIX schema: <http://schema.org/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX n4c: <https://nfdi4culture.de/id/>

# Select the variables of interest
SELECT ?letter ?letterLabel ?date ?gnd_id
WHERE {

# Fetches all items from the 'Letters from the Digital Edition Ferdinand Gregorovius'
# dataset. All items are letters.
n4c:E5378 schema:dataFeedElement/schema:item ?letter .

# For each letter retrieve its label..
?letter rdfs:label ?letterLabel ;
		# ..its creation date..
		cto:creationDate ?date ;
		# ..and the URLs pointing to the GND entry of all persons mentioned within.
		cto:gnd ?gnd_url .

# Only return letters written in 1851. 
# In the Facepager version of this query "1851" will be replaced
# by the standard seed node place holder "<Object ID>".
FILTER(YEAR(?date) = 1851).

# Extracting the GND IDs from the GND URLs for further processing 
BIND(REPLACE(STR(?gnd_url), ".*://.*/(.*?)", "$1") AS ?gnd_id)

}
  • PREFIXES: The data we want to retrieve hides behind a number of different vocabularies. Note how the Culture Knowledge Graph is based on its own namespaces cto and n4c alongside common vocabularies. If you are wondering how the vocabulary of a particular ontology is structured, it is usually worth looking directly through the documentation of a namespace. To understand which ontologies a graph database is based on in the first place, the documentation of the graph helps. Albeit unrelated to this query, the graph description provided by the team behind SemOpenAlex is a comprehensive, beginner-friendly example of this.
  • SELECT: We are interested in all letters from a specific year (?letter), each letter's title (?letterLabel) and date (?date), and the GND IDs of all persons mentioned in the letter (?gnd_id).
  • WHERE: To make the context of the individual triples easier to understand, we have commented on the query line by line. Comments are indicated by #.
  • OPTIONAL: Not used in the query above, but worth knowing: OPTIONAL allows patterns that do not necessarily have to be present in the graph. This is especially helpful if you query for data that is not guaranteed to be available.
  • FILTER: The filter is at the heart of our query. We use it to fetch all letters from a specific year, in this case 1851.
  • ADVANCED FUNCTIONS: The last line of the query uses the BIND and REPLACE functions as well as a regular expression to extract a specific part of the URL stored in ?gnd_url, namely the GND ID, and assign it to a new variable, ?gnd_id. That is done because we are not interested in the hyperlink (https://d-nb.info/gnd/118541951) but in the ID stored within (118541951).
  • SORTING PARAMETERS: We did not limit or sort the list of results, because we are interested in all results and will sort them later on. We do know, however, that the number of results will stay within reasonable bounds and won't occupy the endpoint's full capacity.
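To see what the BIND/REPLACE line does to a single value, the same regular expression can be tried outside of SPARQL. The following is a minimal Python sketch and not part of the Facepager workflow. The function name extract_gnd_id is made up for illustration, and Python's re module only stands in for the regex engine of the SPARQL endpoint, which may behave differently in edge cases.

```python
import re

# Same pattern as in the SPARQL query: everything up to and including
# the last slash is matched and replaced by the (empty) capture group,
# so only the trailing ID remains.
GND_PATTERN = re.compile(r".*://.*/(.*?)")


def extract_gnd_id(gnd_url: str) -> str:
    """Return the part of a GND URL that follows the last slash."""
    return GND_PATTERN.sub(r"\1", gnd_url)


print(extract_gnd_id("https://d-nb.info/gnd/118541951"))  # prints 118541951
```

Note that the capture group itself stays empty. The ID survives because it is the only part of the URL that the pattern does not consume.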

That's it! This query returns all the information we need to continue our quest to visualise the social network of Ferdinand Gregorovius over the years. However, as you might already know from our Getting Started with SPARQL, issuing a SPARQL query using Facepager requires some minor adjustments because the software treats text enclosed in angle brackets (arrow heads) as placeholders (the standard one being <Object ID>). Wherever the query uses angle brackets literally, for example in the namespace section, we must therefore escape them with a backslash \. Here is the adjusted query:

PREFIX cto: \<https://nfdi4culture.de/ontology#\>
PREFIX schema: \<http://schema.org/\>
PREFIX rdf: \<http://www.w3.org/1999/02/22-rdf-syntax-ns#\>
PREFIX rdfs: \<http://www.w3.org/2000/01/rdf-schema#\>
PREFIX n4c: \<https://nfdi4culture.de/id/\>

SELECT ?letter ?letterLabel ?date ?gnd_id
WHERE {

n4c:E5378 schema:dataFeedElement/schema:item ?letter .

?letter rdfs:label ?letterLabel ;
		cto:creationDate ?date ;
		cto:gnd ?gnd_url .

# Note how the year 1851 was replaced by the standard placeholder <Object ID>.
# It will later be replaced again by the seed nodes defined in Facepager,
# allowing you to automatically run the query for several years at once.
FILTER(YEAR(?date) = <Object ID>).

BIND(REPLACE(STR(?gnd_url), ".*://.*/(.*?)", "$1") AS ?gnd_id)

}
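The interplay of escaped angle brackets and the <Object ID> placeholder can be illustrated with a few lines of Python. This is a sketch of the idea only and not Facepager's actual implementation. The function fill_template and its regular expression are assumptions made for illustration.

```python
import re

# Matches <Name> only if the opening bracket is not escaped with a backslash.
PLACEHOLDER = re.compile(r"(?<!\\)<([^<>\\]+)>")


def fill_template(template: str, values: dict) -> str:
    """Fill unescaped <placeholders>, then turn \\< and \\> into plain brackets."""

    def substitute(match: re.Match) -> str:
        # Unknown placeholders are left untouched.
        return values.get(match.group(1), match.group(0))

    filled = PLACEHOLDER.sub(substitute, template)
    return filled.replace("\\<", "<").replace("\\>", ">")


# Two lines taken from the adjusted query above.
template = (
    "PREFIX n4c: \\<https://nfdi4culture.de/id/\\>\n"
    "FILTER(YEAR(?date) = <Object ID>)"
)

# One filled-in query per seed node (year).
for year in ["1851", "1853"]:
    print(fill_template(template, {"Object ID": year}))
```

Each seed node thus produces its own copy of the query, with the escaped brackets of the PREFIX lines restored and the year filled into the filter.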

Part 2: Get GND IDs of Ferdinand Gregorovius' addressees by years

Let's now (finally) turn to Facepager. We begin by fetching detailed information about the letters of Ferdinand Gregorovius from one or more specific years stored in the Culture Knowledge Graph. The benefit of using Facepager to issue the query instead of the online SPARQL endpoint will reveal itself once we want to complement our initial fetch with data from other databases. Let's dive in:

  1. Create a database: Click New Database in the Menu Bar of Facepager to create a blank database. Save it in a directory of your choice.
  2. Set up the Generic module: From the Presets tab in the Menu Bar select and Apply the Knowledge Graph preset "1 Get GND IDs of Ferdinand Gregorovius' letter addressees". The Generic module in the Query Setup will refresh instantly. Notice that the base and resource paths are now set to call the SPARQL endpoint of NFDI4Culture. Additionally, two Headers have been set. Accept specifies the data format returned by our query. We want the results to be returned in JSON format. Content-Type announces that the request body contains an unencoded SPARQL query string. As we will query via POST directly, it is crucial to set the Method to POST. For a more detailed explanation of all adjustments, check the SPARQL chapter of the Triply API documentation. At the heart of our data collection lies the SPARQL query that we developed in Part 1 of this tutorial. When the preset is applied, the prepared query loads directly into the Payload box. Note again that it has been adapted for Facepager by escaping all literal angle brackets (arrow heads).

  3. Add nodes: Before fetching data, you will need to provide one or more seed nodes which will fill in the placeholder mentioned above during the actual request. To do so, select Add Nodes in the Menu Bar. In the dialogue box that opens, enter one or more years (e.g. 1851, 1853, etc.). If you have a closer look at the SPARQL query (see Part 1), you will find that your seed nodes will fill in a placeholder (<Object ID>) within a filter restricting the results to the years specified. Include as many nodes (years) as are of interest to you.

  4. Fetch data: Select one or more seed nodes, then hit Fetch Data at the bottom of the Query Setup. Facepager will now fetch data based on your setup. Once finished, you can inspect the data by expanding your seed node or clicking Expand nodes in the Menu Bar. For more detail, select a child node and review the raw data displayed in the Data View to the right. The information here can be a little messy. To get a concise overview of only the data interesting to you, play around with the Column Setup to define what information will be displayed in the Nodes View.
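For readers who want to see what such a request looks like outside of Facepager, here is a minimal Python sketch using only the standard library. It is not what Facepager runs internally. The endpoint URL and the exact header values are assumptions (the header values follow the SPARQL 1.1 Protocol for "POST directly"), so compare them with the values shown in the preset. The sample response is made up and only mimics the standard SPARQL JSON results layout.

```python
import json
import urllib.request

# Assumption: the endpoint URL is not stated in this tutorial; replace it
# with the base and resource path shown in the preset.
ENDPOINT = "https://nfdi4culture.de/sparql"


def build_sparql_request(query: str, endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Prepare a 'POST directly' request carrying the raw SPARQL query."""
    return urllib.request.Request(
        endpoint,
        data=query.encode("utf-8"),
        headers={
            "Accept": "application/sparql-results+json",  # ask for JSON results
            "Content-Type": "application/sparql-query",   # body is an unencoded query
        },
        method="POST",
    )


def flatten_bindings(result: dict) -> list:
    """Turn SPARQL JSON results into a list of {variable: value} rows."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in result["results"]["bindings"]
    ]


def run_query(query: str) -> list:
    """Send the request and return the flattened rows (needs network access)."""
    with urllib.request.urlopen(build_sparql_request(query), timeout=30) as response:
        return flatten_bindings(json.load(response))


# Made-up response in the standard SPARQL JSON results layout.
sample = {
    "head": {"vars": ["letterLabel", "gnd_id"]},
    "results": {"bindings": [
        {"letterLabel": {"type": "literal", "value": "Example letter"},
         "gnd_id": {"type": "literal", "value": "118541951"}},
    ]},
}
print(flatten_bindings(sample))
```

In the JSON results, every variable is wrapped in an object with a value key. This is presumably why the exported columns used in the R script of Part 3 carry names such as date.value and letterLabel.value.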

Because we are interested in the addressees of Ferdinand Gregorovius, what we want to be displayed is the date of any given letter from the specified year, the letter's label, and the names of all persons addressed or mentioned by Gregorovius. However, these names are glaringly missing. They simply are not contained in the data we fetched. Instead, what we do find are the GND IDs of Gregorovius' relations. A GND ID is an identifier from the "Gemeinsame Normdatei" (Integrated Authority File). It is used primarily to identify entities unambiguously in library catalogues and helps to manage and link bibliographic records in Germany. Fortunately for us, the German National Library provides Entity Facts sheets which we can use to quickly translate the GND IDs stored in the Culture Knowledge Graph by applying a second preset. This is where Facepager's strong suit comes into play.

  5. Apply second preset: From the Presets tab in the Menu Bar select and Apply the Knowledge Graph preset "2 Translate GND ID to a person's name". The Generic module in the Query Setup will refresh instantly. Now, the base and resource paths are set to call the Entity Facts API of the German National Library (DNB). Further, a new placeholder is introduced as a parameter. The key parameter specifies what information will be returned. The preset at hand only asks for the preferredName associated with a GND ID. See this overview to get an idea of the data provided by the Entity Facts API. Please mind the German National Library's [Terms of Use](https://www.dnb.de/EN/Professionell/Metadatendienste/Datenbezug/geschaeftsmodell.html).

  6. Add nodes: Before fetching data, you would usually have to provide one or more seed nodes which would then fill in the placeholder mentioned above during the actual request. This time, however, nothing has to be added manually. Notice that the Object IDs of all child nodes from our first data collection already match the GND IDs we are interested in.

  7. Fetch data (again): Simply select all child nodes and hit Fetch Data again. Facepager will fetch data based on your setup once more. Once finished, you can inspect the data by expanding your child nodes or clicking Expand nodes in the Menu Bar. For more detail, select one of the new child nodes and review the raw data displayed in the Data View to the right. Again, the information here can be overwhelming. For now, we are only interested in the preferred name of a person. The preset has already adjusted the Column Setup accordingly. You have now successfully translated the GND IDs into more meaningful names. Our dataset is complete.
  8. Export data: Expand all nodes and select the ones you want to export. Hit Export Data to get a CSV file. Notice the options provided by the export dialogue. You can open CSV files with Excel or any statistics software you like.
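The lookup performed by the second preset can be sketched in Python as well. The base URL below is an assumption about the Entity Facts API and should be checked against the base path shown in the preset. The helper names and the sample record are made up for illustration, and only the preferredName field mentioned above is read.

```python
import json
import urllib.parse
import urllib.request

# Assumption: base URL of the Entity Facts API; compare with the preset.
ENTITY_FACTS_BASE = "https://hub.culturegraph.org/entityfacts/"


def entity_facts_url(gnd_id: str) -> str:
    """Build the request URL for one GND ID."""
    return ENTITY_FACTS_BASE + urllib.parse.quote(gnd_id, safe="")


def preferred_name(record: dict):
    """Return the preferredName of an Entity Facts record, or None if absent."""
    return record.get("preferredName")


def lookup_name(gnd_id: str):
    """Fetch one record and return its preferred name (needs network access)."""
    with urllib.request.urlopen(entity_facts_url(gnd_id), timeout=30) as response:
        return preferred_name(json.load(response))


# Made-up record; a real response contains many more fields.
sample_record = {"@id": "https://d-nb.info/gnd/118541951", "preferredName": "Example, Person"}
print(entity_facts_url("118541951"), preferred_name(sample_record))
```

Facepager does the same for every selected child node, using the node's Object ID as the GND ID.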

If you want to prepare and clean the data to eventually visualise it in a network graph, continue with the last two sections of this Getting Started.

Part 3: R script

If you wish to continue by visualising the data you just collected, the next step is all about data preparation using RStudio. The R script below serves as a jump-start and instantly creates a nodes list and an edges list both needed for the network analysis later on.

  1. Begin by creating a new RStudio project.
  2. Move the CSV file exported in Part 2 (here named ckg_gregorovius.csv) into the project directory.
  3. Create a new R script, paste the following code, then run the script to produce two new CSV files containing nodes and edges. They will be saved in your project directory. Make sure to set the working directory to the source file location via the Session tab in the menu bar.
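# load packages and data
library(tidyverse)
# year() comes from lubridate; tidyverse attaches it only from version 2.0.0 on,
# so it is loaded explicitly here for older installations
library(lubridate)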

data <- read_csv2("ckg_gregorovius.csv")

# prepare data
# - convert parent_id from string to numeric to enable left_join (as.numeric)
data$parent_id <- as.numeric(data$parent_id)

# - filter out irrelevant rows  (filter)
# - merge rows where id and parent_id match across rows (left_join)
# - combine values, rename & select relevant columns (mutate & select)
# - remove redundant rows by dropping NAs (drop_na) 
data <- data %>%
  filter(object_type == "data") %>%
  left_join(data, by = c("id" = "parent_id"), suffix = c("", ".y")) %>%
  mutate(
    year = year(date.value),
    addressee = coalesce(preferredName, preferredName.y)
    ) %>%
  select(
    id,
    parent_id,
    gnd=object_id,
    year,
    addressee,
    letter=letterLabel.value
    ) %>% 
  drop_na()

# edges
# - pair every person with all persons mentioned in the same letter (left_join)
# - drop pairs of a person with themselves (filter)
# - select and rename columns  (select)
# - remove duplicates  (distinct)
edges <- data %>%
  left_join(data,by="letter") %>% 
  filter(gnd.x != gnd.y) %>% 
  select(source=gnd.x,target=gnd.y) %>%
  distinct() %>%
  na.omit()

# nodes
# - select and rename columns (select)
# - remove duplicates (distinct), omit if frequency is of relevance
nodes <- data %>%
  select(ID=gnd,Label=addressee, year) %>% 
  distinct()

# save nodes and edges for Gephi
write_csv2(edges,"gregorovius_edges.csv",na = "")
write_csv2(nodes,"gregorovius_nodes.csv",na = "")
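The edge construction in the script is the least obvious step: joining the table with itself on the letter column pairs every person with every other person mentioned in the same letter, in both directions. The following Python sketch reproduces that logic on made-up rows. It only illustrates the idea and does not replace the R script. The function name build_edges and the sample data are assumptions.

```python
from itertools import permutations

# Made-up rows standing in for the cleaned table: (letter, gnd)
rows = [
    ("Letter A", "111"),
    ("Letter A", "222"),
    ("Letter A", "333"),
    ("Letter B", "111"),
    ("Letter B", "222"),
]


def build_edges(rows):
    """Return distinct (source, target) pairs of persons sharing a letter."""
    persons_by_letter = {}
    for letter, gnd in rows:
        persons_by_letter.setdefault(letter, set()).add(gnd)

    edges = set()
    for persons in persons_by_letter.values():
        # Ordered pairs of two different persons, like the self-join
        # followed by filter(gnd.x != gnd.y) in the R script.
        edges.update(permutations(sorted(persons), 2))
    return sorted(edges)


for source, target in build_edges(rows):
    print(source, target)
```

Because every pair appears in both directions, the resulting edge list describes co-mentions rather than a sender-to-receiver relation.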

Part 4: Gephi

You are now ready to visualise the data in Gephi and get an intuitive idea of Ferdinand Gregorovius' social network of the 19th century. Gephi is of course not your only option. R and RStudio, for example, let you build network graphs as well.

  1. Import data: Find the Import spreadsheet button in the Data Laboratory tab to first load the nodes list and then the edges list. When navigating the dialogue box, make sure that you import the nodes list as a Nodes table and the edges list as an Edges table. Select Append to existing workspace at the end of the import.
  2. Organise network: Head back to the Overview tab to see your nodes distributed randomly in a dense cloud. Run the layout algorithm Fruchterman Reingold to kick-start your network visualisation. In this layout, the more frequently nodes are connected, the closer they are positioned to each other. Take some time to play around with the settings. Further exploration of the network might be enabled by altering the colour or size of nodes in the Appearance section or by applying time-based filters.
  3. Export network graph: Select the Preview tab to export your network graph either as SVG, PDF, or PNG. Here, you will also find some additional options to finalise the appearance of your graph.

There is a wealth of resources available online to get a more profound idea of the available features in Gephi. Start by looking up tutorials on YouTube about network visualisations as well as analyses and check out briatte's Awesome Network Analysis resource list on GitHub. Or, if German does not scare you off, have a look at Jünger and Gärtner's (2023) open-access introduction to Computational Methods for the Social Sciences and Humanities (Chapter 3 focusses on data formats, Chapter 10 on network analyses).

What's next?
