Getting Started with Wikidata - strohne/Facepager GitHub Wiki

[This file is currently work in progress]

This tutorial introduces you to the steps necessary to construct a basic network linking German writers of several literary movements through places relevant to their lives and work, using freely available data from Wikidata. The emerging spatial clusters potentially tell us about mutual influences between writers and possibly even between movements. Does the emergence of new literary movements, for example, coincide with geographical shifts mirroring the beginning of a new zeitgeist? Ideally, keeping this aim in mind will not only help you follow the steps below more easily but also prompt you to think of your own (advanced) questions.

To start off, you will learn how to obtain all the data you need using Facepager. After successfully exporting the data, in the second part of this tutorial, you will find an R script that prepares a node list and an edge list ready to be imported into Gephi. Finally, we provide a short overview of how to visualise the network using Gephi.

Always make sure to comply with Wikidata's User-Agent policy and be mindful of their query limits.
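As a rough illustration of what a descriptive header can look like, the sketch below assembles a User-Agent string in base R. The function name `make_user_agent`, the tool name, the version, and the contact address are all placeholders invented for this example. The general shape (client name and version, followed by contact information in parentheses) follows the Wikimedia User-Agent policy as we understand it. This only matters if you query Wikidata from your own scripts; in Facepager, the preset already comes with its headers.

```r
# Build a descriptive User-Agent string: "<tool>/<version> (<contact>)".
# All values passed in below are hypothetical placeholders.
make_user_agent <- function(tool, version, contact) {
  sprintf("%s/%s (%s)", tool, version, contact)
}

user_agent <- make_user_agent(
  tool    = "WritersNetworkTutorial",
  version = "0.1",
  contact = "mailto:you@example.org"
)

# Base R download functions (e.g. download.file) pick up this option.
options(HTTPUserAgent = user_agent)
print(user_agent)
```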

To follow along, you will need to install the following software: Facepager, R together with RStudio, and Gephi.

Depending on your previous experience, it will take some time to familiarise yourself with these applications. However, getting to know them step-by-step will eventually enable you to create the kind of networks you are interested in.

At the heart of our inquiry lies SPARQL, a semantic query language used to access and retrieve data stored in RDF (Resource Description Framework) databases such as Wikidata. If you want to learn more about SPARQL, check out our ground-up introduction over at Getting Started with Knowledge Graph. Learning the basics of its functionality, or diving deeper into its syntax, will not only allow you to read any SPARQL triple pattern but also equip you with skills that transfer to many useful applications. If you plan on working with Wikidata more often in the future, you might want to have a look at Wikidata's own introduction to SPARQL.

Part 1: Facepager

A network consists of nodes and edges. The latter represent the relations between individual nodes. Always think about how you want your network to look in the end before collecting any data. Here, for example, nodes are supposed to display the names of writers, places, and literary movements all coloured distinctively. Let's start by extracting the data we are interested in from Wikidata and exporting it using Facepager:
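To make the target structure concrete, here is a minimal hand-written sketch of a node list and an edge list in base R. The three entries are illustrative values typed in by hand, not data fetched from Wikidata. The column names `id`, `type`, `source`, and `target` mirror the ones used by the R script in Part 2.

```r
# Nodes: one row per entity, with a type used later for colouring.
nodes <- data.frame(
  id   = c("Novalis", "Jena", "Romanticism"),
  type = c("writer", "place", "movement"),
  stringsAsFactors = FALSE
)

# Edges: one row per relation between two node ids.
edges <- data.frame(
  source = c("Novalis", "Novalis"),
  target = c("Jena", "Romanticism"),
  stringsAsFactors = FALSE
)

print(nodes)
print(edges)
```

Every value in `source` and `target` refers to an `id` in the node list; the writer is the hub that links a place to a movement.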

  1. Create a database: Click New Database in the Menu Bar of Facepager to create a blank database and save it into the directory of your choice.
  2. Set up the Generic module: From the Presets tab, again found in the Menu Bar, choose and apply the Wikidata preset "Writers Network". This will fill in the Generic module in the Query Setup automatically. The preset contains the URL of the Wikidata Query Service, dictates the output format, and, most notably, includes a SPARQL query. The query determines what information Facepager will fetch from Wikidata. Although it might look daunting at first, understanding its syntax (SPARQL) is necessary if you plan on customising the preset. Fortunately, Wikidata provides a great introduction to SPARQL and its query service. Further, Wikidata's own query builder serves as a beginner-friendly starting point should you want to build your own basic query from scratch. Also, in keeping with Wikidata's access best practices, please keep the preset headers.

  3. Add nodes: Now you are ready to add one or more literary movements as nodes. You can do so by selecting Add Nodes in the Menu Bar. You can find the identifier of a literary movement by searching for it on Wikidata. Simply copy the Q-number of the result into the open dialogue box in Facepager.

  4. Fetch data: Select one or more seed nodes and hit Fetch Data. Facepager will then run the preset SPARQL query against Wikidata's database. You can inspect the data by expanding your node or clicking Expand nodes in the Menu Bar. For more detail, select a child node and review the raw data displayed in the Data View to the right. If you take a close look at the query, you will find that it first collects all subjects that are part of the specified movement, whose occupation is writer, and who wrote in German. It then asks for all places relevant to these subjects, including their places of birth and death as well as where they were educated and employed.
  5. Export data: Expand the nodes and select all the nodes you want to export. Hit Export Data to get a CSV file. Take note of the options provided by the export dialogue. You can open CSV files with Excel or any statistics software you like.
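The logic of the query described in the Fetch data step can be approximated as follows. This is a hedged sketch, not the query shipped with the "Writers Network" preset: the function name `build_writers_query` is made up for this example, the choice of Wikidata properties (P135 movement, P106 occupation, P6886 writing language, P19 and P20 places of birth and death, P69 educated at, P108 employer) is our assumption about a reasonable mapping, and Q37068 is assumed to be the identifier for Romanticism, so please verify it on Wikidata. Note that P69 and P108 point to institutions rather than geographic places, so the real preset may resolve locations differently.

```r
# Assemble a SPARQL query for one literary movement (a hedged sketch,
# NOT the query shipped with the "Writers Network" preset).
build_writers_query <- function(movement_qid) {
  # Only accept plain Q-numbers such as "Q37068".
  stopifnot(grepl("^Q[0-9]+$", movement_qid))
  paste(
    "SELECT ?writer ?writerLabel ?place ?placeLabel WHERE {",
    sprintf("  ?writer wdt:P135 wd:%s ;   # movement", movement_qid),
    "          wdt:P106 wd:Q36180 ;   # occupation: writer",
    "          wdt:P6886 wd:Q188 .    # writing language: German",
    "  # place of birth, place of death, educated at, employer",
    "  ?writer wdt:P19|wdt:P20|wdt:P69|wdt:P108 ?place .",
    "  SERVICE wikibase:label { bd:serviceParam wikibase:language \"en\". }",
    "}",
    sep = "\n"
  )
}

# Q37068 is assumed to be Romanticism; check the Q-number on Wikidata.
query <- build_writers_query("Q37068")
cat(query, "\n")

# URL-encoded form, as it would appear in a GET request to the query service.
encoded <- utils::URLencode(query, reserved = TRUE)
```

Pasting the printed query into the Wikidata Query Service is a quick way to experiment with it before touching the preset.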

Part 2: R script

As a next step, you will need to prepare the data in RStudio. The R script below allows you to create a nodes list and an edges list both needed for the network analysis later on.

  1. Begin by creating a new RStudio project.
  2. Move the CSV file (writers_network.csv) into the project directory.
  3. Create a new R script, paste the following code, then run the script to produce two new CSV files containing nodes and edges. Make sure to set the working directory to the source file location via the Session tab in the menu bar.
# Load packages and data
library(tidyverse)

data <- read_csv2("writers_network.csv")

# Prepare data
# - filter out irrelevant rows  (filter)
# - select relevant columns (select)
data <- data %>%
  filter(object_type == "data") %>%
  select(
    id,
    parent_id,
    object_id,
    writer=writerLabel.value,
    place=placeLabel.value,
    movement=movement.value
    )

# edges
# - select relevant columns (select)
# - keep only one mention per data point (distinct)
# - drop rows with missing values, e.g. a writer without a recorded place (drop_na)
edges_writer_place <- data %>% 
  select(source=writer,target=place) %>% 
  distinct(source,target) %>% 
  drop_na()
edges_writer_movement <- data %>% 
  select(source=writer,target=movement) %>% 
  distinct(source,target) %>% 
  drop_na()
edges <- bind_rows(edges_writer_place,edges_writer_movement)

# nodes
# - select relevant columns (select)
# - keep only one mention per data point (distinct)
# - drop rows with missing values (drop_na)
# - add column to determine type (for use in Gephi)
nodes_writers <- data %>%
  select(id=writer) %>%
  distinct() %>% 
  drop_na() %>% 
  mutate(type="writer")
nodes_places <- data %>%
  select(id=place) %>%
  distinct() %>% 
  drop_na() %>% 
  mutate(type="place")
nodes_movements <- data %>%
  select(id=movement) %>%
  distinct() %>% 
  drop_na() %>% 
  mutate(type="movement")
nodes <- bind_rows(nodes_writers,nodes_places,nodes_movements)


# Save nodes and edges for Gephi
write_csv2(edges,"writers_edges.csv",na = "")
write_csv2(nodes,"writers_nodes.csv",na = "")
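Before moving on to Gephi, it can be worth checking that every edge endpoint also appears in the node list, since stray endpoints tend to show up as unlabelled extra nodes. The following base R sketch shows one way to do this. The helper name `missing_endpoints` and the small example tables are made up for illustration; in practice you would pass the `nodes` and `edges` objects created by the script above.

```r
# Return edge endpoints that have no matching row in the nodes list.
missing_endpoints <- function(nodes, edges) {
  endpoints <- unique(c(edges$source, edges$target))
  setdiff(endpoints, nodes$id)
}

# Small hand-made example; replace with the `nodes` and `edges` built above.
example_nodes <- data.frame(
  id   = c("Novalis", "Jena"),
  type = c("writer", "place"),
  stringsAsFactors = FALSE
)
example_edges <- data.frame(
  source = c("Novalis", "Novalis"),
  target = c("Jena", "Romanticism"),
  stringsAsFactors = FALSE
)

# "Romanticism" is used in an edge but missing from the node list.
print(missing_endpoints(example_nodes, example_edges))
```

An empty result means the two files are consistent with each other.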

Part 3: Gephi

If you made it this far, you should now be ready to start visualising the data in Gephi and perform a rudimentary network analysis. Gephi is of course not your only option. R and RStudio, for example, let you build network graphs as well.

  1. Import data: Use the Import spreadsheet button in the Data Laboratory tab to load first the nodes list and then the edges list. When navigating the dialogue box, make sure that you import the nodes list as a Nodes table and the edges list as an Edges table. At the end of each import, select Append to existing workspace so that both files end up in the same workspace.
  2. Organise network: Head back to the Overview tab to see your nodes distributed randomly. Choose the layout algorithm Force Atlas 2 to start off your network visualisation. Don't forget to Run the selected simulation. In this layout, connected nodes are positioned closer to each other. Take some time to play around with the settings. You can pull nodes apart by using a higher scaling or avoid overlapping nodes by checking Prevent Overlap. Further exploration of the network might require you to calculate network measures in the Statistics section to the right or alter the colour or size of nodes in the Appearance section.
  3. Export network graph: Select the Preview tab to export your network graph either as SVG, PDF, or PNG. Here, you will also find some additional options to finalise the appearance of your graph.
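As a point of comparison for the network measures Gephi offers in its Statistics section, the simplest of them, node degree, can be computed directly from an edge list. The base R sketch below assumes an edge table with `source` and `target` columns as produced in Part 2. The function name `node_degree` and the example rows are hand-made for illustration and treat the network as undirected.

```r
# Count how many edges touch each node (undirected degree).
node_degree <- function(edges) {
  counts <- table(c(edges$source, edges$target))
  degree <- setNames(as.integer(counts), names(counts))
  sort(degree, decreasing = TRUE)
}

# Hand-made example edges, not fetched from Wikidata.
example_edges <- data.frame(
  source = c("Novalis", "Novalis", "Schlegel"),
  target = c("Jena", "Romanticism", "Jena"),
  stringsAsFactors = FALSE
)

degree <- node_degree(example_edges)
print(degree)
```

Places with a high degree are the ones shared by many writers, which is exactly the kind of spatial clustering this tutorial sets out to explore.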

There is a wealth of resources available online to get a more profound idea of the available features in Gephi. Start by looking up tutorials on YouTube about network visualisations as well as analyses and check out briatte's Awesome Network Analysis resource list on GitHub. Or, if German does not scare you off, plunge straight into Jünger and Gärtner's (2023) thorough open-access introduction to Computational Methods for the Social Sciences and Humanities (Chapter 3 focusses on data formats, Chapter 10 on network analyses).

What's next?

  • Learn about the basic concepts of Facepager.
  • In Facepager itself you will find further presets that allow you to query various freely accessible Knowledge Graph databases using SPARQL. Simply check out our Preset category "Knowledge Graph".
  • Look at a similar tutorial showing how to prepare a network analysis of related YouTube videos, or learn from Getting Started with Knowledge Graph how to use SPARQL to query further graph databases with Facepager.