Current Pipeline for Adding a New Dataset

1. Retrieve a data dump of the target dataset

Because each new target dataset is unique, retrieving a data dump may not be straightforward. Some datasets will offer data dumps on their website or in an easily accessible place (like GitHub), but others may require the use of a web scraper/crawler. It may be prudent to contact the administrators of the database to see if they are willing to help procure a data dump.
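
If the dataset does publish a dump at a stable URL, downloading it can be as simple as the following sketch (the URL and filename below are placeholders, not a real dataset):

```python
import requests

# Placeholder URL: substitute the dataset's actual data-dump location.
DUMP_URL = "https://example.org/exports/dataset-dump.csv"

# Stream the download so large dumps are not held entirely in memory.
response = requests.get(DUMP_URL, stream=True, timeout=60)
response.raise_for_status()

with open("dataset-dump.csv", "wb") as outfile:
    for chunk in response.iter_content(chunk_size=8192):
        outfile.write(chunk)
```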

2. Reformat the data dump as needed

We currently prefer working with data dumps in CSV format, as CSV is easy to reconcile in OpenRefine. When possible, obtain new data dumps in CSV or convert them to CSV.

However, for larger and/or more complex datasets, other file formats may be necessary (such as JSON-LD or TTL). In these cases, custom methods must be used for Step 3 (reconciliation) and Step 4 (RDF conversion). See the DIAMM and MusicBrainz documentation for examples of custom methods for reconciliation and conversion to RDF.
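
As an illustration of the kind of reformatting this step involves, the sketch below flattens a hypothetical JSON dump (a list of objects) into a CSV that OpenRefine can load; the filenames and field layout are assumptions, not any particular dataset's schema:

```python
import csv
import json

# Load a hypothetical JSON dump: a list of flat objects.
with open("dataset-dump.json", encoding="utf-8") as infile:
    records = json.load(infile)

# Collect every key that appears in any record so no column is dropped.
fieldnames = sorted({key for record in records for key in record})

# Write one CSV row per JSON object; missing keys become empty cells.
with open("dataset-dump.csv", "w", newline="", encoding="utf-8") as outfile:
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
```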

3. Reconcile the data to Wikidata

We currently use OpenRefine to semi-automatically reconcile items to Wikidata entities (QIDs) for most datasets. OpenRefine can also export operation histories, allowing users to reapply transformations previously performed by someone else.

Note: larger and/or more complex datasets may require custom methods for reconciliation.

Refer to Data Reconciliation Guidelines for general notes, tips, and tricks. Refer to OpenRefine Tips for OpenRefine-specific information.
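
For spot-checking individual matches outside of OpenRefine, one option is to query the public Wikidata search API directly; the sketch below only illustrates fetching candidate QIDs for a single label and is not part of the standard reconciliation workflow:

```python
import requests

# Wikidata's public search endpoint; the search term is just an example.
API_URL = "https://www.wikidata.org/w/api.php"
params = {
    "action": "wbsearchentities",
    "search": "Wolfgang Amadeus Mozart",
    "language": "en",
    "format": "json",
}

response = requests.get(API_URL, params=params, timeout=30)
response.raise_for_status()

# Print candidate QIDs and labels so a human can pick the right match.
for candidate in response.json().get("search", []):
    print(candidate["id"], "-", candidate.get("label", ""))
```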

4. Convert the reconciled data to RDF

This repository contains rdfconv, a set of scripts that help with property matching and RDF conversion.

tomlgen.py will automatically generate a TOML file based on a folder of reconciled CSV files. This generated TOML file must then be manually updated with appropriate properties to be used for RDF conversion. Refer to RDF Conversion Guidelines for general notes, tips, and tricks on property matching. Note that not every item in the generated TOML will require (or benefit from) a property match.

convert.py can then be used to process the original CSV files and, using the associated TOML file, generate a valid TTL file for import into Virtuoso.
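
The sketch below illustrates the general idea of this conversion step using rdflib, with a plain Python dict standing in for the TOML mapping; the column names, IRI patterns, and property choices are placeholders, and the actual behaviour of convert.py is defined by the scripts in the repository:

```python
import csv

from rdflib import Graph, Literal, Namespace, URIRef

WDT = Namespace("http://www.wikidata.org/prop/direct/")
WD = Namespace("http://www.wikidata.org/entity/")

# Stand-in for the TOML mapping: CSV column -> Wikidata property.
# In the real pipeline this mapping lives in the generated TOML file.
PROPERTY_MAP = {
    "title": WDT.P1476,            # P1476 = title
    "composer_wikidata": WDT.P86,  # P86 = composer
}

g = Graph()
g.bind("wdt", WDT)
g.bind("wd", WD)

with open("works_reconciled.csv", newline="", encoding="utf-8") as infile:
    for row in csv.DictReader(infile):
        # Placeholder subject IRI pattern; real IRIs follow the project's scheme.
        subject = URIRef(f"https://example.org/work/{row['id']}")
        for column, prop in PROPERTY_MAP.items():
            value = (row.get(column) or "").strip()
            if not value:
                continue
            # Columns holding reconciled QIDs become Wikidata entity IRIs;
            # all other values are kept as plain literals.
            if column.endswith("_wikidata"):
                obj = WD[value]
            else:
                obj = Literal(value)
            g.add((subject, prop, obj))

g.serialize(destination="works.ttl", format="turtle")
```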

Note: larger and/or more complex datasets may require custom methods for RDF conversion.

Refer to RDF Conversion Guidelines for general notes, tips, and tricks.

5. Import the RDF files to Virtuoso

Refer to Importing and Updating Data on Virtuoso for a step-by-step guide to importing and updating data on Virtuoso.

6. Generate a visual of the new subgraph ontology

Refer to Visualizing the Graph Relationships for a step-by-step guide to generating a visual of a graph ontology.

7. Review the visual, perform test SPARQL queries, update as needed

To ensure the ontology makes sense, carefully review the graph visual and RDF property mappings for consistency and clarity. Because one goal is to standardize the ontology, make sure each new subgraph follows the mappings laid out by previous subgraphs whenever possible.
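
Test SPARQL queries can be run directly against the Virtuoso SPARQL endpoint; in the sketch below, the endpoint URL and graph IRI are assumptions that should be replaced with the deployment's actual values. Listing the distinct predicates in the new subgraph is a quick way to compare them against existing property mappings:

```python
import requests

# Assumed local Virtuoso endpoint and an example graph IRI; replace both
# with the values used by the actual deployment.
ENDPOINT = "http://localhost:8890/sparql"
GRAPH_IRI = "https://example.org/newdataset/"

# List every predicate used in the new subgraph so it can be compared
# against the property mappings of existing subgraphs.
query = f"""
SELECT DISTINCT ?p
FROM <{GRAPH_IRI}>
WHERE {{ ?s ?p ?o }}
ORDER BY ?p
"""

response = requests.get(
    ENDPOINT,
    params={"query": query, "format": "application/sparql-results+json"},
    timeout=60,
)
response.raise_for_status()

for binding in response.json()["results"]["bindings"]:
    print(binding["p"]["value"])
```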

Once finished, add the new subgraph visual to LinkedMusic Ontology Separated by Subgraph.

8. Generate a visual of the full LinkedData ontology

Refer to Visualizing the Graph Relationships for a step-by-step guide to generating a visual of a graph ontology.

Make sure to chain all subgraph ontologies when using RDF Grapher to generate the visual.
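
One way to chain the subgraph ontologies is to merge their TTL files with rdflib before pasting the result into RDF Grapher; the directory and file names below are placeholders:

```python
from pathlib import Path

from rdflib import Graph

# Placeholder directory holding one ontology TTL file per subgraph.
ontology_dir = Path("ontologies")

merged = Graph()
for ttl_file in sorted(ontology_dir.glob("*.ttl")):
    # Parsing everything into a single Graph also deduplicates any
    # triples shared between subgraph ontologies.
    merged.parse(str(ttl_file), format="turtle")

# The merged file can then be pasted into RDF Grapher in one go.
merged.serialize(destination="full_ontology.ttl", format="turtle")
```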

Update The Full LinkedMusic Data Lake Ontology with the new full ontology.

9. Update the NLQ2SPARQL context

Because the NLQ2SPARQL Context contains the full graph ontology, it must be updated every time a new subgraph is added. Insert the RDF generated by the SPARQL command used to produce the graph ontology before the last line of the context.

Remember to also add the database and its IRI to the list of graphs in the section beginning with "Here are the <n> databases currently in LinkedMusic, and the IRIs for their RDF graphs:". The sentence is built from the following template:

All triples for <database name> are stored in the <<graph iri>> graph, and their entity types use the `<prefix>` prefix.

Don't forget to increment <n>.

10. Create and validate sample queries relating to the new subgraph

For further testing, update Sample LinkedMusic Queries with queries for the newly added subgraph that address all four challenges.

11. Update the Project Status page to reflect the addition of the new dataset

Add the new dataset to the list of Completed Work and remove it from the Datasets in-progress or Datasets to ingest in the near future list, as needed.

Each new dataset is listed using the following format:

```
[<name>](<link to source>) - <short description>
- <n> RDF triples
- [visualization](<link to visualization>)
- [documentation](<link to documentation>)
```

Use the SPARQL command found in Enumerating RDF Triples to generate a count of the number of triples within each dataset currently in the data lake.
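
The canonical command is on that page; as a rough sketch of the same idea (the endpoint URL and graph IRI are placeholders), a per-graph triple count can be retrieved like this:

```python
import requests

# Assumed endpoint and graph IRI; replace with the deployment's values.
ENDPOINT = "http://localhost:8890/sparql"
GRAPH_IRI = "https://example.org/newdataset/"

query = f"SELECT (COUNT(*) AS ?triples) FROM <{GRAPH_IRI}> WHERE {{ ?s ?p ?o }}"

response = requests.get(
    ENDPOINT,
    params={"query": query, "format": "application/sparql-results+json"},
    timeout=60,
)
response.raise_for_status()

print(response.json()["results"]["bindings"][0]["triples"]["value"])
```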

Don't forget to update the total number of triples at the top of the page!
