Home - DDMAL/linkedmusic-datalake GitHub Wiki

What is LinkedMusic?

Welcome to the Wiki for LinkedMusic, an open-source platform that combines detailed music metadata from many different databases into one searchable graph. It collects information about musical works, performances, recordings, composers, performers, and more from various online databases and converts it into a unified RDF-based structure. This means you can access and explore relationships, history, and annotations across all these sources through a single endpoint.

To link our data, we reconcile entities to Wikidata using OpenRefine. All reconciled RDF data is then loaded into a Virtuoso quad store, which provides robust SPARQL querying and graph management. Eventually, this graph will be queried through a Large Language Model (LLM)-based interface that converts user input into SPARQL queries.

There is a production Virtuoso server running at https://virtuoso.simssa.ca/ and a staging server at https://virtuoso.staging.simssa.ca/. A McGill VPN is required to access the site remotely.
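As a rough illustration of what querying the graph could look like, the sketch below builds a SPARQL 1.1 Protocol GET request using only the Python standard library. The `/sparql` path is an assumption (it is Virtuoso's default, but this page only gives the base URLs), and the function names `build_query_url` and `run_query` are hypothetical. The query is deliberately generic so that it assumes nothing about the vocabulary used in the graph.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the endpoint lives at Virtuoso's default "/sparql" path.
# This page only documents the base URL of the staging server.
ENDPOINT = "https://virtuoso.staging.simssa.ca/sparql"

# Generic query: return any five triples, without assuming a vocabulary.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_query_url(endpoint: str, query: str) -> str:
    """Return a SPARQL 1.1 Protocol GET URL for the given query."""
    params = urllib.parse.urlencode({"query": query})
    return f"{endpoint}?{params}"


def run_query(endpoint: str, query: str, timeout: float = 30.0) -> list:
    """Send the query and return the JSON result bindings.

    Needs network access to the endpoint (remotely, via the McGill VPN).
    """
    request = urllib.request.Request(
        build_query_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["results"]["bindings"]
```

From a machine that can reach the server, `run_query(ENDPOINT, QUERY)` should return a list of binding dictionaries, each with `s`, `p`, and `o` keys, provided the endpoint path assumed above is correct.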

[UNDER CONSTRUCTION] Structure of This Wiki

This Home page explains the purpose of LinkedMusic and shows how the Wiki is organized. The following pages cover each part of our process in more depth:

  • Data Ingestion: How we gather raw metadata, whether through APIs, data dumps, or manual exports, and how we prepare it for the graph.
  • Reconciliation: How we use OpenRefine to reconcile entities with Wikidata.
  • Export & Loading: How we export reconciled data in RDF formats (Turtle, RDF/XML) and load it into our Virtuoso quad store.
  • Querying: Plans for querying the database via our SPARQL endpoint and for using LLMs to convert natural-language queries to SPARQL (NLQ2SPARQL).

Completed Work

Currently, we have finished reconciling the following databases to Wikidata:

In-progress:

Near future:

For a more detailed outline of future work, see the Future Work Wiki page.

Why LinkedMusic?

The Challenge of Scattered Metadata

Today, music metadata is stored in over a hundred independent databases, including MusicBrainz, RISM, TheSession, and university archives. Each system has its own set of fields, unique identifiers, and access methods. As a result, musicologists conducting in-depth research face several challenges:

  • Fragmented Research: Scholars must switch between many interfaces and manually combine results to comprehensively study a single composer or piece.
  • Varying Standards: Different naming conventions (e.g., “Mozart, W.A.” vs. “Wolfgang Amadeus Mozart”) and data formats cause confusion and mismatches.
  • High Technical Barriers: Writing SPARQL, working with APIs, and handling bulk data dumps all require advanced technical skills.
  • Custom Pipelines: Comparing data across sources often needs custom scripts for each pair of databases, making large-scale analysis difficult.

Our Solution

Our methodology embraces a flexible, linked‑data‑driven data lake architecture:

  • Data Ingestion: We import complete metadata dumps and API feeds for each source into a Virtuoso‑based RDF data lake.
  • Robust Entity Reconciliation: Entities are semi‑automatically aligned to Wikidata using OpenRefine reconciliation services. Each canonical URI retains provenance metadata for auditability.
  • Federated Indexing with SESEMMI: Although not yet implemented, the reconciled RDF graphs will be indexed in SESEMMI, our open‑source metasearch engine, enabling simultaneous, federated searches across LinkedMusic and external SPARQL endpoints.
  • Natural‑Language Query Layer: We will also use large language models to translate user inputs into SPARQL queries, supporting multilingual, culturally sensitive search and eliminating barriers for non‑technical users.
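To make the reconciliation and loading steps more concrete, here is a minimal sketch of how one reconciled entity might be written out as N-Quads, with the named graph recording which source it came from. This is an illustration under stated assumptions, not the project's actual export code: the choice of `owl:sameAs` and `rdfs:label` as predicates, the `example.org` URIs, and the function names are all hypothetical. Only the Wikidata entity namespace and the identifier Q254 (Wolfgang Amadeus Mozart) are real.

```python
WIKIDATA_ENTITY = "http://www.wikidata.org/entity/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def escape_literal(text: str) -> str:
    """Escape characters that may not appear raw inside an N-Quads literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def reconciled_row_to_nquads(local_uri: str, label: str, qid: str, graph_uri: str) -> list:
    """Return N-Quads lines linking one local entity to its Wikidata match.

    The fourth term of each quad is the named graph, used here to keep
    track of the source database (an assumed convention).
    """
    subject = f"<{local_uri}>"
    graph = f"<{graph_uri}>"
    return [
        f'{subject} <{RDFS_LABEL}> "{escape_literal(label)}" {graph} .',
        f"{subject} <{OWL_SAME_AS}> <{WIKIDATA_ENTITY}{qid}> {graph} .",
    ]


# Hypothetical source record whose name was reconciled to Wikidata Q254.
lines = reconciled_row_to_nquads(
    local_uri="https://example.org/source-a/person/123",  # hypothetical URI
    label="Mozart, W.A.",
    qid="Q254",
    graph_uri="https://example.org/graph/source-a",  # hypothetical graph name
)
print("\n".join(lines))
```

Keeping each source in its own named graph is one way a quad store can preserve provenance: a query can then restrict results to a single source or compare how different sources describe the same Wikidata entity.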

Benefits for Researchers

  • Comprehensive Coverage: Centralized access to metadata from over a hundred heterogeneous databases.
  • Data Quality & Provenance: Stable URIs and reconciliation audit trails ensure high‑fidelity, reproducible research.
  • Inclusive & Multilingual Search: Natural‑language queries combined with multilingual term mapping support culturally nuanced exploration.
  • Scalable Extensibility: New databases can be integrated with minimal configuration, allowing the graph to evolve alongside research needs.

With LinkedMusic, musicologists, librarians, and developers gain a unified, transparent, and user‑friendly environment for comprehensive metadata exploration and analysis.