Data Reconciliation Guidelines - DDMAL/linkedmusic-datalake GitHub Wiki

General Notes

These guidelines were written with the main goal of creating a first pass of the datalake. As such, the focus is on broad, not thorough reconciliation.
Our current focus is on gathering as many databases as possible into our datalake, so focus on automatic over manual reconciliation.
These guidelines were written after initial reconciliation exploration, and so any reconciliation done before the creations of these guidelines may not match all guidelines. Going forward, it is recommended to follow these guidelines for consistency.
In the future, it is recommended to base WikiData reconciliation off of MusicBrainz's reconciliation, as it is a large, already-reconciled database, and so can provide a good baseline for standardization. MusicBrainz has also already compiled lists of instruments and music genres.

OpenRefine Tips and Tricks

Note: Here is a good tutorial for OpenRefine reconciliation

1. Splitting columns

Determine how to split columns based off of the data type/predicate used, if possible
- For CSV files, this also means one property per column
For multi-valued cells, use Edit cells > Split multi-valued cells...
Columns may be split using a regex as the separator
- Use | (pipe) as the "or" operator between separators
- Make sure to select "regular expression"

2. Preserving original data before pre-processing and reconciliation

This step is done after splitting any multi-valued cells
- It is performed before pre-processing so that during RDF conversion, the original value can be used if the data is unreconciled
Before reconciling a column, duplicate it using Edit column > Add column based on this column...
- Name the column by using the original column name and adding "_original"

3. Choosing QIDs for reconciliation

Make sure you are in "rows" mode when reconciling!
Start with the most specific applicable category to reconcile against
If performing multiple passes, re-reconcile against only remaining unreconciled cells
- Use Reconcile > Facet > By judgment and select "None" to filter for only unreconciled cells
Be careful when reconciling against no particular type, especially if you do not have a relevant property to aid you

4. Using properties for reconciliation

Some columns may contain data useful to reconciling other columns, such as RISM IDs or ISO values
- It is also possible to add columns based on other columns to aid in reconciliation
Do not use too many properties though, as each property lengthens reconciliation time and does not inherently provide value

Note: country (P17) is different from located in the administrative territorial entity (P131) in WikiData

5. Matching data

In general, err on the conservative side and use Best candidate's score ranges of 99-100 for automatic reconciliation
Make sure to scan each reconciled column to see if it makes sense to broaden the score range
- Depending on the QIDs reconciled against, OpenRefine's judgment of a score may vary

6. Exporting the data

We currently export the data as CSV files in preparation for RDF conversion
Use Export > Custom tabular...
- For each reconciled column, select Matched entity's ID to preserve QIDs
- Make sure to select CSV as the download option
- Before downloading the file, copy and save the export JSON code for future use

7. JSON history and export files

OpenRefine allows users to save reconciliation history and export history JSON data that can later be imported and automatically executed
For each reconciled file, save both of these JSON files for future use

After reconciliation

See RDF Conversion Guidelines for converting reconciled CSV files into the desired RDF format

TO ADD

Do not fill down
Prepend URI or prefix
(Reminder for Sichen to add these points later)

Other Miscellaneous Points (to be reworked)

Detecting Redundant Properties Across Files

Multiple CSV downloads or exports from the same source can sometimes contain properties whose values overlap. For example, two spreadsheets might each include a "label_name" column for releases, and 2 releases might have the same label_name value.

This matters because there might be inconsistent values, where one release's label_name value might be slightly different from another's, despite being the same label. These inconsistent values would then affect reconciliation by duplicating efforts, and potentially causing subtle differences in mappings due to the different values.

To handle this, follow these steps:

Compare column names and data types across all CSV files for a given entity or dataset.
Consolidate overlapping columns: Choose one canonical column to represent a property (e.g., "release_label"), and ignore or merge others.
Record exceptions: If two columns are semantically distinct (e.g., "label_name" vs. "label_organization"), keep both but clarify their difference in a metadata note.

Handling Ambiguous Entity Types

Example: In TheSession, "recording" refers to an album, which is wd:Q482994. In Wikidata, "recorded music" is wd:Q49017950.
MusicBrainz has both "recorded music" and "album" as separate concepts. Choose the Wikidata class that best matches your local meaning.

Reconciling Entities (Q Items)

Determine Entity Class/Type

Identify each entity's class using SPARQL on interim RDF:

# Check the entity's type
?entity wdt:P31 ?wikidataClass .

Map local "type" columns (if present) to appropriate Wikidata subclasses or instances.
- If a CSV header lists "type" (e.g., "SeriesType"), treat it as a subtype of the main entity.
- Rename ambiguous header names (e.g., "type" in MusicBrainz Series) to typeOfContainedEntity to avoid confusion with RDF's rdf:type.

Choose Preferred Wikidata Item

For the initial data ingestion phase, do NOT spend more than ~1 minute on any single item. If the correct entity is not immediately apparent, leave the entry as a string literal, but document the item. You should have a running document of unclear entities. These will be investigated more thoroughly at a later stage. If you end up making a judgement call, you should record this as well. It may be useful to keep in mind that when multiple candidates match, the Q-ID that's more widely used and more stable tends to be correct. This is often the smaller Q-ID.

Record Keeping

Create a reconciliation sheet with columns: local_id, local_label, chosen_QID, reason_for_choice.
For controlled vocabularies (e.g., MusicBrainz label types like "Bootleg Production," "Original Production," etc.), record each term's mapping explicitly.

To Elaborate On

Processing dates and datetimes to be conform to xsd:date and xsd:dateTime requirements