Data Reconciliation Guidelines - DDMAL/linkedmusic-datalake GitHub Wiki
General Notes
- These guidelines were written with the main goal of creating a first pass of the datalake. As such, the focus is on broad, not thorough reconciliation.
- Our current focus is on gathering as many databases as possible into our datalake, so focus on automatic over manual reconciliation.
- These guidelines were written after initial reconciliation exploration, and so any reconciliation done before the creations of these guidelines may not match all guidelines. Going forward, it is recommended to follow these guidelines for consistency.
- In the future, it is recommended to base WikiData reconciliation off of MusicBrainz's reconciliation, as it is a large, already-reconciled database, and so can provide a good baseline for standardization. MusicBrainz has also already compiled lists of instruments and music genres.
OpenRefine Tips and Tricks
Note: Here is a good tutorial for OpenRefine reconciliation
1. Splitting columns
- Determine how to split columns based off of the data type/predicate used, if possible
- For CSV files, this also means one property per column
- For multi-valued cells, use
Edit cells > Split multi-valued cells...
- Columns may be split using a regex as the separator
- Use
|
(pipe) as the "or" operator between separators - Make sure to select "regular expression"
- Use
2. Preserving original data before pre-processing and reconciliation
- This step is done after splitting any multi-valued cells
- It is performed before pre-processing so that during RDF conversion, the original value can be used if the data is unreconciled
- Before reconciling a column, duplicate it using
Edit column > Add column based on this column...
- Name the column by using the original column name and adding "_original"
3. Choosing QIDs for reconciliation
- Make sure you are in "rows" mode when reconciling!
- Start with the most specific applicable category to reconcile against
- If performing multiple passes, re-reconcile against only remaining unreconciled cells
- Use
Reconcile > Facet > By judgment
and select "None" to filter for only unreconciled cells
- Use
- Be careful when reconciling against no particular type, especially if you do not have a relevant property to aid you
4. Using properties for reconciliation
- Some columns may contain data useful to reconciling other columns, such as RISM IDs or ISO values
- It is also possible to add columns based on other columns to aid in reconciliation
- Do not use too many properties though, as each property lengthens reconciliation time and does not inherently provide value
Note:
country
(P17) is different fromlocated in the administrative territorial entity
(P131) in WikiData
5. Matching data
- In general, err on the conservative side and use
Best candidate's score
ranges of 99-100 for automatic reconciliation - Make sure to scan each reconciled column to see if it makes sense to broaden the score range
- Depending on the QIDs reconciled against, OpenRefine's judgment of a score may vary
6. Exporting the data
- We currently export the data as CSV files in preparation for RDF conversion
- Use
Export > Custom tabular...
- For each reconciled column, select
Matched entity's ID
to preserve QIDs - Make sure to select CSV as the download option
- Before downloading the file, copy and save the export JSON code for future use
- For each reconciled column, select
7. JSON history and export files
- OpenRefine allows users to save reconciliation history and export history JSON data that can later be imported and automatically executed
- For each reconciled file, save both of these JSON files for future use
After reconciliation
- See RDF Conversion Guidelines for converting reconciled CSV files into the desired RDF format
TO ADD
- Do not fill down
- Prepend URI or prefix
(Reminder for Sichen to add these points later)
Other Miscellaneous Points (to be reworked)
Detecting Redundant Properties Across Files
Multiple CSV downloads or exports from the same source can sometimes contain properties whose values overlap. For example, two spreadsheets might each include a "label_name" column for releases, and 2 releases might have the same label_name value.
This matters because there might be inconsistent values, where one release's label_name value might be slightly different from another's, despite being the same label. These inconsistent values would then affect reconciliation by duplicating efforts, and potentially causing subtle differences in mappings due to the different values.
To handle this, follow these steps:
- Compare column names and data types across all CSV files for a given entity or dataset.
- Consolidate overlapping columns: Choose one canonical column to represent a property (e.g., "release_label"), and ignore or merge others.
- Record exceptions: If two columns are semantically distinct (e.g., "label_name" vs. "label_organization"), keep both but clarify their difference in a metadata note.
Handling Ambiguous Entity Types
- Example: In TheSession, "recording" refers to an album, which is
wd:Q482994
. In Wikidata, "recorded music" iswd:Q49017950
. - MusicBrainz has both "recorded music" and "album" as separate concepts. Choose the Wikidata class that best matches your local meaning.
Reconciling Entities (Q Items)
Determine Entity Class/Type
-
Identify each entity's class using SPARQL on interim RDF:
# Check the entity's type ?entity wdt:P31 ?wikidataClass .
-
Map local "type" columns (if present) to appropriate Wikidata subclasses or instances.
- If a CSV header lists "type" (e.g., "SeriesType"), treat it as a subtype of the main entity.
- Rename ambiguous header names (e.g., "type" in MusicBrainz Series) to
typeOfContainedEntity
to avoid confusion with RDF'srdf:type
.
Choose Preferred Wikidata Item
For the initial data ingestion phase, do NOT spend more than ~1 minute on any single item. If the correct entity is not immediately apparent, leave the entry as a string literal, but document the item. You should have a running document of unclear entities. These will be investigated more thoroughly at a later stage. If you end up making a judgement call, you should record this as well. It may be useful to keep in mind that when multiple candidates match, the Q-ID that's more widely used and more stable tends to be correct. This is often the smaller Q-ID.
Record Keeping
- Create a reconciliation sheet with columns:
local_id
,local_label
,chosen_QID
,reason_for_choice
. - For controlled vocabularies (e.g., MusicBrainz label types like "Bootleg Production," "Original Production," etc.), record each term's mapping explicitly.