ID Matching

One of the advantages of having a unique identifier for each item in a dataset is that unique IDs can simplify auto-connecting or auto-merging multiple datasets for various use cases. IDs can also help to improve the consistency of data.

For example, suppose you have three datasets related to schools: one with school performance information, another with the facilities available at each school, and another with geographical locations (coordinates).

  • To simplify analysis of the correlation between performance and available facilities per school, you may need to merge the performance and facilities datasets.
  • To build a map visualization of school performance and facilities distribution, you may need to merge these datasets with a dataset that provides locations.

The task of merging these different school datasets for further analysis becomes much simpler if they all share unique identifiers.
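As a minimal sketch of why shared IDs help (assuming Python with pandas, and hypothetical file and column names), merging the three datasets on a common school_id becomes a couple of lines:

    import pandas as pd

    # Hypothetical files; each dataset shares a unique "school_id" column.
    performance = pd.read_csv("school_performance.csv")
    facilities = pd.read_csv("school_facilities.csv")
    locations = pd.read_csv("school_locations.csv")

    # With a common unique ID, each merge is a single join on that column.
    merged = (performance
              .merge(facilities, on="school_id")
              .merge(locations, on="school_id"))
    merged.to_csv("schools_merged.csv", index=False)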

Adding IDs to a dataset based on IDs from another dataset

Not all datasets will contain common IDs, so to connect them together you may need to add IDs to one dataset based on another. For a dataset containing only a few records this can reasonably be done manually, but for a big dataset with several hundreds or thousands of entries an automated process can be very useful. Automatically adding existing IDs to a dataset can be easy or tricky depending on the content of the dataset the IDs are extracted from and of the target dataset where the IDs need to be introduced.

Tools like Google Fusion Tables, LibreOffice and MS Excel can be used to auto-merge IDs into a dataset by using another unique field that is common between the dataset without IDs and the dataset with IDs.

If there is no single common, unique field that can be used for matching IDs, multiple fields can be combined to act as a common unique criterion in this process.

With reasonable effort, a custom program or script can introduce IDs based on a combination of multiple fields acting as a common unique property between the dataset without IDs and the dataset from which IDs can be extracted.
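As an illustrative sketch of that approach (assuming Python with pandas; file and column names are hypothetical), several fields can be combined into one composite key, and the IDs pulled across wherever the key matches:

    import pandas as pd

    coded = pd.read_csv("coded.csv")          # has a "code" ID column
    non_coded = pd.read_csv("non_coded.csv")  # lacks IDs

    # No single field is unique, so combine several into one matching key,
    # normalising case and whitespace so trivial differences do not block a match.
    key_fields = ["name", "district", "school_type"]

    def make_key(df):
        cleaned = df[key_fields].fillna("").apply(lambda col: col.str.strip().str.lower())
        return cleaned.agg("|".join, axis=1)

    coded["_key"] = make_key(coded)
    non_coded["_key"] = make_key(non_coded)

    # Pull the "code" column across wherever the composite key matches.
    result = non_coded.merge(coded[["_key", "code"]], on="_key", how="left")
    result.drop(columns="_key").to_csv("non_coded-with-ids.csv", index=False)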

Reconcile-csv and OpenRefine

OpenRefine, together with a reconciliation service called Reconcile-csv, can be used to merge IDs by using a combination of multiple fields as a unique common property between two datasets.

Reconcile-csv is a reconciliation service for OpenRefine running from a CSV file. It uses fuzzy matching to match entries in one dataset to entries in another dataset, helping to introduce unique IDs into the system - so they can be used to join your data painlessly.

To use reconcile-csv you’ll need:

  • OpenRefine - a fantastic tool to clean messy data
  • Java - if OpenRefine runs on your machine, you already have it
  • Two datasets you want to join that contain some columns where you can figure out whether items are the same.

For more information on how to use reconcile-csv you can check the reconcile-csv documentation.
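As described in the reconcile-csv documentation, the service is started from the command line by pointing it at the CSV file that holds the IDs (the jar file name and version here are illustrative):

    java -jar reconcile-csv-0.1.2.jar <coded-dataset.csv> <search-column> <id-column>

The service then listens locally (by default at http://localhost:8000/reconcile), and that URL is added as a standard reconciliation service inside OpenRefine.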

Sample custom script for adding IDs to a schools dataset

A custom script can also be used to add unique identifiers to a dataset. Using modern programming languages like Python, sometimes with libraries like pandas, software developers can create scripts that accomplish various data analysis tasks more efficiently.

One example is the script named merge_school_codes.py, available here, which can be used to add unique identifiers to a CSV dataset of schools based on common school names, districts and school types.

You can use the script via the command-line interface by running:

python merge_school_codes.py <non-coded.csv> <coded.csv> <output-fields-file>

This will write its output to a file named <non-coded.csv>-out.csv, where <non-coded.csv> is the path to the CSV file into which you want to introduce unique identifiers, <coded.csv> is the path to the CSV file from which IDs can be extracted, and <output-fields-file> is the path to a text file that defines the list of output fields, one per line.
To run the script you will need to install Python, MongoDB and PyMongo. Both datasets must contain name, district and school_type columns; additionally, the script expects the unique identifiers to be in a column named code.
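For example (the file names here are hypothetical), the output-fields file could be a plain text file, say fields.txt, containing:

    code
    name
    district
    school_type

Running python merge_school_codes.py schools_without_codes.csv schools_with_codes.csv fields.txt would then produce schools_without_codes.csv-out.csv containing those four columns.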

Scripts like this can also be applied to a dataset before using OpenRefine, to eliminate the entries that match directly and leave only the rest for the fuzzy matching process, which usually takes longer than direct matching.
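A minimal sketch of such a pre-pass (again assuming pandas and hypothetical file names): split the rows into those matched directly on an exact key and those left over for fuzzy matching in OpenRefine.

    import pandas as pd

    coded = pd.read_csv("coded.csv")
    non_coded = pd.read_csv("non_coded.csv")
    key = ["name", "district", "school_type"]

    # Left-join on the exact key; rows that receive a code need no fuzzy matching.
    merged = non_coded.merge(coded[key + ["code"]], on=key, how="left")
    merged[merged["code"].notna()].to_csv("matched_directly.csv", index=False)

    # Only the leftover rows go to OpenRefine / reconcile-csv for fuzzy matching.
    merged[merged["code"].isna()].drop(columns="code").to_csv("needs_fuzzy_match.csv", index=False)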
