Match the Following - datameet-pune/datameet-pune.github.io GitHub Wiki
Aim
To match and join two tables where the columns to join might have slight spelling differences.
Detailed Aim
Given two sets of tabular data (left, and right) and a common column specified, to run word-similarity scripts to return most likely matches from right side for each row on left side. With filtering by another column if any to reduce the sample size
Use case scenario
- An NGO has a table of 1000+ village names from their on-field survey, with taluka (sub-district) and district column and other information regarding their survey.
- To take their work forward, they have to link this with census data and map data and so need to find the 2011 census code for each village.
- The village names they have differ in spelling from their names in census records.
- There are some villages having same name, belonging to different talukas. So there is risk of matching the wrong villages having same name but being in different talukas. Just matching the village name fields won't do.. the user needs to review the other column values too for the proposed matching row to choose the right match.
- The village census data for Maharashtra is 44000+ entries long and so filtering it down to district/taluka level when doing the matching is an obvious necessity.
We encountered this situation in the Water Cup 2017 participating villages project. Ravikant used an array of methods to find the census codes, starting with direct name match, then word similarity algorithms, and finally for the few remaining villages that hadn't scored any matches, manual inspection. There could be other groups having a similar task. With some programming we could automate / streamline this process and make it possible for anyone to do it.
References
- https://www.tools4noobs.com/online_tools/string_similarity/
- https://asecuritysite.com/forensics/simstring
- https://github.com/EdwardMGonzalez/dldist
- http://crr.ugent.be/programs-data/fast-computation-of-average-levenshtein-distances-in-python-including-old20
- https://en.wikipedia.org/wiki/Approximate_string_matching
- http://ieeexplore.ieee.org/document/5360098/?reload=true
- http://www.mlsec.org/harry/
- some search strings: "bulk string matching", "levenshtein command line".. suggest more!
Concept diagram:
Update, 20 Aug 2018
Some libraries in JS and Python that can do this: