Match the Following - datameet-pune/datameet-pune.github.io GitHub Wiki

Aim

To match and join two tables where the columns to join might have slight spelling differences.

Detailed Aim

Given two sets of tabular data (left, and right) and a common column specified, to run word-similarity scripts to return most likely matches from right side for each row on left side. With filtering by another column if any to reduce the sample size

Use case scenario

  • An NGO has a table of 1000+ village names from their on-field survey, with taluka (sub-district) and district column and other information regarding their survey.
  • To take their work forward, they have to link this with census data and map data and so need to find the 2011 census code for each village.
  • The village names they have differ in spelling from their names in census records.
  • There are some villages having same name, belonging to different talukas. So there is risk of matching the wrong villages having same name but being in different talukas. Just matching the village name fields won't do.. the user needs to review the other column values too for the proposed matching row to choose the right match.
  • The village census data for Maharashtra is 44000+ entries long and so filtering it down to district/taluka level when doing the matching is an obvious necessity.

We encountered this situation in the Water Cup 2017 participating villages project. Ravikant used an array of methods to find the census codes, starting with direct name match, then word similarity algorithms, and finally for the few remaining villages that hadn't scored any matches, manual inspection. There could be other groups having a similar task. With some programming we could automate / streamline this process and make it possible for anyone to do it.

References

Concept diagram:

image


image


image


image


image


image


Update, 20 Aug 2018

Some libraries in JS and Python that can do this: