EmailMatching - renepickhardt/related-work.net GitHub Wiki
Email Author Matching
For our website related-work.net we have extracted email addresses from a number of research papers. For each of those papers we also have a list of author names, but we don't know which of the extracted addresses belongs to which author. In this note we explain a possible approach to this matching problem.
Notation
Before we analyze the problem in more detail let us fix some notation.
- For a paper $p$ we denote by $A(p)$ the set of author names of $p$ and by $M(p)$ the set of email addresses found in $p$.
- For a given email address $m$ we denote by $P(m)$ the set of papers which mention the address $m$.
- For a given author $a$ we denote by $P(a)$ the set of papers which are authored by $a$.
- We denote the total set of all papers by $PP$, the set of found email addresses by $MM$ and the set of all mentioned author names by $AA$.
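The notation above can be made concrete with a small Python sketch. It assumes each paper is stored as a pair of sets, the author names $A(p)$ and the extracted addresses $M(p)$; the `papers` dictionary, the sample names and addresses, and the helper `build_indexes` are hypothetical and only serve as illustration.

```python
from collections import defaultdict

# Hypothetical toy data: paper id -> (A(p), M(p)).
papers = {
    "p1": ({"Alice Smith"}, {"alice@example.org"}),
    "p2": ({"Alice Smith", "Bob Jones"}, {"alice@example.org", "bob@example.org"}),
    "p3": ({"Bob Jones", "Carol Lee"}, {"bob@example.org"}),
}

def build_indexes(papers):
    """Return (P_of_m, P_of_a): papers mentioning each address, papers authored by each name."""
    P_of_m = defaultdict(set)
    P_of_a = defaultdict(set)
    for p, (authors, mails) in papers.items():
        for a in authors:
            P_of_a[a].add(p)
        for m in mails:
            P_of_m[m].add(p)
    return dict(P_of_m), dict(P_of_a)

P_of_m, P_of_a = build_indexes(papers)
PP = set(papers)   # all papers
MM = set(P_of_m)   # all extracted addresses
AA = set(P_of_a)   # all mentioned author names
```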
Observations
- If a paper $p$ has a single author $a$, then all mentioned email addresses $M(p)$ belong to $a$.
- If two papers $p,q$ have a single common author, $A(p) \cap A(q)=\{a\}$, then all addresses in the intersection $M(p) \cap M(q)$ belong to $a$.
And so on for triple intersections etc.
Reformulation
If we assume that no two authors have the same email address, we can reformulate our problem as follows:
Find a map $f: MM \to AA$ which maps each email address to the corresponding author name. In particular, for each paper $p$ all the mentioned email addresses $M(p)$ are mapped to the authors $A(p)$: \[ f(M(p)) \subset A(p). \]
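The condition $f(M(p)) \subset A(p)$ can be checked mechanically for any candidate map. The following is a minimal sketch, assuming papers are stored as pairs of sets and $f$ as a dictionary from address to author name; the data, the names `good` and `bad`, and the function `is_consistent` are hypothetical.

```python
# Hypothetical toy data: paper id -> (A(p), M(p)).
papers = {
    "p1": ({"Alice Smith"}, {"alice@example.org"}),
    "p2": ({"Alice Smith", "Bob Jones"}, {"alice@example.org", "bob@example.org"}),
}

def is_consistent(f, papers):
    """Return True if f maps every address in M(p) to a name in A(p), for all papers p."""
    for authors, mails in papers.values():
        for m in mails:
            # An unassigned address yields None, which is never an author name.
            if f.get(m) not in authors:
                return False
    return True

good = {"alice@example.org": "Alice Smith", "bob@example.org": "Bob Jones"}
bad = {"alice@example.org": "Bob Jones", "bob@example.org": "Alice Smith"}

print(is_consistent(good, papers))
print(is_consistent(bad, papers))
```

Note that this is only a necessary condition: a map can satisfy it and still assign an address to the wrong co-author.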
Matching heuristic
Now let us look at a single address $m$. Which author $a$ should we assign to $m$?
Well, $a$ has to be a common author of all papers mentioning $m$: \[ a \in \bigcap_{p \in P(m)} A(p) =: A(m). \]
One of the following cases occurs:
1. The set $A(m)$ is empty.
2. The set $A(m)$ contains precisely one element.
3. The set $A(m)$ contains more than one element.
In case 2 we are happy and map $m$ to the author $a$. This covers all cases from the observations above.
In an ideal world, case 1 does not occur, since no two authors should have the same address. However, it might happen that an author's name is misspelled, that an author changed their name but not their email address or, heaven forbid, that two authors do indeed use the same email account.
In case 3 there are several authors who authored all papers mentioning the address $m$, and we have to use additional heuristics in order to make an educated guess.
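The case distinction can be sketched in a few lines of Python. This is an illustration under the assumption that papers are stored as pairs of sets $(A(p), M(p))$; the toy data and the function names `candidate_authors` and `classify` are hypothetical.

```python
# Hypothetical toy data: paper id -> (A(p), M(p)).
papers = {
    "p1": ({"Alice Smith"}, {"alice@example.org"}),
    "p2": ({"Alice Smith", "Bob Jones"}, {"alice@example.org", "bob@example.org"}),
    "p3": ({"Bob Jones", "Carol Lee"}, {"shared@example.org"}),
    "p4": ({"Dave Kim"}, {"shared@example.org"}),
}

def candidate_authors(m, papers):
    """A(m): intersection of A(p) over all papers p that mention m."""
    author_sets = [authors for authors, mails in papers.values() if m in mails]
    if not author_sets:
        return set()  # m is not mentioned anywhere
    return set.intersection(*author_sets)

def classify(m, papers):
    """Return (case number, A(m)) following the three cases in the text."""
    A_m = candidate_authors(m, papers)
    if not A_m:
        return 1, A_m
    if len(A_m) == 1:
        return 2, A_m
    return 3, A_m

for m in ["alice@example.org", "bob@example.org", "shared@example.org"]:
    print(m, classify(m, papers))
```

Here `alice@example.org` falls into case 2, `bob@example.org` (mentioned only in a two-author paper) into case 3, and `shared@example.org` into case 1.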
Treating case 1
Instead of looking at the intersection, we can look for candidates in the union: \[ a \in \bigcup_{p \in P(m)} A(p) =: B(m). \]
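The union $B(m)$ is easy to compute alongside a count of how often each name occurs. The sketch below assumes the same pair-of-sets representation of papers as before. Ranking the candidates by the number of papers they share with $m$ is an assumption made here for illustration, not something prescribed by the text, and the data and the function `union_candidates` are hypothetical.

```python
from collections import Counter

# Hypothetical toy data with a misspelled author name in p3.
papers = {
    "p1": ({"Alice Smith"}, {"alice@example.org"}),
    "p2": ({"Alice Smith", "Bob Jones"}, {"alice@example.org"}),
    "p3": ({"Alice Smyth"}, {"alice@example.org"}),
}

def union_candidates(m, papers):
    """Count, for each name in B(m), how many papers mentioning m list that name."""
    counts = Counter()
    for authors, mails in papers.values():
        if m in mails:
            counts.update(authors)
    return counts

counts = union_candidates("alice@example.org", papers)
B_m = set(counts)                    # the union B(m)
best_guess = counts.most_common(1)   # assumed heuristic: most frequent name first
print(B_m, best_guess)
```

In this example $A(m)$ is empty because of the misspelling, while the most frequent name in $B(m)$ is still a plausible guess.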