ngWS P3 - bavla/biblio GitHub Wiki

Problem 3: Entity identification/resolution

Author names

Lorenzo Bartolini from Letters to Juliet

  • Many ways to write the name. Some databases are trying to standardize the names (DBLP, ZB, ResearcherID). Chinese, 100 names.

    MathSciNet; ORCID - enter the author name in the Search field
    Scopus; eLibrary - click on the author's name
    and take the number after authorid

https://orcid.org/0000-0002-0240-9446
https://elibrary.ru/author_items.asp?authorid=155240
  • Variations in the first names: Sort (last name, first name).

  • Multi-alphabet (names written in different languages) - convert names to selected alphabet or use "dictionary".
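
The two suggestions above (sorting by last name and first name, and converting names to a selected alphabet) can be sketched in Python. This is a minimal illustration, not a complete transliteration scheme: the helper names `fold_to_ascii` and `sort_key` are hypothetical, and the folding only removes diacritics from Latin letters. Names in other scripts would still need a "dictionary".

```python
import unicodedata

def fold_to_ascii(text):
    """Strip diacritics: decompose characters, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def sort_key(name):
    """Build a (last name, first name) key from a 'Last, First' string."""
    last, _, first = name.partition(",")
    return (fold_to_ascii(last).strip().lower(),
            fold_to_ascii(first).strip().lower())

names = [
    "Mankoč Borštnik, Norma Susana",
    "Smith, John W.",
    "Mankoc-Borstnik, N.S.",
    "Mankoč Borštnik, N.",
]

# Sorting by the folded key places likely variants of one author side by side,
# which makes the manual inspection easier.
for name in sorted(names, key=sort_key):
    print(sort_key(name), "<-", name)
```

Variants that differ only in punctuation (space versus hyphen) still get different keys here, so the sorted list is an aid for visual inspection rather than an automatic resolver.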

AMS approach

Journal names and Books

  • Form a key from the initials of the journal name and sort (key, journal name)

  • Books data
    International Standard Serial Number ISSN;
    Digital Object Identifier DOI;
    International Standard Book Number ISBN.
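
The key-from-initials idea can be sketched as follows. The function name `initials_key` is hypothetical, and the journal strings are abbreviations of the kind found in WoS references. Equal keys only suggest candidates for the same journal; they do not prove it.

```python
from collections import defaultdict

def initials_key(journal_name):
    """Key made of the first letter of each word, upper-cased."""
    return "".join(word[0] for word in journal_name.upper().split())

journals = [
    "NUCLEIC ACIDS RES",
    "Q J ROY METEOR SOC",
    "NUCL ACIDS RES",
    "Nucleic Acids Res",
    "QUART J ROY METEOR S",
    "NUCL AC RES",
]

# Sort the (key, journal name) pairs and collect names that share a key
groups = defaultdict(list)
for key, name in sorted((initials_key(j), j) for j in journals):
    groups[key].append(name)

for key, variants in groups.items():
    print(key, variants)
```

Each group then has to be checked by hand, ideally against the ISSN, before the variants are merged.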

Keywords

Provided in data or extracted from the text (title, abstract). Key phrases.

  • Errors (typos) in the database -- correct them in your copy of the data.

Synonyms, homonyms

  • Synonymy: Unit names meaning the same.
    Different names for the same person: Otfried Cheong; Michael Deza

  • Homonymy: Same unit names having different meanings.
    Same names for different people: authors

How to deal with:

  • normalization - at data entry

  • standardization - use standards whenever possible

  • "dictionaries"
    When the unit names are extracted from the text, the so-called stopwords are omitted. The equivalence of the remaining words is determined automatically using stemming or lemmatization.
    lemmatization lists / dictionaries
    python template
    keywords - keyword recommendations

  • for synonymy: sort labels of units, manually/visually identify equivalent units, create partition, (shrink) equivalent units.

  • for homonymy: correct the data in your copy of the database.

  • ISI names: The usual ISI name of a work (field CR in WoS)

LEFKOVITCH LP, 1985, THEOR APPL GENET, V70, P585   

has the following structure

AU + ', ' + PY + ', ' + SO[:20] + ', V' + VL + ', P' + BP

In WoS the same work can have different ISI names. To improve the precision, the program WoS2Pajek also supports short names. They have the format:

LastNm[:8] + ' ' + FirstNm[0] + '(' + PY + ')' + VL + ':' + BP

For example: LEFKOVIT L(1985)70:585
In last names with prefixes (VAN, DE, ...) the space after the prefix is deleted.
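
The two formats can be illustrated with a short Python sketch. It is not the WoS2Pajek implementation. The function names `isi_name` and `short_name` are hypothetical, the parsing assumes comma-separated fields with volume and page marked by `V` and `P`, and the prefix tuple holds only the two prefixes named above.

```python
import re

# Only the prefixes named in the text; a real list would be longer (assumption).
PREFIXES = ("VAN", "DE")

def isi_name(au, py, so, vl, bp):
    """AU, PY, SO truncated to 20 characters, volume and beginning page."""
    return f"{au}, {py}, {so[:20]}, V{vl}, P{bp}"

def short_name(name):
    """Convert an ISI name into 'LastNm[:8] F(PY)VL:BP'."""
    fields = [field.strip() for field in name.split(",")]
    author, year = fields[0], fields[1]
    # Volume and page are optional; an absent field gives an empty string
    volume = next((f[1:] for f in fields[2:] if re.fullmatch(r"V\d+", f)), "")
    page = next((f[1:] for f in fields[2:] if re.fullmatch(r"P\d+", f)), "")
    parts = author.upper().split()
    # Join one leading prefix to the last name: 'VAN DIJK J' -> 'VANDIJK J'
    if len(parts) > 2 and parts[0] in PREFIXES:
        parts[0:2] = [parts[0] + parts[1]]
    last = parts[0]
    first = parts[1][0] if len(parts) > 1 else ""
    return f"{last[:8]} {first}({year}){volume}:{page}"

full = isi_name("LEFKOVITCH LP", 1985, "THEOR APPL GENET", 70, 585)
print(full)
print(short_name(full))
```

Because the short name keeps only eight letters of the last name and one initial, spelling variants that differ late in the last name or in the second initial collapse to the same label.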

The best/final solution is to enter the data into the bibliographic database in a standardized way, resolving homonyms at data entry.

From ESNAM 2

Another problem that often occurs when defining the set of nodes is the identification (entity resolution, disambiguation) of nodes. The unit corresponding to a node can have different names (synonymy), or the same name can denote different units (homonymy or ambiguity). For example, in a bibliography on mathematics from Zentralblatt MATH, the names Borštnik, N. S. Mankoč; Mankoč Borštnik, N.; Mankoč-Borštnik, Norma; Mankoč Borštnik, Norma Susana; Mankoc-Borstnik, N.S.; and Mankoč Borštnik, N.S. belong to the same author. On the other hand, in Zentralblatt MATH at least two different authors named Smith, John W. are recorded, because publications under this name span the years 1868 to 2007. There are at least 623 different mathematicians with the name Zhang, Li in the MathSciNet Database. Its editors have been trying hard, since 1985, to resolve the author identification problem (Martin et al. 2013) during the data entry phase.

From ZB

appeared twice or even more times under different names in the network WA because of only the partial unification of their names. We made a partition of the set of authors by collecting different appearances of the same author. For example, O’Regan, Donal is once written as oregan.donal and another time as o’regan.d. This author has the ZB-unified name oregan.donal, but sometimes his unified name is not written, and in such cases our program for the data conversion creates it from the full author’s name O’Regan, Donal and gets the unified-like name o’regan.d. Another author with a similar problem is Pečarić, Josip E. His unified ZB name is pecaric.josip-e; sometimes the unified name is not written and we get pecaric.j and pecaric.j-e because of two different writings of his full name: Pečarić, J. and Pečarić, J. E. Yet another source of problems is the writing of Eastern European surnames: Krachkovskij, A. P. and Krachkovskii, A. P. probably represent the same author.

The partition of the authors’ names solved the unification problem only partially. We also used the AMS identification of authors (TePaske-King and Richert 2014) for help with the unification problem. All the following analyses were made after the additional unification of different appearances of the same author names. We also solved the problem with journals. Different names of the same journal were replaced by a single name: from 3158 journal names we obtained 2665 unique journal names.
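
The partition-and-shrink step can be sketched in Python. The `representative` mapping below is a hypothetical, hand-made partition using the name variants quoted above, and the (work, author) pairs are invented for illustration. The original analysis used its own conversion programs, which are not shown in the source.

```python
# Hypothetical partition: each variant is mapped to a chosen representative.
representative = {
    "o'regan.d": "oregan.donal",
    "pecaric.j": "pecaric.josip-e",
    "pecaric.j-e": "pecaric.josip-e",
}

def canonical(name):
    """Return the representative of a name; unlisted names stand for themselves."""
    return representative.get(name, name)

# Invented (work, author) pairs of a small works x authors network WA
wa = [
    ("w1", "oregan.donal"),
    ("w2", "o'regan.d"),
    ("w3", "pecaric.j"),
    ("w3", "pecaric.j-e"),  # the same author entered twice for one work
    ("w4", "pecaric.josip-e"),
]

# Shrinking: replace each author by the representative and drop duplicate links
shrunk = sorted({(work, canonical(author)) for work, author in wa})
authors = sorted({author for _, author in shrunk})

print(shrunk)
print(authors)
```

The mapping itself still has to come from manual inspection or from an external identifier such as the AMS author identification.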

From our 2014 book

Another problem occurring often when defining the set of vertices is the identification of nodes. The unit corresponding to a vertex can have different names (synonymy), or the same name can denote different units (homonymy or ambiguity). For example, in the BibTeX bibliography from the Computational Geometry Database (Jones 2002) the same author appears under seven different names: R.S. Drysdale, Robert L. Drysdale, Robert L. Scot Drysdale, R.L. Drysdale, S. Drysdale, R. Drysdale, and R.L.S. Drysdale. Insider information is needed to decide that Otfried Schwarzkopf and Otfried Cheong are the same person. At the other extreme, there are at least 57 different mathematicians with the name Wang, Li in the MathSciNet Database (TePaske-King and Richert 2001). Its editors have tried hard, since 1985, to resolve the author identification problem during the data entry phase. In the future, the problem could be eliminated by general adoption of initiatives such as ResearcherID or ORCID.

Similarly, in the WoS work references we find the following journal names: NUCLEIC ACIDS RES, NUCL ACIDS RES, NUCLEIC ACIDS RES S, NUCLEIC ACIDS RES S2, NUCL ACID RES, NUCL ACIDS RES S2, NUCL ACIDS S SER, NUCL ACIDS RES S, NUCL AC RES, NUCLEIC ACIDS RES S1, Nucleic Acids Res, NUCL ACIDS RES S1 or Q J R MET SOC, Q J R METEOROL SOC, Q J ROY METEOR SO S1, Q J ROY METEOR SOC, Q J ROY METEOR SOC B, QUART J ROY METEOR S, QUART J ROY METEOROL, QUART J ROY METEOROL SOC, QUART J ROYAL METEOR. The immediate issue with all of these names is whether they denote the same journal or a small set of journals. There exists the International Standard Serial Number (ISSN 2013), an international system for the identification of serial publications and other continuing resources. The problem is that this convention is not followed in the WoS list of work references. In resolving the journal identification problems, it is possible to use the Global Serials Directory (Ulrichsweb 2013), Journal Abbreviation Sources (JAS 2013), and many other services and data sources.

The identification problem also appears when the units are extracted from plain text parts of documents. In producing keywords from the title or abstract of a work, the unimportant ‘stopwords’ must be eliminated first. The remaining (real) terms (words or phrases) are usually standardized by replacing them by a ‘canonical’ representative. For example, the terms ‘function’, ‘map’, ‘mapping’, and ‘transformation’ in the mathematics literature can be considered as equivalent terms. A similar problem is having equivalent terms from multilingual sources. To resolve this problem it is necessary to provide lists of equivalent terms or dictionaries. Yet another source of identification problems stems from the grammar rules of the language used in a specific document. For example, the action ‘go’ can appear in the text in a variety of different forms including ‘go’, ‘goes’, ‘gone’, ‘going’, and ‘went’. Resolving these grammar problems requires the use of stemming or lemmatization procedures from natural language processing toolkits such as NLTK (Bird et al. 2009; Perkins 2010) or MontyLingua (Liu 2004).
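
The keyword pipeline described here (drop stopwords, reduce grammatical forms, replace synonyms by a canonical term) can be sketched in plain Python. The three small dictionaries are illustrative assumptions standing in for full stopword lists and for a real stemmer or lemmatizer such as those in NLTK. The function name `keywords` is hypothetical.

```python
import re

# Tiny illustrative lists; real work would use complete stopword and lemma lists
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "is", "for"}
LEMMAS = {
    "goes": "go", "gone": "go", "going": "go", "went": "go",
    "mappings": "mapping", "transformations": "transformation",
}
CANONICAL = {"map": "function", "mapping": "function", "transformation": "function"}

def keywords(text):
    """Return the distinct canonical terms of a text, in order of appearance."""
    result = []
    # Only ASCII letters are matched; accented words would need a wider pattern
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in STOPWORDS:
            continue
        word = LEMMAS.get(word, word)        # grammatical form -> lemma
        word = CANONICAL.get(word, word)     # equivalent term -> canonical term
        if word not in result:
            result.append(word)
    return result

print(keywords("A fixed point theorem for mappings and transformations"))
```

Keeping the lemma table and the table of equivalent terms separate mirrors the two sources of variation named in the text: grammar and synonymy.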

Cleaning the data

Given the construction of WoS files, there are errors which make it necessary to clean these data. The directed centrality citation network obtained by using the above procedures is labeled central.net. Getting the basic characteristics of the network, after it is read into Pajek, is done with the following commands:

Network / Info / General / Input 1 or 2 numbers [0] / OK

This network, for the 2010 version, has 548,600 vertices linked by 996,962 arcs. By definition, it cannot have loops (self citations) nor multiple lines (multiple citations). However, networks obtained from WoS can have both problems, because different articles can get the same WoS name. For example, part of the original WoS data contains:

GRANOVET.MS, 1973, AM J SOCIOL, V78, P1360
GRANOVETTER M, 1983, SOCIOLOGICAL THEORY, V1, P203

BORGATTI SP, 2002, UGINET WINDOWS SOFTW
BORGATTI S, 1999, UCINET V USERS GUIDE

CANTANZARO M, 2005, PHYS REV E, V71, UNSP 027103
CANTAZARO M, 2005, PHYS REV E, V71, UNSP 056104
CATANZARO M, 2005, PHYS REV E 2, V71, ARTN 056104

In all three groups, the name of the first author is written differently. In the final trio of names, the last pair show how one article can have different ISI names. Also, the use of the same short author name can represent different articles creating loops and multiple arcs in a citation network. These problems can be partially resolved in WoS2Pajek by introducing short names of articles. Details for doing this are provided in the WoS2Pajek manual. In principle, most of these inconsistencies can be detected and repaired, but this is a very time-consuming task, especially for networks of this size. We did some of this for the WoS data. In doing so, it was necessary to make a trade-off between the time taken and obtaining ‘clean’ data. It was useful to take a shortcut: these inconsistencies were considered as noise by removing the loops and transforming multiple arcs to single arcs.
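
The shortcut of treating the inconsistencies as noise amounts to two operations on the arc list: dropping loops and collapsing repeated arcs. The sketch below shows this in Python on an invented arc list. It stands in for, and is not a transcript of, the corresponding Pajek operations, and the function name `simplify` is hypothetical.

```python
def simplify(arcs):
    """Drop loops (u, u) and keep one copy of each repeated arc, preserving order."""
    seen = set()
    cleaned = []
    for u, v in arcs:
        if u == v or (u, v) in seen:
            continue
        seen.add((u, v))
        cleaned.append((u, v))
    return cleaned

# Invented citation arcs: one multiple arc (A, B) and one loop (B, B)
arcs = [("A", "B"), ("A", "B"), ("B", "B"), ("B", "C"), ("C", "A")]
print(simplify(arcs))
```

This removes the symptoms only: works that were wrongly merged under one name remain merged, which is the trade-off between time and clean data described above.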