Using a Named Authority Table - agile-humanities/ddhi-encoder GitHub Wiki
DDHI employs linked open data to enrich its transcription corpus. A named-entity recognizer (spaCy) is used to perform initial tagging of people, places, organizations, and events where they occur in the interview; student encoders correct and extend this initial tagging. The DDHI then uses OpenRefine to link these named entities with entries in the Wikidata knowledge base. Linked data like this can be used to enrich the DDHI’s metadata (by automatically supplying geo-coordinates for known places, for example) and to unify references to common named entities across the corpus.
Wikidata does not, of course, contain entries for every named entity, real or imagined. Because it is an open platform, anyone may add an entity to Wikidata, provided it is notable. But most narrators will refer to people and places that are neither in the Wikidata authority database nor eligible to be placed there. These are names the project must manage itself: encoders must create unique identifiers for the named entities (like Wikidata’s Q-names) that may be mapped to individual occurrences of names in the text of the interview, and that may be used across the corpus whenever that entity is mentioned.
Later iterations of the DDHI Drupal tool will include a named-entity manager, but until it is developed, the DVP will employ a table-based manual method commonly used in libraries to manage name-authority metadata. This is a quick, agile solution, designed to enable the students to continue their work and manage project-specific named entities as quickly as possible.
This method is described below.
The DVP-in-DDHI project will use a shared Google Spreadsheet to maintain the name authority file (NAF). Agile has created the table here. Please do not alter the layout of this table: do not change the names of columns or their order. If you feel you must create additional columns for any reason, please consult with Agile.
While editing a transcript’s name list in OpenRefine, create a new column named dvp_id. Be sure to name it properly.
- from the pull-down menu at the top of the QID column, select “Edit column/add column based on this column…”
- you will be presented with a dialog box. Beside new column name enter dvp_id (no quotation marks, no capital letters); in the box beneath the word expression, enter the word null. Select ok at the bottom of the dialog box to create the column.
Now, inspect the QID column for empty values. (You may sort the column to put all the blank values at the top if you like.) The rows with no QID correspond to names that could not be found in Wikidata (and could not be created there because they do not meet Wikidata’s inclusion criteria). For each of these “unauthorized names”, the students will do the following:
- Has the entity already been assigned a project identifier?
Students will look in the Google spreadsheet authority table. If
the name has already been identified from another interview, the
student will use the project id for that name and enter it into
the dvp_id column.
Things to watch out for. Encoders must be on the alert for ambiguity. The “Jane Smith” in one interview may or may not be the same as the “Jane Smith” or the “Jane” in another. Encoders must read the interviews to discover the context of these mentions in order to resolve ambiguous co-references. Sometimes it will not be possible to resolve an ambiguity: the interviews may not say enough to determine whether two mentions of “Jane” refer to the same person. In those cases, the encoder should assume they refer to different people and create a separate ID for each one.
Encoders should also be careful to avoid false homonyms. The “Jane Smith” in one interview may not be the same person as the “Jane Smith” in another.
- If not, create a new entry in the NAF. The first three columns
are required.
- enter a new identifier. Please use the following procedure to
determine the new identifier:
- find the last-created identifier (the one with the highest number)
- add 1 to it
- enter the new identifier in the ID column
The format of the id should be the following:
dvp_nnnn
where n is a digit. For example:
dvp_0001
- enter the authorized form of the name. Library catalogers have strict and elaborate rules for establishing authorized names; the DVP should establish its own rules and strive to be consistent.
- enter the type of entity being named, using the TEI ontology:
- person
- place
- org
- event
- enter a brief description. A single sentence that provides context for someone encountering the name. E.g., “Dartmouth alumnus, class of 1968” or “Mother of Fred Smith, Dartmouth class of 1968” or “Landing strip near someVillage”
- enter values in the remaining columns as appropriate and known. For example, enter geo-coordinates for places.
- Record the new identifier in the dvp_id column in the OpenRefine sheet.
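The identifier procedure above (find the highest existing number, add 1, format as dvp_nnnn) can be sketched in Python. This is only an illustration, not part of the DDHI tooling: the function name `next_dvp_id` is hypothetical, and it assumes the values of the ID column have already been copied out of the spreadsheet into a list of strings.

```python
import re

# Matches identifiers of the form dvp_nnnn, where n is a digit.
ID_PATTERN = re.compile(r"dvp_(\d{4})")


def next_dvp_id(existing_ids):
    """Return the identifier that follows the highest one in existing_ids.

    Values that do not match dvp_nnnn (a header cell, a blank cell)
    are ignored. With no valid identifiers, numbering starts at dvp_0001.
    """
    numbers = []
    for value in existing_ids:
        match = ID_PATTERN.fullmatch(value.strip())
        if match:
            numbers.append(int(match.group(1)))
    highest = max(numbers, default=0)
    if highest >= 9999:
        # Four digits cannot represent a larger number.
        raise ValueError("no four-digit identifiers left")
    return f"dvp_{highest + 1:04d}"
```

Taking the highest number rather than the last row means the result does not depend on how the spreadsheet happens to be sorted.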
When every item in the OpenRefine sheet has either a QID or a dvp_id, export the project as TSV. Bryan will run a ddhi_encoder script to merge the named-entity data into the TEI transcription.
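Before handing the export over, it may help to confirm that no row was missed. The following is a minimal sketch, assuming the exported TSV has a header row and that the two identifier columns are named QID and dvp_id as described above; the function name and the sample data (including its name column) are made up for illustration.

```python
import csv
import io


def rows_missing_identifier(tsv_text, qid_column="QID", dvp_column="dvp_id"):
    """Return (line_number, row) pairs for rows with neither a QID nor a dvp_id."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    missing = []
    for row in reader:
        qid = (row.get(qid_column) or "").strip()
        dvp = (row.get(dvp_column) or "").strip()
        if not qid and not dvp:
            # reader.line_num is the line of the file just read (header is line 1).
            missing.append((reader.line_num, row))
    return missing


# Made-up sample export: the last row has no identifier at all.
sample = (
    "name\tQID\tdvp_id\n"
    "Douglas Adams\tQ42\t\n"
    "Jane Smith\t\tdvp_0001\n"
    "Fred\t\t\n"
)

for line_number, row in rows_missing_identifier(sample):
    print(f"line {line_number}: {row['name']} has no QID or dvp_id")
```

An empty result means every row carries at least one of the two identifiers.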
Agile will provide a command-line script that may be used to convert this Google spreadsheet into a TEI document (the preferred form).