Author Page Plan - acl-org/acl-anthology GitHub Wiki
From the continued discussion on #623:
The basic new idea is that we keep the current ID system. We pretend the current state of the Anthology is completely disambiguated. Every page currently under /people/ is a person who we assume is known and correct. There is no UUID for unknown people. There are people for whom we do and don't have ORCIDs, but that is just a tool to facilitate ingestion. It is not a first-order concept.
As Marcel noted, every paper currently points to a person, either explicitly (e.g., id=yang-liu-edinburgh) or implicitly (every <author> tag without an ID, where the ID is the name slug). We will make all of these explicit; every paper will look like
<paper>
<author id="matt-post">...
<author id="david-chiang">...
<author id="nathan-schneider-cuboulder">...
Essentially, the baked-in assumption is that in most situations, name == person. This is actually a pretty safe assumption. Of the 371,280 <author> tags in the Anthology, only 1,958 currently have an id attribute: 0.5%. People actually seem to have pretty distinct names.
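As a rough illustration of making implicit IDs explicit, here is a minimal sketch. It assumes each <author> element holds <first> and <last> children, and it uses a simplified, hypothetical `slugify` that is not the Anthology's real slug function; `make_ids_explicit` is likewise an illustrative name.

```python
import re
import unicodedata
import xml.etree.ElementTree as ET


def slugify(name: str) -> str:
    """Simplified, hypothetical slug: fold to ASCII, lowercase, hyphenate."""
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


def make_ids_explicit(paper: ET.Element) -> int:
    """Add an id attribute to every <author> that lacks one.

    Returns the number of tags that were changed.
    """
    changed = 0
    for author in paper.iter("author"):
        if author.get("id"):
            continue  # explicit ID already present, e.g. yang-liu-edinburgh
        first = author.findtext("first", default="")
        last = author.findtext("last", default="")
        author.set("id", slugify(f"{first} {last}"))
        changed += 1
    return changed


# Assumed XML layout: <first>/<last> children inside each <author>.
paper = ET.fromstring(
    "<paper>"
    "<author><first>Matt</first><last>Post</last></author>"
    '<author id="yang-liu-edinburgh"><first>Yang</first><last>Liu</last></author>'
    "</paper>"
)
changed = make_ids_explicit(paper)
ids = [a.get("id") for a in paper.iter("author")]
```

Authors that already carry an explicit ID are left untouched, so running the function twice changes nothing the second time.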
I think things get a lot simpler with this approach. Goals (1) and (2) below are met trivially: we are keeping the current system, which is name-centric, and uses no numeric IDs. We will not use numeric IDs in the future. What we are doing is updating the structures to support it.
We have two core tasks:
- Assigning papers to people. This is easy for any unambiguous name, and hard for ambiguous names. We want to fix that.
- Disambiguating people when we find a mistake. This is possible in our system, but it is cumbersome.
I think it will be helpful to work through every scenario from a workflow perspective. Here, I am careful to use our established terminology, distinguishing a name ("Yang Liu"), its slug (yang-liu), and all the people who use that name (yang-liu, yang-liu-edinburgh, yang-liu-hk, etc.). Every person has an ID, comprising their canonical name slug and optionally some distinguishing decorator (yang-liu, yang-liu-edinburgh, etc.). For the vast majority of Anthology entries, the name slug is the ID.
We have a set of new papers. One of the following situations will apply. A "name match" means the slugged name on the ingesting paper matches one in the Anthology.
1. ORCID match. Easy. Assign the paper to the matched ID.
2. Name match, unambiguous. Easy, since name==person. If the ingested paper has an ORCID, the person is now known, and we add that info to the DB.
3. Name match, ambiguous. Hard. We default to the matching name slug (as we mostly currently do), resolve manually (as we theoretically currently do), or employ some ML. If the ingested paper has an ORCID, we assign it to the chosen name (in this situation, we could also filter out matching names that have ORCIDs).
4. No match. Easy. We create a new person. If the ingested paper has an ORCID, that person is known.
Every situation here is straightforward except (3). However, we also expect it to be rare. (1) should be our most common situation. Anything exported from OpenReview should no longer require manual disambiguation.
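The four ingestion situations can be sketched as a single decision function. This is only an illustration under stated assumptions: the `Person` class, the `resolve_author` name, the in-memory `people` dictionary, and the placeholder ORCIDs are all hypothetical. For the ambiguous case it implements just the "default to the matching name slug" option, not manual resolution or ML.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Person:
    person_id: str
    orcid: Optional[str] = None
    name_slugs: set[str] = field(default_factory=set)  # canonical + variant slugs


def resolve_author(
    people: dict[str, Person], slug: str, orcid: Optional[str] = None
) -> tuple[str, str]:
    """Return (person_id, situation) for one author on an ingested paper."""
    # (1) ORCID match: assign to the matched ID.
    if orcid:
        for person in people.values():
            if person.orcid == orcid:
                return person.person_id, "orcid-match"

    candidates = [p for p in people.values() if slug in p.name_slugs]

    # (4) No match: create a new person (known if an ORCID was supplied).
    if not candidates:
        people[slug] = Person(slug, orcid, {slug})
        return slug, "no-match"

    # (2) Name match, unambiguous: name == person; record the ORCID if new.
    # The plan does not say what to do if this person already has a
    # different ORCID; this sketch simply keeps the existing one.
    if len(candidates) == 1:
        person = candidates[0]
        if orcid and person.orcid is None:
            person.orcid = orcid
        return person.person_id, "name-match-ambiguous" if False else "name-match-unambiguous"

    # (3) Name match, ambiguous: optionally filter out people who already
    # have an ORCID, then default to the person whose ID is the bare slug.
    if orcid:
        candidates = [p for p in candidates if p.orcid is None] or candidates
    chosen = next((p for p in candidates if p.person_id == slug), candidates[0])
    if orcid and chosen.orcid is None:
        chosen.orcid = orcid
    return chosen.person_id, "name-match-ambiguous"


# Hypothetical database; the ORCIDs are placeholders, not real identifiers.
people = {
    "matt-post": Person("matt-post", None, {"matt-post"}),
    "yang-liu": Person("yang-liu", None, {"yang-liu"}),
    "yang-liu-edinburgh": Person(
        "yang-liu-edinburgh", "0000-0000-0000-0001", {"yang-liu"}
    ),
}
```

Note that the ambiguous branch is exactly where the over-assignment risk lives: a paper without an ORCID falls through to the bare slug.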
To support the above, we need a database of every name ID. I suggest a file, data/yaml/people.yaml, which records them in the format Marcel suggested:
person-id:
  - orcid: ...
  - canonical-name: ...
      first: ...
      last: ...
  - variants:
      - first: ...
        last: ...
  - other info: ...
There is no need to split into "known" and "unknown" because this isn't the primary organizational principle. Every ID is a real person; ORCIDs are nice for ingestion, but they are secondary. We might also want to split other ways, e.g., data/yaml/people/{a.yaml,b.yaml,...}, which could help with handling merge requests. We might also record all known name variants for readability and quick loading purposes.
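To make the proposed database concrete, here is a small sketch of how such records might be held in memory and indexed for ingestion lookups. The dictionary layout, the `shard_path` and `build_indices` helpers, the per-letter sharding rule, and every value shown (including the placeholder ORCID and the "Y. Liu" variant) are assumptions for illustration, not part of the plan.

```python
from collections import defaultdict

# Hypothetical in-memory form of two people.yaml records (values are made up).
PEOPLE = {
    "yang-liu": {
        "orcid": None,
        "canonical-name": {"first": "Yang", "last": "Liu"},
        "variants": [],
    },
    "yang-liu-edinburgh": {
        "orcid": "0000-0000-0000-0001",  # placeholder, not a real ORCID
        "canonical-name": {"first": "Yang", "last": "Liu"},
        "variants": [{"first": "Y.", "last": "Liu"}],
    },
}


def shard_path(person_id: str) -> str:
    """Assumed sharding rule: one file per first letter of the person ID."""
    return f"data/yaml/people/{person_id[0]}.yaml"


def build_indices(people: dict) -> tuple[dict, dict]:
    """Build ORCID -> person ID and (first, last) -> {person IDs} lookups."""
    by_orcid = {}
    by_name = defaultdict(set)
    for person_id, record in people.items():
        if record.get("orcid"):
            by_orcid[record["orcid"]] = person_id
        for name in [record["canonical-name"], *record.get("variants", [])]:
            by_name[(name["first"], name["last"])].add(person_id)
    return by_orcid, dict(by_name)


by_orcid, by_name = build_indices(PEOPLE)
```

A name that maps to more than one person ID in `by_name` is exactly an ambiguous name in the sense used above.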
There is an issue with (3): we risk over-assigning papers without an ORCID to the person with the base name slug if we use that default assignment approach. I don't see how this is avoidable. To resolve this, we should push submitters to obtain ORCIDs, and we should build a small ML model that can help with this. It should not be difficult, from paper metadata alone (e.g., coauthors, titles, abstracts). I bet a customized but off-the-shelf LLM query could resolve this for us.
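The plan leaves the model itself unspecified. As a much simpler stand-in (not the ML model or LLM query suggested above), the following sketch ranks candidate people by coauthor overlap using Jaccard similarity. The function name, the input shapes, and the coauthor IDs in the example are all hypothetical.

```python
def rank_candidates(
    new_coauthors: set[str], candidate_coauthors: dict[str, set[str]]
) -> list[tuple[str, float]]:
    """Rank candidate person IDs by Jaccard overlap of coauthor ID sets.

    new_coauthors: person IDs of the other authors on the ingested paper.
    candidate_coauthors: for each candidate person ID, the IDs of everyone
    they have previously published with.
    """
    scores = []
    for person_id, known in candidate_coauthors.items():
        union = new_coauthors | known
        score = len(new_coauthors & known) / len(union) if union else 0.0
        scores.append((person_id, score))
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-item[1], item[0]))


# Made-up coauthor sets for two people sharing the name "Yang Liu".
ranking = rank_candidates(
    {"coauthor-c", "coauthor-e"},
    {
        "yang-liu": {"coauthor-a", "coauthor-b"},
        "yang-liu-edinburgh": {"coauthor-c", "coauthor-d"},
    },
)
```

A score of zero for every candidate would be a reasonable signal to fall back to the default assignment or to manual review.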
Note that, by the process above, the first person to ingest a paper with an ORCID under a name will "lock in" that ID to themselves. We can also backfill many of the papers in the Anthology with ORCIDs that were submitted with ingestion materials for the last several *ACL conferences.
We have the following kinds of disambiguation:
- One person, many names (handled by merging pages)
- One name, many persons (handled by creating a new ID)
One person, many names: Easy. We require the person to obtain an ORCID, for our own purposes. We then update the relevant paper IDs, set the canonical name, and extend the person's list of variant names.
One name, many persons: Easy. This is still a manual process, but note that it's not that common. We require the person to obtain an ORCID, and get their canonical name variant (basically, we fill out the missing info). We then ask them for the institution of their highest degree, and use that to create a custom ID, e.g., (the nonexistent) matt-post-slu. We then update the person IDs on all relevant papers.
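Both disambiguation operations reduce to rewriting person IDs on a chosen set of papers. The sketch below shows that shared step, assuming a hypothetical in-memory mapping from paper IDs to author person IDs; `decorated_id`, `reassign`, the paper IDs, and the `matthew-post` variant are made up for illustration.

```python
def decorated_id(name_slug: str, decorator: str) -> str:
    """Build a custom person ID such as matt-post-slu."""
    return f"{name_slug}-{decorator}"


def reassign(
    papers: dict[str, list[str]], paper_ids: list[str], old_id: str, new_id: str
) -> int:
    """Point the listed papers at new_id instead of old_id.

    Returns the number of papers that were changed.
    """
    changed = 0
    for paper_id in paper_ids:
        authors = papers[paper_id]
        if old_id in authors:
            authors[authors.index(old_id)] = new_id
            changed += 1
    return changed


# Hypothetical paper IDs mapped to their author person IDs.
papers = {
    "paper-a": ["matt-post", "david-chiang"],
    "paper-b": ["matt-post"],
    "paper-c": ["matthew-post"],
}

# One name, many persons: split off the (nonexistent) matt-post-slu.
new_id = decorated_id("matt-post", "slu")
split_count = reassign(papers, ["paper-b"], "matt-post", new_id)

# One person, many names: merge a variant's papers into the canonical ID.
merge_count = reassign(papers, ["paper-c"], "matthew-post", "matt-post")
```

Taking an explicit list of paper IDs mirrors the request form described next, where the person supplies the Anthology IDs of their own papers.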
I think both of these could be handled by a text search-box, similar to our "fix data" button:
- A popup would ask for the name, ORCID, terminal degree institution, and then a list of all Anthology IDs for their papers
- This would get used to create a GitHub issue via a PR template
There is an editorial decision related to a newly-created person ID (e.g., full or abbreviated name of the university). This can just be worked out based on aesthetics. I don't think it matters.
There can still be a conflict if two people with the same name have the same university. I would say we just deal with this if we come across it. We could have one of them choose a different institution.
We want to build a new author page format. This format should
- Center human names rather than IDs
- Fit within our current GitHub + static web page framework
- Make it easy to
  a. Consolidate multiple names under a single author profile
  b. Allow a particular name to be used by multiple authors
- Maximize user-serviceability
- Minimize demands on Anthology staff time
First, some terminology.
- A name is a surface string displayed on a PDF and recorded verbatim in a paper or talk's metadata.
- A person is a human being.
- A person may have multiple names, in which case each is called a variant, one of which is the canonical variant, typically selected by that person for current use.
We have the following problems:
- Name resolution. Names can be ambiguous in multiple ways: a name can be used by multiple people, and a single person can have multiple names. We therefore have an issue at ingestion time of determining which person a given name belongs to.
- Cumbersome disambiguation process. Our current resolution process is cumbersome and time-consuming. The IDs are assigned haphazardly, usually using an individual's Ph.D.-granting institution (e.g., Yang Liu/Edinburgh). The ID is largely hidden in our representation and is inconsistent in display (only some authors have a distinguisher). The process is hard to explain to users.
- Author pages. We have no permanent, consistent landing page for each author that we can point to. There is increasing demand for such a page from both individuals and reviewing bodies.
Each author in the Anthology will have a page with the format aclanthology.org/people/{name}/{identifier}. A person is known to the Anthology when we have their ORCID. This leaves two situations:
- If the person is known to the Anthology, {name} is their canonical name variant and {identifier} is the ORCID.
- Otherwise, {name} is the name found on a paper, and {identifier} is a UUID that is designed to be uglier than an ORCID.
- In certain situations (e.g., for prominent deceased authors), we may choose a more usable UUID (for example, we may wish to use /people/aravind-joshi/1).
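The page-path rule can be sketched in a few lines. This is an assumption-laden illustration: `person_path` is a hypothetical helper, and using a random `uuid4` hex string for unknown people is one possible reading of "a UUID that is designed to be uglier than an ORCID", not a decision recorded in the plan.

```python
import uuid
from typing import Optional


def person_path(name_slug: str, orcid: Optional[str] = None) -> str:
    """Return the site path /people/{name}/{identifier} for one person.

    Known people (those with an ORCID) use the ORCID as the identifier;
    unknown people get a freshly generated 32-character UUID hex string.
    """
    identifier = orcid if orcid else uuid.uuid4().hex
    return f"/people/{name_slug}/{identifier}"


known = person_path("matt-post", "0000-0002-1297-6794")
unknown = person_path("yang-liu")
```

In practice the UUID for an unknown person would need to be generated once and stored, since calling `uuid4` again yields a different value.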
The name page (aclanthology.org/people/{name}) will serve as a "disambiguation page". It will present a list of all known people with that name. It will not display any papers, in order to discourage people from using that page as their main author page.
We will add a mandatory id attribute to every <author> and <editor> tag that identifies the person. This attribute will have the form {name}/{identifier}, where the identifier depends on whether the author is known or unknown.
We will maintain a file, data/yaml/canonical_names.yml, whose keys are identifiers and whose values are canonical name variants, e.g.,
0000-0002-1297-6794: matt-post
In this section, we list how common operations will be carried out.
Ideally, ingestion materials will carry ORCIDs. Here is the process for handling that:
- If the ORCID is new, create an entry for it, mapping to the name's slug.
- If the ORCID exists, find its entry in canonical_names.yml and use that as the identifier.
Note that this needs to be done iteratively within each ingestion, because we could have a new ORCID that maps to multiple variants. In this case, the first name will be adopted arbitrarily as canonical.
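The iterative ORCID handling described above might look like the following sketch. It assumes every ingested author carries an ORCID, represents canonical_names.yml as a plain dictionary, and uses a hypothetical `assign_ids` helper; the second ORCID and the "jane-doe" names are placeholders.

```python
def assign_ids(
    authors: list[tuple[str, str]], canonical_names: dict[str, str]
) -> list[str]:
    """Assign {name}/{identifier} IDs to (name_slug, orcid) pairs in order.

    canonical_names maps ORCID -> canonical name slug and is updated in
    place, so a new ORCID seen twice in one ingestion keeps the first name.
    """
    ids = []
    for slug, orcid in authors:
        if orcid not in canonical_names:
            # New ORCID: the first name seen is adopted as canonical.
            canonical_names[orcid] = slug
        ids.append(f"{canonical_names[orcid]}/{orcid}")
    return ids


canonical_names = {"0000-0002-1297-6794": "matt-post"}
ingested = [
    ("matthew-post", "0000-0002-1297-6794"),  # existing ORCID, variant name
    ("jane-doe", "0000-0000-0000-0001"),      # new ORCID (placeholder)
    ("j-doe", "0000-0000-0000-0001"),         # same new ORCID, second variant
]
assigned = assign_ids(ingested, canonical_names)
```

Because the dictionary is updated as each author is processed, the order of papers within an ingestion determines which variant becomes canonical.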
There really is no disambiguation problem any more. Every paper assigned to a known person should be correct. The name ambiguity problem should be solved. The remaining problem is name grouping, e.g., when a name is assigned to an unknown person via UUID.
Name grouping is when we group multiple names under a single person. This is always the result of an explicit request. In the new setting, we require an ORCID to do so. We can then update the id attributes on all papers and create an entry in canonical_names.yml to reflect the canonical variant.
We would ideally present a nice UI for this. It would be a dynamically updated text search box, similar to address autocompletion. It would allow a user to select all of their papers for grouping together under the new ID. This would take some work to design.
TODO