Author Page Plan - acl-org/acl-anthology GitHub Wiki
First, some terminology:
- A name is a surface string displayed on a PDF and recorded in a paper's metadata.
- A person is a human being that may have published under one or more names. When we use the word author, we always mean a person.
- A (person) ID is an internal identifier that we use to refer to a person, and which we expose via the author page URLs. Identifiers are currently of the form `{first name}-{last name}-{optional disambiguator}`, where our policy is to use the person's institution of highest degree as the disambiguator, if needed (e.g. `yang-liu-blcu`).
- Name resolution. In our metadata, we currently record names (and sometimes affiliation), but on the website we want to expose pages for persons. It's not trivial to map between names and persons: a name can be used by multiple persons, and a single person can have published under multiple names.
- Disambiguation process. At the time of writing, there are 389 open issues for corrections to author pages. We have a backlog up to April 2024 (= 14 months). The current process is time-consuming and cumbersome because we need to manually assign IDs to disambiguate an author, and sometimes also make decisions for other instances of the name(s) of the affected person.
- Author pages. Our current landing pages for authors are based on their internal IDs (usually a slug of their name), but these can change if we need to disambiguate authors. This means there is no (guaranteed-to-be) permanent, consistent landing page for each author that we can point to. There is increasing demand for such a page from both authors and reviewing bodies.
- ORCID-centered system for identifying persons. We want to move to using ORCIDs as the main piece of information for disambiguating persons. ARR has been collecting ORCIDs via OpenReview for a while, but we have not recorded this information so far. The main advantage is that ORCIDs correspond to a person, so at ingestion time, we do not have to heuristically and/or manually resolve names (provided they come with an ORCID).
- Persistent, name-centered URLs for authors. Author pages for people whose identity has been established (e.g. through an ORCID) should have persistent URLs. We want to continue to center human names, rather than numerical identifiers, in our author page URLs. Besides keeping the "human element", this also helps with backward compatibility.
- Backward compatibility. We want, to the extent possible, to not break existing URLs that people are already using e.g. in OpenReview. We have also already manually disambiguated many authors based on previous requests, and we do not want to lose this information even if those disambiguations did not record ORCIDs for those authors.
Our XML metadata currently records names, which our Python library resolves to persons. Some key points of the current functionality are:
- We use heuristics¹ to automatically map "similar" names to the same person. For example, the names "Ludek Muller" and "Luděk Müller" are automatically resolved to the same person. This is done based on the empirical observation that e.g. names with diacritics do not always use them consistently in our metadata, but in our experience usually refer to the same person.
- We have a YAML file for recording known name variants that is used to map multiple names to the same person.
- We can manually assign IDs in our metadata to disambiguate persons with the same name.
While this provides the functionality we need to disambiguate authors, it is hard to reason about because of the different mechanisms at play, cumbersome to work with, and does not make use of ORCIDs. (Seven ORCIDs are currently recorded in the name variants file, but they are not currently used for anything.)
¹Concretely, we map names that produce the same slug to a single person.
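To illustrate the slug heuristic from the footnote, here is a minimal sketch of how two spellings of a name can collapse to the same slug. The function `name_slug` is hypothetical and is not the library's actual slug implementation, which may normalize names differently; the sketch only assumes that diacritics are stripped and the result is lowercased and hyphenated.

```python
import re
import unicodedata


def name_slug(first: str, last: str) -> str:
    """Hypothetical slug function: strip diacritics, lowercase, hyphenate."""
    full = f"{first} {last}"
    # NFKD decomposes e.g. "ü" into "u" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", full)
    # Drop the combining marks so only the base letters remain.
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Collapse every run of non-alphanumeric characters into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", base.lower()).strip("-")


# Both spellings from the example above end up with the same slug.
print(name_slug("Ludek", "Muller"))
print(name_slug("Luděk", "Müller"))
```

Under this assumption, any two names with equal slugs are treated as the same person, which is what makes the heuristic convenient but also hard to reason about.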
- Explicit distinction between verified and unverified persons. We introduce the following conceptual distinction:
  - A verified person is a person whose identity has been established, either through an ORCID or through manual disambiguation that we have performed in the past.
  - An unverified person is a person that has been inferred automatically from a name; this implies that there might be mistakes: the name might actually belong to another (verified or unverified) person, or the different instances of that name might actually belong to different persons.
- Explicit IDs in the XML for all verified authors, and only verified authors. This makes it clear in our XML which authors are verified. It is slightly different from our current system in that
  - we so far haven't always assigned an explicit ID when mapping 2+ names to a single person (as this could be expressed in `name_variants.yaml` without requiring an explicit ID), and
  - we have sometimes assigned explicit IDs to unverified authors to distinguish them from verified ones (for example, this Dan Zhang).
- New database for recording all verified persons. This database replaces our current `name_variants.yaml`. We will store all verified authors in a YAML file (e.g. `people.yaml`) with a format like the following:

      {person-id}:
        - orcid: ...
          names:
            - first: ...
              last: ...
            - first: ...
              last: ...
          other info: ...

  The `{person-id}` corresponds to the ID we record on individual author tags in the XML. The first entry in the `names` list will be considered the canonical name, which is the name that appears at the top (and in the title) of the author page. The `orcid` entry will be used at ingestion time to assign papers to authors in this database.

  Our long-term goal is to have as many authors as possible be "verified" and thereby be explicitly recorded in the `people.yaml` file, together with an ORCID whenever possible.
- We still need some kind of name resolution. When an explicit ID is recorded in the XML, the person is assumed to be verified and must appear in `people.yaml`, so there is no ambiguity. However, we will still need to implicitly resolve names to persons when no explicit ID is recorded for them, because:
  - It won't be feasible to verify all authors in all existing proceedings.
  - Non-ACL venues might continue to deliver new material to us without ORCID information, in which case we cannot link any authors there to explicit IDs.
  - Many authors will have a mix of papers with and without IDs. Authors that appear with an ORCID in new ingestion materials will be considered verified and get an explicit ID, but many of them will have existing publications in the Anthology without an explicit ID.
This means that there is a trade-off to be made here: We want to use ORCIDs to disambiguate persons, but we also do not want to break things for most people by putting their papers with and without ORCIDs on separate pages. This motivates the following approach.
When we encounter an author without an explicit ID, we need to resolve their name to a person. To do this, we match their name against the verified persons in `people.yaml`. "Matching" will be done by comparing name slugs.² How we resolve the name to a person depends on the outcome of this matching:
- If there is no match, we create a new unverified person instance. We derive an ID for this person of the format `unverified/{name-slug}`. The slash makes it clear that this ID was implicitly created, because slashes are disallowed in explicit IDs.
- If there is exactly one match, there are two possibilities:
  - By default, we resolve this name to the existing, matching person in `people.yaml`. This will prevent breaking things for people whose name is currently not ambiguous, i.e. most people. However, as it has the potential to introduce mistakes (= the paper does not actually belong to that person), we will visually indicate on the author page that this particular paper-to-author assignment is not verified.
  - If the existing, matching person in `people.yaml` has a flag set (`disable_name_matching: true`), we resolve the name to `unverified/{name-slug}` instead. This will prevent mistakenly assigning papers to this person if we already know that there is at least one other person with their name. In other words, it will allow an author to tell us which of the papers are actually theirs, and we will only need to assign an explicit ID to those papers, without requiring us to ID the others (since the latter is a lot of work and requires us to find out the institution of highest degree, as per our ID format). It will then also prevent future publications from those "other" persons with that name from showing up on the verified person's page.
- If there is more than one match, we always resolve the name to `unverified/{name-slug}`. Again, this will prevent mistakenly assigning a paper to someone who is already known to have an ambiguous name.
²A note on name matching: Using slugs as the basis for matching is closest to what we currently do, and will keep conflating e.g. "Muller" and "Müller" automatically. Alternatively, we could do the matching based on names, which would remove the heuristics and make it easier to infer what is happening, but this would change some of the current author pages by e.g. inferring two different persons for "Ludek Muller" and "Luděk Müller".
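The three-way name resolution logic described above can be sketched as follows. This is only an illustration under stated assumptions: `VerifiedPerson` and `resolve_name` are hypothetical names, not part of the existing Python library, and each verified person is assumed to carry the set of slugs of all their recorded names.

```python
from dataclasses import dataclass


@dataclass
class VerifiedPerson:
    """Hypothetical stand-in for one entry in people.yaml."""
    person_id: str
    name_slugs: set[str]
    disable_name_matching: bool = False


def resolve_name(slug: str, people: list[VerifiedPerson]) -> str:
    """Resolve the slug of a name without an explicit ID to a person ID."""
    matches = [p for p in people if slug in p.name_slugs]
    # Exactly one match and name matching not disabled: use that person.
    if len(matches) == 1 and not matches[0].disable_name_matching:
        return matches[0].person_id
    # No match, several matches, or matching disabled: unverified person.
    return f"unverified/{slug}"


people = [
    VerifiedPerson("yang-liu", {"yang-liu"}),
    VerifiedPerson("yang-liu-edinburgh", {"yang-liu"}),
    VerifiedPerson("marcel-bollmann", {"marcel-bollmann"}),
    VerifiedPerson("zhao-xinping", {"zhao-xinping"}, disable_name_matching=True),
]
print(resolve_name("marcel-bollmann", people))  # single match
print(resolve_name("yang-liu", people))         # ambiguous name
```

In every branch the paper-to-author assignment itself remains unverified, since the paper has no explicit ID; the sketch only decides which page the paper lands on.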
The approach proposed above means that we can end up with both verified and unverified person instances for the same name, e.g., a `yang-liu` who is explicitly defined in `people.yaml`, and an `unverified/yang-liu` who was implicitly created from a paper that has the name "Yang Liu" without an explicit ID.
We propose that:
- An `unverified/yang-liu`'s papers will appear under a URL like `https://aclanthology.org/people/unverified/yang-liu/`.
- A verified `yang-liu`'s papers will appear under a URL like `https://aclanthology.org/people/yang-liu/`.
- If there is only an `unverified/yang-liu` without an explicit, verified `yang-liu` in our data (= the "no match" case in the proposed name resolution logic above), `https://aclanthology.org/people/yang-liu/` will be a temporary redirect to `https://aclanthology.org/people/unverified/yang-liu/`. This means that existing URLs will continue to function even if the author is not in our `people.yaml` file, but redirecting to `/unverified/` URLs could discourage authors from using such URLs on e.g. OpenReview, and encourage them to provide us with an ORCID and verify that the papers are actually theirs.
This approach means that author page URLs that do not include or redirect to an `/unverified/` URL are guaranteed to be persistent, while `/unverified/` URLs may disappear or change at any time.
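As a rough sketch of the URL scheme proposed above, the following function derives the set of author pages and the temporary redirects from the verified IDs and the unverified slugs. The function name `plan_author_urls` is hypothetical, and the sketch assumes that a redirect is created whenever a slug has no explicit ID of the same name.

```python
def plan_author_urls(
    verified_ids: set[str], unverified_slugs: set[str]
) -> tuple[set[str], dict[str, str]]:
    """Return (page paths, temporary redirects) for author pages."""
    # Verified persons get persistent pages directly under /people/.
    pages = {f"/people/{pid}/" for pid in verified_ids}
    # Unverified persons get pages under /people/unverified/.
    pages |= {f"/people/unverified/{slug}/" for slug in unverified_slugs}
    # Keep old URLs working: redirect only if no explicit ID claims the slug.
    redirects = {
        f"/people/{slug}/": f"/people/unverified/{slug}/"
        for slug in unverified_slugs
        if slug not in verified_ids
    }
    return pages, redirects


pages, redirects = plan_author_urls(
    verified_ids={"yang-liu"},
    unverified_slugs={"yang-liu", "marcel-bollmann"},
)
print(sorted(pages))
print(redirects)
```

Here `/people/yang-liu/` is a persistent page, while `/people/marcel-bollmann/` only redirects and could change once a verified `marcel-bollmann` is created.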
At ingestion time, we have a set of new papers, each with a set of authors. One of the following situations will apply:
- ORCID in the ingestion material, matches an ORCID in our `people.yaml`. We assign the paper to the ID with the matching ORCID, adding it explicitly to the XML with an `id="..."` attribute.
- ORCID in the ingestion material, no match in our `people.yaml`. We create a new entry in `people.yaml` with the name and ORCID, generating a new ID for it that we add to the XML. To generate the ID:
  - We will generate the name slug and use this as the ID if it doesn't exist as an explicit ID yet. This is important for the proposed name resolution logic to "not break things" for most authors.
  - Otherwise, we will use the name slug plus the last four characters of the ORCID to generate the ID, e.g. `yang-liu-045X`. The author may be able to change the ID later to a more semantic one, e.g. `yang-liu-edinburgh`, but this approach allows us to automatically generate IDs without manual intervention being required during ingestion.
- No ORCID in the ingestion material. We simply record the name without an ID in the XML; the name resolution logic described above will decide which person this will be resolved to.
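The three ingestion situations above can be sketched in a few lines. This is a simplified illustration, not the planned implementation: `ingest_author` is a hypothetical function, the `people` dictionary (person ID to ORCID) stands in for `people.yaml`, and the ORCIDs used below are made up. The sketch does not handle the case where the suffixed ID already exists as well.

```python
from typing import Optional


def ingest_author(
    name_slug: str, orcid: Optional[str], people: dict[str, Optional[str]]
) -> Optional[str]:
    """Return the explicit ID to write to the XML, or None to record no ID."""
    # Situation 3: no ORCID, so name resolution decides later.
    if orcid is None:
        return None
    # Situation 1: the ORCID is already known, reuse that person's ID.
    for person_id, known_orcid in people.items():
        if known_orcid == orcid:
            return person_id
    # Situation 2: new ORCID. Prefer the plain slug; otherwise append the
    # last four characters of the ORCID as a disambiguator.
    new_id = name_slug if name_slug not in people else f"{name_slug}-{orcid[-4:]}"
    people[new_id] = orcid
    return new_id


people = {"yang-liu": "0000-0001-0000-0001"}  # made-up ORCID
print(ingest_author("yang-liu", "0000-0001-0000-0001", people))
print(ingest_author("yang-liu", "0000-0002-1234-045X", people))
print(ingest_author("yang-liu", None, people))
```

Because the plain slug is taken first, the first person ingested with an ORCID under a given name keeps the undecorated ID, and later namesakes receive the suffixed form.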
- Merging pages (i.e., an author has published under multiple names that are wrongly shown on separate author pages). We set the `id="..."` attribute to that person's ID on each of their papers, adding an entry in `people.yaml` if it doesn't exist yet. We always ask the person to provide their ORCID, so that future ingestions will be able to correctly assign their papers to them.
- Separating pages (i.e., an author page lists papers that actually belong to different people). Under this proposal, this should only happen for papers that have no explicit ID recorded in the XML yet. These will either be on `/unverified/` pages, or on pages with "exactly one match" under the name resolution logic. We create a new entry in `people.yaml` for the author reporting this mistake, ask them to provide their ORCID, set `disable_name_matching: true`, and set the `id="..."` attribute to that person's ID on each of the papers actually belonging to them.
To create new IDs, we:
- Use the person's name slug if it doesn't exist as an explicit ID yet, e.g. `yang-liu`.
- Use the person's name slug plus "their institution of highest degree" otherwise, e.g. `yang-liu-edinburgh`. This is consistent with how we have previously created these explicit IDs.

This means that the first person to have an explicit ID created for their name will "lock in" that ID (e.g. `yang-liu`) to themselves, while other persons with the same name will need a disambiguator appended to it.
We use our current `name_variants.yaml` and the explicitly assigned IDs in the XML to infer verified persons under the new system. For this, we need to:
- Write explicit IDs to the XML for every person currently inferred through `name_variants.yaml`.
- Remove explicit IDs from the XML for IDs that are currently "catch-all" entries, i.e. those that normally say "May refer to multiple people" in `name_variants.yaml`.
- Write all inferred verified persons to the new `people.yaml`.

This makes sure we don't lose the manual disambiguation work we have already done. Notably, no new papers will be assigned to these persons unless we also add an ORCID for them, as the ingestion will only match by ORCID. However, if these people do provide us with an ORCID at some point, it is as easy as adding a line to `people.yaml`: their IDs will stay the same, their author URLs will continue to work, and future ingestions with their ORCID will correctly assign their papers to them.
We can also back-fill ORCIDs for a number of recent conferences where ORCIDs were already provided in the ingestion materials, but simply not used by us up until now.
- Optionally, we could also query the ORCID API to find (more) Anthology papers. We would not do this as a regular part of ingestion or build, but rather as a one-time step to pre-populate the database.
In the Python library, the changes primarily affect the `acl_anthology/people/` submodule. Concretely:
- `Person` objects need to get an `orcid` field.
- `PersonIndex._load_variant_list()` should be removed.
- `PersonIndex.load()` should load the new `people.yaml` before calling `self.build()`, and `self.build()` should no longer call `_load_variant_list()`.
- `PersonIndex.get_or_create_person()` is the place where the new name resolution logic needs to be implemented, and needs to be refactored accordingly.
- We probably want a `PersonIndex.find_by_orcid()`.
- For the ingestion logic, I believe it would be preferable if `PersonIndex.get_or_create_person()` would also incorporate the name resolution logic there, which means it also needs to optionally accept an ORCID. If this turns out to be too messy, we can move it into a separate function.
- `PersonIndex.save()` should be refactored to save the new `people.yaml`.
Since the Python library includes a lot of test cases, including a toy version of the data files, I would suggest a test-driven development approach where we:
- Use the steps under "Transitioning of metadata" on the toy data files.
- Update existing test cases and create new ones that test the different cases under the name resolution logic and ingestion logic.
- Implement the changes.
After the changes are implemented:
- Documentation should be updated where appropriate; maybe check for references to `name_variants.yaml` etc. and update accordingly.
- We should release these changes as a new major/minor version on PyPI, since these are breaking changes w.r.t. parsing the Anthology data files.
- Make sure `bin/create_hugo_data.py` writes out ORCIDs and whether the ID was explicit or not for each person.
- Make sure `bin/create_hugo_data.py` writes out for each author on a paper whether it had an explicit ID set or not. (This means checking if the `id` attribute was set on the `NameSpecification` and might already exist, not sure.)
The redirection logic could maybe also be implemented in `bin/create_hugo_data.py` by writing page stubs that redirect to the `/unverified/` version of an author page where appropriate.
- The author page template should show a person's ORCID, if available. We should probably try to follow ORCID brand guidelines wherever reasonable, i.e. show the ORCID logo and the full ORCID URL.
- A paper entry on an author page should indicate visually if it is verified or not (= the XML entry had an explicit ID or not). The exact visual appearance is still up for debate, but could be one or more of the following:
- Mark verified entries with some icon, e.g. a green checkmark which shows a popup on-hover to explain what it means.
- Mark unverified entries with some icon, e.g. a grey question mark which shows a popup on-hover to explain what it means.
- Make unverified papers have subtly greyed-out/desaturated colors.
We have a separate slide deck containing some mockups for this.
Based on https://github.com/acl-org/acl-anthology/issues/623#issuecomment-2940518999 via @mbollmann.
We want to keep the current name-centric ID system to uniquely identify a person. However, in the current state of the Anthology, names are not completely disambiguated with regard to which person they refer to, leading to hundreds of requests from authors to disambiguate or merge their author pages.
We currently have partial disambiguation in the form of explicitly assigned IDs and our `name_variants.yaml`; we want to keep this information, but move towards ORCIDs as the main factor for disambiguation. In the following, I will use the term "verified" to refer to all persons whose identity has been established (e.g. through an ORCID), in contrast to "unverified" persons who we instantiate based solely on their name.
In the XML, we assign IDs only to verified authors, i.e. when we are certain of the identity of this person. This is similar to what we do now, except that (i) so far we don't always assign an explicit ID when we merge 2+ names onto a single person, and (ii) the ID assignment will be primarily based on ORCID data in the future. We will store all verified authors in a file like `data/yaml/people.yaml`, which records them in the format suggested in v2, including ORCIDs if we have them.
This approach means that we will always be able to infer from the XML if an author identity has been verified or not. It also means that we still need to implicitly map authors without IDs to a person. To do this, we take the author's name on the paper, let's say "Yang Liu", and check if there are any matching persons in `data/yaml/people.yaml`. "Matching" could mean that the slugified version of their name already exists in our data (this is closest to what we currently do, and will e.g. conflate names like "Muller" and "Müller"); alternatively, that their name already exists as a canonical name in our data (this does not rely on heuristics and makes it easier to infer what is happening).
We distinguish three scenarios:
1. If there is no match, we create a new person instance. Internally, we give this an ID that makes it clear that it is an unverified person, such as `unverified/yang-liu`. (The slash would prevent this from being confused with an explicitly defined ID, since slashes are disallowed there. See below for what this means for the URLs.)
2. If there is exactly one match, then by default, we assign the paper to that existing person. This will prevent breaking things for people whose name is currently not ambiguous, a.k.a. most people. However, as this has the potential to introduce mistakes:
   - We can indicate that this particular paper has not been verified to belong to that person, e.g. through some icon on the website.
   - We can add a flag to that person's YAML entry, currently called `disable_name_matching: true`, that will disable this name-based assignment of unverified instances. This will prevent mistakenly assigning papers to this person if we already know that there is at least one other person with their name (e.g., they have opened a GitHub issue about this, but we do not yet know the identity of the other person(s) with this name). In this case, unverified instances of the name would be handled as in 1., i.e., an `unverified/yang-liu` person would be created.
3. If there is more than one match, we do not assign the paper to any of them, but to an `unverified/yang-liu`. This will prevent mistakenly assigning a paper to someone who is known to have an ambiguous name.
This approach is a compromise between not breaking things for most people and still making verified author identities clear in the XML through an explicit ID attribute.
We have a set of new papers. One of the following situations will apply.
- ORCID match. Easy. Assign the paper to the matched ID.
- No match, with ORCID. We create a new entry in `people.yaml`, generating a new ID. This could default to the name slug, if it doesn't exist as an explicit ID yet. If it does, we could decide on it manually, or add a number to the name slug.
- No match, without ORCID. Easy. Simply record the name without an ID; the algorithm described above will decide which person this will be assigned to.
- Merging pages: Easy. We add an entry in `people.yaml` if it doesn't exist yet, and set the `id=` attribute to that person's ID on each of their papers.
- Separating pages: Easy. This means that the person doesn't have an explicit ID yet, and all their papers will be under `unverified/yang-liu`, potentially together with other Yang Liu's papers. We create a new ID for them and assign that to their papers the same way as suggested in v2.
Since in both cases, we ask the author to confirm all the Anthology papers belonging to them, we can also set the `all_verified: true` flag at this point. (TODO @mbollmann: update)
To transition to this new system, we can use the currently assigned IDs and the data from `name_variants.yaml` as a starting point, since nothing in this system technically requires an ORCID. We can write explicit IDs to the XML for every person currently inferred through `name_variants.yaml`, with the exception of the "catch-all" entries (the ones that normally say "May refer to multiple people"). Notably, no new papers would be assigned to these IDs unless we also add an ORCID for them, as the ingestion only matches by ORCID. However, if these people do provide us with an ORCID at some point, it is as easy as adding a line to `people.yaml`: their IDs will stay the same, their author URLs will continue to work, and future ingestions with their ORCID will correctly assign their papers to them.
As for URLs, I've glossed over one detail above, which is that an author without an ORCID on any of their papers would always receive an `unverified/` ID, breaking current author pages. Here's one way to solve that:
- An unverified Yang Liu's papers could appear under a URL like `https://aclanthology.org/people/unverified/yang-liu/`.
- As long as there is no explicit ID `yang-liu` (in `people.yaml`), `https://aclanthology.org/people/yang-liu/` could redirect/be rewritten to `https://aclanthology.org/people/unverified/yang-liu/`.
This means that existing URLs will continue to function, but redirecting to `/unverified/` URLs could discourage authors from using such URLs in the future, and encourage them to provide us with an ORCID and verify that the papers are actually theirs. We'd have an interface on these pages that links to/creates a GitHub issue providing us with the information needed for verification.
Let's say I currently do not have an explicit ID and my name is not ambiguous, which is actually true for me (`marcel-bollmann`). My author URL `https://aclanthology.org/people/marcel-bollmann/` would still work, but redirect to `https://aclanthology.org/people/unverified/marcel-bollmann/`. If I don't submit my ORCID to the Anthology, but I have it on my OpenReview profile, and a new paper of mine gets ingested with my ORCID, the following would happen:
- An explicit ID `marcel-bollmann` would be created in `people.yaml` and assigned to my new paper.
- The URL `https://aclanthology.org/people/marcel-bollmann/` would no longer redirect, since there now is a `marcel-bollmann` with an explicit entry in `people.yaml`.
- Per the "Name and person logic" above, my old papers would still appear on that page, but marked as unverified somehow. I can fix this through an interface on that page that allows me to check all the papers that belong to me on that page (which would be all of them) and submit a GitHub issue.
Only if a second Marcel Bollmann with a different ORCID comes along would my old papers be removed from my page (since there is now a known ambiguity) and only appear on `https://aclanthology.org/people/unverified/marcel-bollmann/`, until I submit the information that they all belong to me.
Let's say I am a first-time author named Zhao Xinping. I have my ORCID linked to OpenReview and now get my first Anthology paper ingested. Since the ID `zhao-xinping` didn't exist yet in `people.yaml`, the ingestion will have created it and put my name and ORCID under it.
I now go to `https://aclanthology.org/people/zhao-xinping/` and notice that there are other papers by people named Zhao Xinping. These were in the Anthology without an explicit ID, since we have not verified their identities yet. They are listed on my page due to the automatic name-matching in Scenario 2, which was desirable for "Marcel Bollmann" above, but here results in a mistake. I open an issue to clarify that these other papers are not mine. How do we fix this?
- We could add an explicit ID to the other Zhao Xinping instances. However, this means we either need to find out the correct identities for all of them (and they're not the ones who got in contact with us, so we can't easily ask them), or we will start assigning "placeholder" IDs, like we do now. I think that's not great, because it conflates verified persons with "catch-all persons" that just exist for technical purposes, i.e. to distinguish them from persons whose identities we do know.
- We add a flag to the new Zhao Xinping's entry in `people.yaml` that expresses that all of their papers have an ID, so any remaining papers with the same name are not theirs. This would prevent the automatic name-matching from Scenario 2 from triggering, thereby fixing this mistake. This is the `disable_name_matching: true` flag.
From the continued discussion on #623:
The basic new idea is that we keep the current ID system. We pretend the current state of the Anthology is completely disambiguated. Every page currently under `/people/` is a person who we assume is known and correct. There is no UUID for unknown people. There are people for whom we do and don't have ORCIDs, but that is just a tool to facilitate ingestions. It is not a first-order concept.
As Marcel noted, every paper currently points to a person, either explicitly (e.g., `id=yang-liu-edinburgh`) or implicitly (every `<author>` tag without an ID, where the ID is the name slug). We will make all of these explicit; every paper will look like

    <paper>
      <author id="matt-post">...
      <author id="david-chiang">...
      <author id="nathan-schneider-cuboulder">...

Essentially, the baked-in assumption is that in most situations, `name == person`. This is actually a pretty safe assumption. Of the 371,280 `<author>` tags in the Anthology, only 1,958 currently have an `id` attribute: 0.5%. People actually seem to have pretty distinct names.
I think things get a lot simpler with this approach. Goals (1) and (2) below are met trivially: we are keeping the current system, which is name-centric, and uses no numeric IDs. We will not use numeric IDs in the future. What we are doing is updating the structures to support it.
We have two core tasks:
- Assigning papers to people. This is easy for any unambiguous name, and hard for ambiguous names. We want to fix that.
- Disambiguating people when we find a mistake. This is possible in our system, but it is cumbersome.
I think it will be helpful to work through every scenario from a workflow perspective. Here, I am careful to use our established terminology, distinguishing a name ("Yang Liu"), its slug (`yang-liu`), and all the people who use that name (`yang-liu`, `yang-liu-edinburgh`, `yang-liu-hk`, etc.). Every person has an ID, comprising their canonical name slug and optionally some distinguishing decorator (`yang-liu`, `yang-liu-edinburgh`, etc.). For the vast majority of Anthology entries, the name slug is the ID.
We have a set of new papers. One of the following situations will apply. A "name match" means the slugged name on the ingesting paper matches one in the Anthology.
1. ORCID match. Easy. Assign the paper to the matched ID.
2. Name match, unambiguous. Easy, since `name==person`. If the ingested paper has an ORCID, the person is now known, and we add that info to the DB.
3. Name match, ambiguous. Hard. We default to the matching name slug (as we mostly currently do), resolve manually (as we theoretically currently do), or employ some ML. If the ingested paper has an ORCID, we assign it to the chosen name (in this situation, we could also filter out matching names that have ORCIDs).
4. No match. Easy. We create a new person. If the ingested paper has an ORCID, that person is known.
Every situation here is straightforward except (3). However, we also expect it to be rare. (1) should be our most common situation. Anything exported from OpenReview should no longer require manual disambiguation.
To support the above, we need a database of every name ID. I suggest a file, `data/yaml/people.yaml`, which records them in the format Marcel suggested:

    person-id:
      - orcid: ...
      - canonical-name: ...
        first: ...
        last: ...
      - variants:
        - first: ...
          last: ...
      - other info: ...

There is no need to split into "known" and "unknown" because this isn't the primary organizational principle. Every ID is a real person; ORCIDs are nice for ingestion, but they are secondary. We might also want to split other ways, e.g., `data/yaml/people/{a.yaml,b.yaml,...}`, etc., which could help with handling merge requests. We might also record all known name variants for readability and quick loading purposes.
There is an issue with (3): we risk over-assigning papers without an ORCID to the person with the base name slug if we use that default assignment approach. I don't see how this is avoidable. To resolve this, we should push submitters to obtain ORCIDs, and we should build a small ML model that can help with this. It should not be difficult, from paper metadata alone (e.g., coauthors, titles, abstracts). I bet a customized but off-the-shelf LLM query could resolve this for us.
Note that, by the process above, the first person to ingest a paper with an ORCID under a name will "lock in" that ID to themselves. We can also backfill many of the papers in the Anthology with ORCIDs that were submitted with ingestion materials for the last several *ACL conferences.
We have the following kinds of disambiguation:
- One person, many names (handled by merging pages)
- One name, many persons (handled by creating a new ID)
Easy. We require the person to obtain an ORCID, for our own purposes. We then update the relevant paper IDs, set the canonical name, and extend the person's list of variant names.
One name, many persons. Easy. This is still a manual process, but note that it's not that common. We require the person to obtain an ORCID and provide their canonical name variant (basically, we fill in the missing info). We then ask them for the institution of their highest degree and use that to create a custom ID, e.g., (the nonexistent) matt-post-slu. We then update the person IDs on all relevant papers.
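To make the custom-ID step concrete, the following is a minimal sketch of building an ID of the form {first name}-{last name}-{optional disambiguator}. The function names and the slugging rules (accent stripping, lowercasing) are assumptions for illustration; the Anthology's actual slug function may behave differently.

```python
import re
import unicodedata


def slugify(text: str) -> str:
    """Lowercase, strip accents, and join alphanumeric runs with hyphens."""
    ascii_text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return "-".join(re.findall(r"[a-z0-9]+", ascii_text.lower()))


def make_person_id(first: str, last: str, disambiguator: str = "") -> str:
    """Build a person ID; the disambiguator is only added when given."""
    parts = [slugify(first), slugify(last)]
    if disambiguator:
        parts.append(slugify(disambiguator))
    # Drop empty parts (e.g. a missing first name) before joining.
    return "-".join(p for p in parts if p)
```

For example, make_person_id("Matt", "Post", "SLU") yields matt-post-slu, and omitting the disambiguator yields the base slug matt-post.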
I think both of these could be handled by a text search-box, similar to our "fix data" button:
- A popup would ask for the name, ORCID, terminal degree institution, and then a list of all Anthology IDs for their papers
- This would get used to create a GitHub issue via a PR template
There is an editorial decision related to a newly-created person ID (e.g., full or abbreviated name of the university). This can just be worked out based on aesthetics. I don't think it matters.
There can still be a conflict if two people with the same name have the same university. I would say we just deal with this if we come across it. We could have one of them choose a different institution.
We want to build a new author page format. This format should
- Center human names rather than IDs
- Fit within our current GitHub + static web page framework
- Make it easy to
  a. Consolidate multiple names under a single author profile
  b. Allow a particular name to be used by multiple authors
- Maximize user-serviceability
- Minimize demands on Anthology staff time
First, some terminology.
- A name is a surface string displayed on a PDF and recorded verbatim in a paper or talk's metadata.
- A person is a human being.
- A person may have multiple names, in which case each is called a variant, one of which is the canonical variant, typically selected by that person for current use.
We have the following problems:
- Name resolution. Names can be ambiguous in multiple ways: a name can be used by multiple people, and a single person can have multiple names. We therefore have an issue at ingestion time of determining which person a given name belongs to.
- Cumbersome disambiguation process. Our current resolution process is cumbersome and time-consuming. The IDs are assigned haphazardly, usually using an individual's Ph.D.-granting institution (e.g., Yang Liu/Edinburgh). The ID is largely hidden in our representation and is inconsistent in display (only some authors have a distinguisher). The process is hard to explain to users.
- Author pages. We have no permanent, consistent landing page for each author that we can point to. There is increasing demand for such a page from both authors and reviewing bodies.
Each author in the Anthology will have a page with the format aclanthology.org/people/{name}/{identifier}. A person is known to the Anthology when we have their ORCID. This leaves two situations:
- If the person is known to the Anthology, {name} is their canonical name variant and {identifier} is the ORCID.
- Otherwise, {name} is the name found on a paper, and {identifier} is a UUID that is designed to be uglier than an ORCID.
- In certain situations (e.g., for prominent deceased authors), we may choose a more usable UUID (for example, we may wish to use /people/aravind-joshi/1).
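The URL scheme above could be sketched as follows. This is only an illustration: the function name, the fallback_id parameter (for hand-picked identifiers such as the aravind-joshi example), and the use of a random hex UUID for unknown persons are assumptions. The ORCID check only validates the shape of the identifier (four hyphen-separated groups of four characters, with an optional X as the final character), not its checksum.

```python
import re
import uuid
from typing import Optional

# Shape of an ORCID iD: 0000-0000-0000-000X (last character digit or X).
ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def author_page_path(
    name_slug: str,
    orcid: Optional[str] = None,
    fallback_id: Optional[str] = None,
) -> str:
    """Return /people/{name}/{identifier} for a known or unknown person."""
    if orcid is not None:
        if not ORCID_PATTERN.match(orcid):
            raise ValueError(f"not a well-formed ORCID: {orcid!r}")
        identifier = orcid  # known person
    else:
        # Unknown person: a hand-picked ID if given, else a random UUID.
        identifier = fallback_id or uuid.uuid4().hex
    return f"/people/{name_slug}/{identifier}"
```

Calling author_page_path("aravind-joshi", fallback_id="1") gives /people/aravind-joshi/1, while a call with neither an ORCID nor a fallback produces a 32-character hexadecimal identifier.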
The name page (aclanthology.org/people/{name}) will serve as a "disambiguation page". It will present a list of all known people with that name. It will not display any papers, in order to discourage people from using that page as their main author page.
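One way to collect the data for such a disambiguation page is to index person IDs of the form {name}/{identifier} by their name part. The sketch below assumes that form; the function name and the placeholder identifiers in the usage note are hypothetical.

```python
from collections import defaultdict


def build_name_index(person_ids):
    """Group '{name}/{identifier}' strings by name.

    Returns a dict mapping each name slug to a sorted list of the
    identifiers of all people listed under that name.
    """
    index = defaultdict(set)
    for person_id in person_ids:
        name, identifier = person_id.split("/", 1)
        index[name].add(identifier)
    return {name: sorted(ids) for name, ids in index.items()}
```

Given two entries under yang-liu with different identifiers, the index would list both under the key yang-liu, which is what the name page needs to render; no paper lists are involved.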
We will add a mandatory id attribute to every <author> and <editor> tag that identifies the person. This attribute will have the form {name}/{identifier}, where the identifier depends on whether the author is known or unknown.
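A minimal sketch of setting this attribute with Python's standard xml.etree.ElementTree follows. The sample XML, the lookup table keyed by (first, last), and the placeholder identifier for the unknown editor are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical paper entry, used only to exercise the function below.
PAPER_XML = """\
<paper id="1">
  <title>An Example Paper</title>
  <author><first>Matt</first><last>Post</last></author>
  <editor><first>Jane</first><last>Doe</last></editor>
</paper>
"""


def add_person_ids(paper, ids_by_name):
    """Set id="{name}/{identifier}" on every <author> and <editor>.

    ids_by_name maps (first, last) to the full '{name}/{identifier}'
    string. A missing entry raises KeyError, since the attribute is
    meant to be mandatory.
    """
    for tag in ("author", "editor"):
        for node in paper.iter(tag):
            key = (node.findtext("first", default=""),
                   node.findtext("last", default=""))
            if key not in ids_by_name:
                raise KeyError(f"no person ID for {key}")
            node.set("id", ids_by_name[key])


paper = ET.fromstring(PAPER_XML)
add_person_ids(paper, {
    ("Matt", "Post"): "matt-post/0000-0002-1297-6794",
    ("Jane", "Doe"): "jane-doe/placeholder-uuid",  # hypothetical unknown person
})
```

Failing loudly on a missing ID is a design choice of this sketch; a real ingestion script might instead fall back to generating a UUID.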
We will maintain a file, data/yaml/canonical_names.yml, whose keys are identifiers and whose values are canonical name variants, e.g.,
0000-0002-1297-6794: matt-post
In this section, we list how common operations will be carried out.
Ideally, ingestion materials will carry ORCIDs. Here is the process for handling that:
- If the ORCID is new, create an entry for it, mapping to the name's slug.
- If the ORCID exists, find its entry in canonical_names.yml and use that as the identifier.

Note that this needs to be done iteratively within each ingestion, because we could have a new ORCID that maps to multiple variants. In this case, the first name will be adopted arbitrarily as canonical.
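The two ingestion cases above can be sketched as a single pass over the incoming authors. This is an illustrative sketch only: the function name is hypothetical, canonical_names stands in for the contents of canonical_names.yml loaded into a dict, and authors without an ORCID are simply passed over because their handling is outside this process.

```python
def resolve_ingested_authors(authors, canonical_names):
    """Resolve ingested (name_slug, orcid) pairs to person IDs.

    authors: list of (name_slug, orcid_or_None) in ingestion order.
    canonical_names: dict mapping ORCID -> canonical name slug,
        updated in place as new ORCIDs are seen.
    Returns one '{name}/{identifier}' string per author, or None for
    authors that came without an ORCID.
    """
    resolved = []
    for name_slug, orcid in authors:
        if orcid is None:
            resolved.append(None)  # no ORCID: not handled in this sketch
            continue
        # New ORCID: record this name as canonical. Existing ORCID:
        # keep the stored canonical name, whatever variant appears here.
        canonical = canonical_names.setdefault(orcid, name_slug)
        resolved.append(f"{canonical}/{orcid}")
    return resolved
```

Because the table is updated as the loop runs, a new ORCID that appears under two name variants within one ingestion resolves both to the first variant seen, matching the "first name wins" rule above.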
There really is no disambiguation problem any more. Every paper assigned to a known person should be correct. The name ambiguity problem should be solved. The remaining problem is name grouping, e.g., when a name is assigned to an unknown person via UUID.
Name grouping is when we group multiple names under a single person. This is always the result of an explicit request. In the new setting, we require an ORCID to do so. We can then update the id attributes on all papers and create an entry in canonical_names.yml to reflect the canonical variant.
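A grouping request could be applied roughly as in the sketch below. The data layout is an assumption made for this example: paper_author_ids maps a hypothetical (Anthology ID, author position) pair to that author's current id attribute, and canonical_names stands in for canonical_names.yml loaded as a dict.

```python
def group_names(paper_author_ids, selected, orcid, canonical_slug, canonical_names):
    """Group the selected author slots under one known person.

    paper_author_ids: dict mapping (anthology_id, author_index) to the
        current '{name}/{identifier}' value of the id attribute.
    selected: the keys the requester confirmed as their own papers.
    Returns the new person ID assigned to every selected slot.
    """
    existing = canonical_names.get(orcid)
    if existing is not None and existing != canonical_slug:
        raise ValueError(
            f"{orcid} already has canonical name {existing!r}"
        )
    canonical_names[orcid] = canonical_slug
    new_id = f"{canonical_slug}/{orcid}"
    for key in selected:
        if key not in paper_author_ids:
            raise KeyError(f"unknown author slot: {key}")
        paper_author_ids[key] = new_id
    return new_id
```

Refusing to overwrite an existing canonical name is a choice made in this sketch; changing a canonical variant would presumably be a separate, explicit operation.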
We would ideally present a nice UI for this. It would be a dynamically updated text search box, similar to address autocompletion. It would allow a user to select all of their papers for grouping together under the new ID. This would take some effort to design.
TODO