XML Format - acl-org/acl-anthology GitHub Wiki

Much of the structure of the XML format is specified by the RELAX NG schema (data/xml/schema.{rnc,rng}) and can be validated automatically. This document describes the structure less formally and also describes aspects of the format that aren't specified by the schema.

Structure

The root element is <volume id="X99">, where X is replaced by the one-letter code for the venue and 99 is replaced by the last two digits of the year.

The <volume> element has child elements <paper id="9999">, where 9999 is replaced by the four-digit paper identifier. For some venues (LREC), there is also an href attribute for the external URL of the paper.

Each <paper> element has several child elements:

  • <title>: The title (see below for more details)
  • <author>: The authors (see below for more details)
  • <editor>: The editors (see below for more details)
  • and others.
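
As a rough illustration of the structure described above, the following Python sketch parses a small volume with the standard library and lists its papers. The sample data (paper ids, titles, names, and the example.org URL) is invented for illustration and is not taken from the Anthology; `list_papers` is a hypothetical helper, not part of the Anthology's tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: ids, titles, names, and the URL are invented.
SAMPLE = """\
<volume id="X99">
  <paper id="1001">
    <title>An Example Paper</title>
    <author><first>Ada</first><last>Example</last></author>
  </paper>
  <paper id="1002" href="http://example.org/paper.pdf">
    <title>Another Example Paper</title>
    <author><first>Bo</first><last>Sample</last></author>
  </paper>
</volume>
"""

def list_papers(xml_text):
    """Return (volume id, paper id, external URL or None) for each <paper>."""
    volume = ET.fromstring(xml_text)
    return [
        (volume.get("id"), paper.get("id"), paper.get("href"))
        for paper in volume.findall("paper")
    ]

papers = list_papers(SAMPLE)
for row in papers:
    print(row)
```

Papers without an href attribute simply report None for the external URL.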

Text Fields

Text fields (<title>, <author>, etc.) are written in Unicode (UTF-8). The following elements are currently allowed for formatting:

  • <tex-math>: math formulas, coded using TeX (equivalent to TeX $...$). For example: An <tex-math>O(n^3)</tex-math> Algorithm for Parsing Context-Free Grammars.
  • <url>: a URL, displayed in typewriter font and hyperlinked
  • <i>: italics
  • <b>: boldface
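
One way a consumer of these fields might flatten the markup to plain text is sketched below. This is only an assumption about how a tool could treat the elements (math wrapped in $...$, other formatting dropped); `to_plain_text` is a hypothetical helper, not the Anthology's own code.

```python
import xml.etree.ElementTree as ET

def to_plain_text(elem):
    """Flatten a text field, wrapping <tex-math> content in $...$.

    Other formatting elements (<url>, <i>, <b>) are reduced to their text.
    """
    parts = [elem.text or ""]
    for child in elem:
        inner = to_plain_text(child)
        parts.append(f"${inner}$" if child.tag == "tex-math" else inner)
        parts.append(child.tail or "")  # text that follows the child element
    return "".join(parts)

title = ET.fromstring(
    "<title>An <tex-math>O(n^3)</tex-math> Algorithm for Parsing "
    "Context-Free Grammars</title>"
)
plain = to_plain_text(title)
print(plain)
```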

Below are additional guidelines for specific fields.

Title

The title should be written in title-case. The Anthology doesn't currently have rules for what "title-case" means exactly, but individual meetings/journals might. Characters whose case should be preserved even when a bibliography style uppercases or lowercases the title should be placed inside a <fixed-case> element (this serves the same purpose as curly braces in BibTeX). For example:

<title>The <fixed-case>ACL</fixed-case> <fixed-case>A</fixed-case>nthology: Current State and Future Directions</title>
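
Since <fixed-case> plays the role of BibTeX's curly braces, a conversion to a BibTeX title could look roughly like the sketch below. It is a minimal illustration with a hypothetical helper name (`to_bibtex_title`); it flattens any other markup to plain text and does not escape special LaTeX characters.

```python
import xml.etree.ElementTree as ET

def to_bibtex_title(elem):
    """Render a <title>, turning each <fixed-case> span into {...}."""
    parts = [elem.text or ""]
    for child in elem:
        inner = to_bibtex_title(child)
        parts.append("{" + inner + "}" if child.tag == "fixed-case" else inner)
        parts.append(child.tail or "")  # text that follows the child element
    return "".join(parts)

title = ET.fromstring(
    "<title>The <fixed-case>ACL</fixed-case> <fixed-case>A</fixed-case>nthology: "
    "Current State and Future Directions</title>"
)
bibtex_title = to_bibtex_title(title)
print(bibtex_title)
```

For the example title above, this yields The {ACL} {A}nthology: Current State and Future Directions.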

Authors and Editors

Each author/editor name must have exactly one <last> element and at most one <first> element.

  • The <last> element contains the name(s) by which papers are cited and by which their bibliography entries are sorted alphabetically. If an author has only a single name, that name should go into the <last> element. A "lineage" suffix like Jr. or III should also go into the <last> element.

  • The <first> element contains all other names, including middle names/initials.
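
The constraint on <first> and <last> can be checked mechanically. The sketch below validates a name element and builds a "Last, First" string of the kind used for sorting; the names are invented, and `sort_name` is a hypothetical helper rather than anything the Anthology provides.

```python
import xml.etree.ElementTree as ET

def sort_name(person):
    """Return 'Last, First' (or just 'Last') for an <author>/<editor> element."""
    lasts = person.findall("last")
    firsts = person.findall("first")
    if len(lasts) != 1 or len(firsts) > 1:
        raise ValueError("expected exactly one <last> and at most one <first>")
    last = lasts[0].text or ""
    first = firsts[0].text if firsts else None
    return f"{last}, {first}" if first else last

# Hypothetical names: a lineage suffix stays in <last>, a single name is <last> only.
full = sort_name(ET.fromstring(
    "<author><first>Jane Q.</first><last>Doe Jr.</last></author>"))
single = sort_name(ET.fromstring("<author><last>Mononym</last></author>"))
print(full)
print(single)
```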

The name should generally appear in the XML the same way that it does on the original paper. Exceptions:

  • Known misspellings should be corrected.
  • Names written in all-caps or all-lowercase should be converted to mixed case (initial capitals followed by lowercase), unless the person habitually writes the name that way.
  • Names abbreviated to initials may be expanded to full names if it's known that the full name is one of the author's preferred ways of writing their name. There are many reasons why an author might prefer an initial, so if there's any doubt, the XML should preserve the spelling on the original paper.

The Anthology also needs to know what individual a name refers to. Please see the page on Name Variants.

Link Fields

Paper PDFs are linked in three ways.

  • <url>URL</url>: URL of Anthology-hosted PDF.
  • <paper href="URL">...</paper>: URL of externally-hosted, non-ACL-sponsored PDF (currently used mainly for LREC)
  • <href>URL</href>: URL of externally-hosted, ACL-sponsored PDF (currently used mainly for TACL)
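
To make the three forms concrete, the sketch below collects whichever PDF links a <paper> element carries. The labels ("anthology", "external-non-acl", "external-acl"), the `pdf_links` helper, and the example.org URL are assumptions made for illustration; the source does not say which link should take precedence when more than one is present, so none is implied here.

```python
import xml.etree.ElementTree as ET

def pdf_links(paper):
    """Return (kind, URL) pairs for each PDF link form found on a <paper>."""
    links = []
    url = paper.findtext("url")          # <url> child: Anthology-hosted PDF
    if url:
        links.append(("anthology", url))
    if paper.get("href"):                # href attribute: external, non-ACL-sponsored
        links.append(("external-non-acl", paper.get("href")))
    href = paper.findtext("href")        # <href> child: external, ACL-sponsored
    if href:
        links.append(("external-acl", href))
    return links

# Hypothetical paper using the attribute form.
paper = ET.fromstring(
    '<paper id="1001" href="http://example.org/paper.pdf">'
    "<title>An Example Paper</title></paper>"
)
links = pdf_links(paper)
print(links)
```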

Other files can be linked as well:

  • <software>filename</software>
  • <dataset>filename</dataset>
  • <attachment type="...">filename</attachment> where the type is 'note', 'presentation', 'poster', 'attachment', or missing
  • <mrf src="latexml">filename.xhtml</mrf> (machine readable format? Mr. F?)
  • <video href="URL" tag="video"/>
  • <revision id="2">Q15-1022v2</revision>
  • <erratum id="1">Q15-1022e1</erratum>
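
The elements above can be gathered in one pass over a paper's children, as in the following sketch. The file names and the example.org URL are invented, and `linked_files` is a hypothetical helper; the sketch assumes that <video> carries its target in the href attribute and that every other element carries it as text, as in the list above.

```python
import xml.etree.ElementTree as ET

FILE_TAGS = ("software", "dataset", "attachment", "mrf",
             "video", "revision", "erratum")

# Hypothetical paper; only the revision and erratum values come from the list above.
SAMPLE = """\
<paper id="1022">
  <attachment type="poster">example-poster.pdf</attachment>
  <attachment>example-extra.zip</attachment>
  <video href="http://example.org/talk" tag="video"/>
  <revision id="2">Q15-1022v2</revision>
  <erratum id="1">Q15-1022e1</erratum>
</paper>
"""

def linked_files(paper):
    """Return (tag, type attribute or None, target) for each linked file."""
    found = []
    for child in paper:
        if child.tag in FILE_TAGS:
            if child.tag == "video":
                target = child.get("href")
            else:
                target = (child.text or "").strip()
            found.append((child.tag, child.get("type"), target))
    return found

files = linked_files(ET.fromstring(SAMPLE))
for entry in files:
    print(entry)
```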