Data statement schema - acm-toce/documentation GitHub Wiki

For data set submissions, we recommend submitting a data statement that aligns with the following schema.

The statement should include the following sections:

Header

The header should include the following:

  • What
  • Dataset Title
  • Dataset Curator(s) [name, affiliation, role]
  • Dataset Version [version, date]
  • Dataset Citation and, if available, DOI
  • Data Statement Author(s) [name, affiliation, role]
  • Data Statement Version [version, date]
  • Data Statement Citation and, if available, DOI
  • Links to versions of this data statement in other languages
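
The header fields above can be captured as a small record so that missing entries are easy to spot. The following is a minimal sketch, not part of the schema itself: the class names, field names, and the `missing_header_fields` helper are hypothetical, and treating DOIs and translation links as optional is an assumption based on the "if available" wording.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Person:
    # [name, affiliation, role] as listed in the header
    name: str
    affiliation: str
    role: str


@dataclass
class Version:
    # [version, date]; the date format is not fixed by the schema
    version: str
    date: str


@dataclass
class DataStatementHeader:
    # Hypothetical field names mirroring the header bullets
    dataset_title: str
    dataset_curators: List[Person]
    dataset_version: Version
    dataset_citation: str
    statement_authors: List[Person]
    statement_version: Version
    statement_citation: str
    dataset_doi: Optional[str] = None      # "if available"
    statement_doi: Optional[str] = None    # "if available"
    translations: List[str] = field(default_factory=list)  # links to other-language versions


def missing_header_fields(header: DataStatementHeader) -> List[str]:
    """Return the names of required header fields that are empty."""
    required = [
        "dataset_title",
        "dataset_curators",
        "dataset_version",
        "dataset_citation",
        "statement_authors",
        "statement_version",
        "statement_citation",
    ]
    return [name for name in required if not getattr(header, name)]
```

A submission tool could call `missing_header_fields` before accepting a statement and report the returned names to the author.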

Executive Summary

The executive summary is a short (60–100 word) summary of the data statement that at a minimum should include: (1) a one-sentence description of the curation rationale, (2) the data types, and (3) an overview of relevant quantitative information such as the dataset size.
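
The 60–100 word range can be checked mechanically. The sketch below assumes that words are whitespace-separated tokens, which the schema does not specify; the function names are hypothetical.

```python
def summary_word_count(summary: str) -> int:
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return len(summary.split())


def summary_length_ok(summary: str, low: int = 60, high: int = 100) -> bool:
    """Return True if the summary falls within the recommended word range."""
    return low <= summary_word_count(summary) <= high
```

Whitespace tokenization is a poor fit for languages written without spaces between words, so a translated statement may need a different counting rule.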

Curation Rationale

The curation rationale should provide answers to questions including the following, to be interpreted both as future-looking prompts for dataset design and informational questions from users of completed datasets: What is the intended purpose of this dataset? What is the task or research question the dataset is intended to address? What data is included and what are the goals for including it? What is the internal organization of the dataset? What constitutes a data instance?

Documentation for Source Datasets

For datasets built out of other pre-existing datasets, a link to a data statement for each source dataset should be included. If a data statement is not available, provide a link to a publication or other documentation. Provide links to licenses, copyright, or terms of use for source datasets, where applicable.

Data Description

A thorough description of all the dataset contents should be provided. Describe the structure and format of the dataset, and define each data type included in the data (e.g., tabular, time series, text, audio). If the dataset includes multiple data modalities, their relationship should be specified.

For text data

All of the languages and language varieties represented in the dataset should be characterized with (1) a language tag from BCP-47 identifying the language variety (e.g., en-US or yue-Hant-HK), and (2) a prose description elucidating and elaborating on the BCP-47 tag (e.g., English as spoken in Palo Alto, California; Cantonese written with traditional characters by speakers in Hong Kong who are bilingual in Mandarin; French Sign Language as used in Marseille, France).
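
To illustrate how a tag such as `en-US` or `yue-Hant-HK` breaks down, here is a minimal sketch that splits a simple tag into its language, script, and region subtags. It is deliberately partial: it assumes the tag has only the form language[-script][-region], it ignores variants, extensions, and private-use subtags that BCP-47 also allows, and it checks shape only, not whether the subtags are registered. The function name is hypothetical.

```python
import re
from typing import Dict, Optional

# Shape of a simple tag: 2-3 letter language, optional 4-letter script,
# optional region (2 letters or 3 digits). This covers only a subset of BCP-47.
_SIMPLE_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def split_simple_tag(tag: str) -> Optional[Dict[str, Optional[str]]]:
    """Split a simple language tag into subtags, or return None if it does not fit."""
    match = _SIMPLE_TAG.match(tag)
    if match is None:
        return None
    return match.groupdict()
```

The prose description is still needed alongside the tag: the tag `yue-Hant-HK` says nothing about the speakers being bilingual in Mandarin, which is exactly the kind of detail the description is meant to add.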

Subject Demographics

All of the subject groups represented in the dataset should be characterized with a prose description when demographic information is available. Demographic categories are context- and culture-specific; therefore, locally appropriate categories and definitions should be used as determined by the community. Suggested specifications include:

  • Age
  • Gender
  • Race/ethnicity
  • Socioeconomic status
  • Academic level

Additionally, include any demographics specifically relevant to the dataset's use cases. For example, a Natural Language Processing dataset might include:

  • First language(s)
  • Proficiency in the language(s) of the data
  • Number of different language users represented
  • Presence of disordered speech or sign

Annotator Demographics

All of the annotator groups represented in the dataset, including those who develop the guidelines, should be characterized with a prose description. Demographic categories are context- and culture-specific; therefore, locally appropriate categories and definitions should be used as determined by the community. Suggested specifications include:

  • Age
  • Gender
  • Race/ethnicity
  • Socioeconomic status
  • First language(s)
  • Proficiency in the language(s) or subject matter of the data being annotated
  • Number of different annotators represented
  • Relevant training

Data Collection

A description of the situation in which the data was collected and/or the relevant characteristics should be provided. This schema element may also be used to describe the cultural context of the collected data. Suggested specifications include:

  • Time and place of data collection
  • Date(s) of data collection
  • Modality
  • Synchronous (e.g., in-person or live online meetings) vs. asynchronous (e.g., emails, forums, automated collection software) data collection
  • Data genre (e.g., newswire or social media) or topics (e.g., entertainment or natural disaster)
  • Context of data collection (e.g., photos participants were all looking at; a task the participants were given)
  • Any additional details about the cultural context

Preprocessing and Data Formatting

A description of all preprocessing and data formatting modifications made to the data (except for annotations) should be provided, including information about any anonymization procedures. The description should also specify any tools used to make the modifications and whether the raw data is included in the dataset.

Capture Quality

A description of quality issues in data capture should be provided. This includes all types of quality issues that arise across a broad range of collection methodologies for capturing an otherwise impermanent event.

Limitations

For any challenges not fully addressed, a description of those challenges and characterization of the resulting limitations of the dataset should be provided.

Metadata

A collection of pointers to relevant metadata should be provided. Suggestions include:

  • Annotation Guidelines: Link to the published or online guidelines used by annotators
  • Annotation Process: Link to documentation providing metadata about the annotation process, including protections for annotator anonymity, annotator compensation, and any automated processes producing annotation
  • Dataset Quality Metrics: Metrics for inter-annotator agreement and/or other numerical scores of dataset quality
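
As one example of an inter-annotator agreement metric, the sketch below computes Cohen's kappa for two annotators who labeled the same items. The schema does not prescribe a particular metric; kappa is chosen here only because it is widely used, and the function name is hypothetical. It assumes exactly two annotators and categorical labels.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items in the same order."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both annotators must label the same non-empty set of items")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is undefined,
        # and this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

With more than two annotators, or with missing labels, a different statistic (for example Fleiss' kappa or Krippendorff's alpha) is the usual choice, and the statement should name whichever metric was actually used.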

Disclosure and Ethical Review

For projects supported by funding, a description of the funding source for the dataset and relevant information (e.g., grant number) should be specified. For projects that went through an ethical approval process, a link to the approving body (e.g., IRB) should be provided. In addition, include: a brief description of any consent processes; if participants in the dataset or annotators are compensated, how compensation rates are determined; and any potential conflicts of interest.

Distribution

A description of how the dataset is to be distributed should be provided. This includes the method of distribution (e.g., through a data archive, files on website, API, GitHub) and any access restrictions (e.g., sensitive or confidential content, intellectual property (IP) considerations, export controls, or other regulatory restrictions). If an IP license, copyright, or terms of use (ToU) applies to any portion of the dataset, provide links to or reproduce the licenses, copyright, and/or ToU, and list any fees associated with these restrictions. Other suggestions for detailing the distribution plan include providing such information as:

  • Who has access to the dataset as of the writing of the documentation and who else it is intended to be distributed to
  • What conditions, if any, there are for obtaining access to the whole dataset or any subsets of it
  • Whether the dataset has a digital object identifier (DOI)
  • Date(s) of distribution of the dataset

Maintenance

A description of how the dataset is to be maintained should be provided. This description should specify who is supporting, hosting, and maintaining the dataset and how to contact the manager of the dataset. Other suggestions for detailing the maintenance plan include providing such information as:

  • Where to find and contribute to information about errors in the dataset
  • How often, by whom, and how updates to the dataset are communicated to users
  • Applicable limits on data retention and how those limits will be enforced (e.g., respecting agreements with data subjects regarding whether and when their data will be deleted)
  • Whether older versions of the dataset are to be supported, hosted, and maintained
  • How users are to be notified that the dataset is outdated or no longer available
  • Whether others can contribute to the dataset and how; whether and how any contributions are validated and further distributed to other users

Other

Any further considerations that are relevant for the dataset should be included here.

Glossary

A list of terms and associated definitions that may be technical or unfamiliar to non-experts should be provided.