EukMap TSV format - wildcart/unieuk-wiki GitHub Wiki

UniEuk / EukMap TaxFile import/export Format

The EukMap TaxFile format is the primary format supported by the EukMap platform for importing and exporting taxon-related information. It is also used to replace / update taxa during the replace operation. It consists of an optional header line specifying the content of each field, followed by an arbitrary number of lines describing the taxa that should be imported or replaced / updated.

Format

The default format supported by the EukMap platform is a tab (\t) delimited file using UNIX/macOS style line breaks (\n), and the value of each field is considered to be unquoted. Optionally, the field delimiter, line breaks, and the character used to quote values can be changed before importing the taxonomy or replacing / updating a taxon and its subtree.

File Encoding

Files must be saved using UTF-8, UTF-16, or UTF-32 encoding to support non-Latin Unicode characters, for example in author names. Files may optionally start with the Unicode byte order mark (BOM).
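
As an illustration of the encoding rules above, the following minimal sketch decodes the raw bytes of a file and honours an optional BOM. The helper name `decode_taxfile` and the fallback to UTF-8 when no BOM is present are assumptions made for this example, not documented platform behaviour.

```python
import codecs

# UTF-32 is tested before UTF-16 because the UTF-32 LE BOM begins with
# the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_taxfile(data: bytes, default: str = "utf-8") -> str:
    """Decode file bytes, stripping a BOM if one is present (hypothetical helper)."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # These codecs detect the byte order from the BOM and remove it.
            return data.decode(encoding)
    # Assumption: without a BOM the caller knows the encoding; UTF-8 by default.
    return data.decode(default)
```

A UTF-16 or UTF-32 file saved without a BOM cannot be recognised this way, so in that case the encoding would have to be passed explicitly via `default`.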

Quoted Values

Values may be quoted using any character that is not used to delimit fields or separate records (line break). If a value is quoted, its first and last characters must both be the selected quote character. Starting a value with the quote character but not ending it with the same character causes an error. Quote characters used within or at the end of a value are read as literal characters as long as the value itself is not quoted. Within a quoted value, the quote character itself must be escaped by doubling it.

Examples of how values are interpreted when the double quote (") is used as the quote character:

value                                       interpreted value
unquoted value                              unquoted value
"quoted value"                              quoted value
"quoted ""value"" using quote character"    quoted "value" using quote character
unquoted "value"                            unquoted "value"
"unsupported" quoting                       ERROR

The last value in the list above is considered quoted because it starts with the selected quote character. It causes an error because the field delimiter is expected immediately after the end of the quoted value (second "). In this case, however, the field contains extra characters that are neither part of the quoted value nor a field delimiter.
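
The quoting rules can be sketched as a small function that interprets a single field which has already been split from its line. The name `interpret_value` is hypothetical, and the sketch deliberately ignores field delimiters inside quoted values, which a full parser would also have to handle.

```python
def interpret_value(raw: str, quote: str = '"') -> str:
    """Interpret one field according to the quoting rules (illustrative sketch)."""
    if not raw.startswith(quote):
        # Unquoted value: quote characters inside it are literal.
        return raw
    out = []
    i = 1  # skip the opening quote
    while i < len(raw):
        ch = raw[i]
        if ch == quote:
            if raw[i + 1:i + 2] == quote:
                # Doubled quote character stands for one literal quote.
                out.append(quote)
                i += 2
                continue
            if i != len(raw) - 1:
                # Closing quote followed by extra characters.
                raise ValueError(f"unexpected characters after closing quote: {raw!r}")
            return "".join(out)
        out.append(ch)
        i += 1
    raise ValueError(f"quoted value is not terminated: {raw!r}")
```

Applied to the rows of the table above, the first four values return the interpreted value shown, and the last one raises a `ValueError`.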

Content

The first field in each line must identify the taxon that should be imported or replaced / updated. Additionally, the file may contain any number of fields containing additional information about the taxon. The EukMap TaxFile Format defines fields that are interpreted by the platform but the file is not restricted to only include such fields.

Taxonomic Path

The first field must be titled taxon, and for each line describing a taxon it must contain the taxonomic path of that taxon. Taxa in a taxonomic path are separated by semicolons, and all lines must be sorted alphabetically by taxonomic path. A trailing semicolon is ignored, so A;B and A;B; are considered equal.

For example, to replace / update taxon R with taxonomic path A;B;C and its children R1, S (with children S1, S2) and T1, the file has to contain seven lines (six if the header is skipped): the header line followed by six lines describing the taxa. The second line has to specify taxon R itself and the following lines specify each of R's children and children of children:

taxon
A;B;C;R
A;B;C;R;R1
A;B;C;R;S
A;B;C;R;S;S1
A;B;C;R;S;S2
A;B;C;R;T1
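
A file like the one above could be produced from a nested structure with a short recursive function. This is only a sketch: the function name `subtree_lines` and the nested-dict representation of the tree are assumptions, and children are sorted by name at each level, which is one plausible reading of "sorted alphabetically".

```python
from typing import Dict, List


def subtree_lines(parent_path: List[str], name: str, children: Dict[str, dict]) -> List[str]:
    """Return one taxonomic path per taxon, parent first, children sorted by name."""
    path = parent_path + [name]
    lines = [";".join(path)]
    for child in sorted(children):
        lines.extend(subtree_lines(path, child, children[child]))
    return lines


# Taxon R (path A;B;C) with children R1, S (children S1, S2) and T1.
tree_below_r = {"R1": {}, "S": {"S1": {}, "S2": {}}, "T1": {}}
file_lines = ["taxon"] + subtree_lines(["A", "B", "C"], "R", tree_below_r)
print("\n".join(file_lines))
```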

Absolute vs Relative Taxonomic Path

The taxonomic path of a taxon can be either an absolute path starting at the top-level taxon, usually the domain, or a relative path starting with the name of the selected taxon. For example, if taxon C and its subtree are to be replaced / updated, the taxonomic paths in the first field must start with either the top-level taxon A (assuming A and B are ancestors of C) or C (the selected taxon) for all taxa specified in the file:

Absolute Path:

taxon
A;B;C
A;B;C;D
A;B;C;E
A;B;C;F

Relative Path specifying the same subtree:

taxon
C
C;D
C;E
C;F

Specifying taxa that are not descendants of the selected taxon causes the replace operation to fail:

taxon
A;B;C
A;B;
A;G;H
I;
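
A check along these lines can be sketched before submitting a replacement file. The function names are hypothetical, and the rule that a file must not mix absolute and relative paths is this sketch's interpretation of "for all taxa specified in the file".

```python
from typing import List, Optional


def split_path(path: str) -> List[str]:
    """Split a taxonomic path, trimming spaces and ignoring trailing semicolons."""
    parts = [p.strip() for p in path.split(";")]
    while parts and parts[-1] == "":
        parts.pop()
    return parts


def path_form(path: str, selected_abs: str) -> Optional[str]:
    """Return 'absolute', 'relative', or None if the path lies outside the subtree."""
    parts, selected = split_path(path), split_path(selected_abs)
    if parts[:len(selected)] == selected:
        return "absolute"
    if parts and parts[0] == selected[-1]:
        return "relative"
    return None


def check_replacement_paths(paths: List[str], selected_abs: str) -> str:
    """Raise ValueError unless all paths lie in the subtree and share one form."""
    forms = {path_form(p, selected_abs) for p in paths}
    if not forms or None in forms:
        raise ValueError("file is empty or contains taxa outside the selected subtree")
    if len(forms) != 1:
        raise ValueError("absolute and relative paths must not be mixed")
    return forms.pop()
```

With `A;B;C` as the selected taxon, the two valid examples above are reported as `absolute` and `relative` respectively, while the last example is rejected.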

Completeness

The file must include all taxa included in a taxonomic path as distinct records. For example, to create taxon C with taxonomic path A;B the file must include three lines, one line for taxon C itself, as well as one line for each of its ancestors A and B.

taxon
A
A;B
A;B;C
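
The completeness requirement can be checked with a short helper that reports every ancestor path lacking its own record. `missing_ancestors` is a hypothetical name used for this sketch only.

```python
from typing import List, Tuple


def _split(path: str) -> Tuple[str, ...]:
    """Split a path into taxa, trimming spaces and ignoring trailing semicolons."""
    parts = [p.strip() for p in path.split(";")]
    while parts and parts[-1] == "":
        parts.pop()
    return tuple(parts)


def missing_ancestors(paths: List[str]) -> List[str]:
    """Return ancestor paths that are used in the file but have no record of their own."""
    present = {_split(p) for p in paths}
    missing = set()
    for parts in present:
        for depth in range(1, len(parts)):
            prefix = parts[:depth]
            if prefix not in present:
                missing.add(prefix)
    return sorted(";".join(m) for m in missing)
```

For the file shown above the result is an empty list, whereas a file containing only `A;B;C` would be reported as missing `A` and `A;B`.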

Spaces

Leading and trailing spaces are ignored both in the taxon field as a whole and in each taxon within the taxonomic path, so all examples below are considered equal. The examples use the double quote (") as the quote character to make trailing spaces easier to see:

taxon;
" A;B;C"
" A;B;C "
"A;B;C "
"A; B;C"
"A; B ;C"
"A;B ;C"

Metadata

Additionally, each line may contain metadata describing the specified taxon. These fields are optional and may be specified in any order. If the header line is omitted, the parser will ignore all fields except the first field containing the taxon (and its classification). Fields may be left empty if data is not available / unknown for certain taxa, and leading / trailing whitespace will be trimmed by the parser.

taxon   rank      author          field
A;      domain    Author Year     some value
A;B     phylum                    another value
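
Reading such a file in the default format (tab-delimited, unquoted, with a header line) can be sketched with Python's standard `csv` module. The function name `read_taxfile` is an assumption for this example, and the sketch requires the header line to be present.

```python
import csv
import io
from typing import Dict, List


def read_taxfile(text: str) -> List[Dict[str, str]]:
    """Parse tab-delimited, unquoted TaxFile text into one dict per taxon."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    header = [name.strip() for name in next(reader)]
    records = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        values = [value.strip() for value in row]
        # Trailing empty fields may be missing from a line; treat them as empty.
        values += [""] * (len(header) - len(values))
        records.append(dict(zip(header, values)))
    return records


example = (
    "taxon\trank\tauthor\tfield\n"
    "A;\tdomain\tAuthor Year\tsome value\n"
    "A;B\tphylum\t\tanother value\n"
)
records = read_taxfile(example)
```

Here the empty `author` field of the second record is kept as an empty string, and the unsupported `field` column is carried along under its header name.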

The parser supports two types of fields: supported (named) fields and custom fields containing arbitrary data that will be stored in a key-value map and shown in the custom properties of a taxon.

Named Fields

Named fields are fields that are supported (understood) by the platform. The platform converts the values contained in these fields into internal data types which are then mapped to specific taxon properties. Controlled vocabulary is used whenever possible to restrict the values which may be entered into these fields to terms that are commonly used by the community.

Fields may also contain lists. List items are separated by semicolons unless stated otherwise in the description of the field below. List items may need to follow a certain syntax, as described in the documentation of the field. During import, list items are sorted alphanumerically and duplicate items are ignored.
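
The list handling described above (split, trim, drop duplicates, sort) could look roughly like this. `parse_list` is a hypothetical helper, and Python's default string ordering is used as a stand-in for the platform's alphanumeric sort, whose exact collation is not specified here.

```python
from typing import List


def parse_list(value: str, sep: str = ";") -> List[str]:
    """Split a list field into trimmed, de-duplicated, sorted items."""
    items = {item.strip() for item in value.split(sep)}
    items.discard("")  # ignore empty items, e.g. from a trailing separator
    return sorted(items)
```

For example, `parse_list("b; a;b;;c")` yields the items `a`, `b`, and `c` in that order.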

The order of supported fields is not important and will be determined automatically by the parser based on the header line.

Full list of supported fields (in alphabetic order):

  • taxon: [A;B;C;name of taxon] The full taxonomic classification of the taxon (its taxonomic path). Elements in the path must be separated by semicolons (;), and the last element in the path is considered to be the taxon's name. Trailing semicolons are ignored by the parser, i.e. no taxon without a name will be created if the taxonomic classification ends with a semicolon.
  • accessions: [list] A list of accession numbers identifying INSDC sequence entries linked to a taxon. The parser supports accession ranges in the form ACCESSION-ACCESSION, which are expanded during import. For example, the range AB000123-AB000125 is expanded into three distinct accession numbers. The parser can also iterate through the leading alphabetic prefix of accession numbers, provided both accessions have the same number of characters: AA99-AB01 is expanded into the three accessions AA99, AB00, and AB01, while the range AA99-AAA01 is not supported. If an accession is not identified as a range, or the range format is not supported, the accession is added as is.
  • alternative names: [list] A list of alternative names used by scientists or non-scientists to refer to this taxon. Names in this list may be common (informal) names that do not meet the requirements of taxonomic synonyms which are listed in the synonyms fields (see below).
  • author: [author year] This field may contain the name of a single author, a list of authors, or a list of authors abbreviated with "et al.". The author (list) must be followed by the year in which the description was published. A taxon for which author information is specified is considered an authoritative taxon, and special rules will be applied when renaming such taxa.
  • candidate: [boolean] Informal flag indicating that a taxon or subtree of the taxonomy needs revision. This flag is an informal idea to help flag areas of the tree that require revisions / where descriptions of new taxa seem necessary at some point. Values of true and yes will flag a taxon as candidate while all other values as well as empty fields will not flag taxa.
  • description: [text] The formal description of a taxon. Due to limitations of the CSV/TSV formats, for example line breaks, it is not recommended to include complex descriptions in the replacement file.
  • envo terms: [list] List of ENVO terms linked to this taxon, each specified as ENVO_ followed by eight digits (with leading zeros), for example ENVO_12345678.
  • locations: [list] List of GPS coordinates where the taxon has been observed. GPS coordinates must be specified as decimals (using .), with a colon (:) separating latitude and longitude. Multiple GPS coordinates must be separated by semicolons: 12.3456:65.4321;-12.3456:-65.4321.
  • notes: [text] Free text field that may contain 'technical' comments regarding the status of the taxon, for example "needs revision." Due to limitations of the CSV/TSV formats, for example line breaks, it is not recommended to include complex notes in the replacement file.
  • rank: [controlled] Taxonomic rank of the taxon. The EukMap platform supports all taxonomic ranks supported by the NCBI taxonomy, including all major ranks from domain (highest) to isolate (lowest) and, additionally, the super, sub, infra, and para rank 'modifiers' where applicable. The full list of supported ranks is documented in the supplemental material of Schoch 2020.
  • status: [controlled] Nomenclatural status of the taxon. The EukMap platform supports commonly used terms to describe the nomenclatural status of a taxon. It supports the list of terms supported by the NCBI taxonomy (Schoch 2020), as well as additional terms. The full list of supported terms is documented in Nomenclatural-Status.
  • synonyms: [list] List of taxonomic synonyms. In contrast to alternative names, taxonomic synonyms are scientific names that have to follow specific rules according to the applicable nomenclature code.
  • type: [controlled] Type of the taxon. The EukMap platform supports commonly used terms to describe the type of taxa. The full list of supported terms is documented in Taxon-Type.
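
To make the range expansion described for the accessions field more concrete, the sketch below treats an accession as a fixed-width counter: letters count in base 26 and digits in base 10. This is one possible reading of the description, not the platform's actual implementation. The function names and the `max_size` safety limit are assumptions of this sketch.

```python
import re
from typing import List

_ACCESSION = re.compile(r"^([A-Z]+)(\d+)$")


def _to_int(prefix: str, digits: str) -> int:
    """Map prefix and number to one integer (letters base 26, digits base 10)."""
    n = 0
    for ch in prefix:
        n = n * 26 + (ord(ch) - ord("A"))
    return n * 10 ** len(digits) + int(digits)


def _from_int(n: int, prefix_len: int, digit_len: int) -> str:
    """Inverse of _to_int for fixed prefix and digit widths."""
    n, number = divmod(n, 10 ** digit_len)
    letters = []
    for _ in range(prefix_len):
        n, r = divmod(n, 26)
        letters.append(chr(ord("A") + r))
    return "".join(reversed(letters)) + str(number).zfill(digit_len)


def expand_accessions(item: str, max_size: int = 10000) -> List[str]:
    """Expand ACCESSION-ACCESSION; return the item unchanged if it is not a supported range."""
    start, sep, end = item.partition("-")
    a, b = _ACCESSION.match(start), _ACCESSION.match(end)
    if not sep or not a or not b:
        return [item]
    # Both accessions must have the same number of letters and digits.
    if len(a[1]) != len(b[1]) or len(a[2]) != len(b[2]):
        return [item]
    lo, hi = _to_int(a[1], a[2]), _to_int(b[1], b[2])
    if hi < lo or hi - lo >= max_size:  # max_size is an assumed guard, not documented
        return [item]
    return [_from_int(n, len(a[1]), len(a[2])) for n in range(lo, hi + 1)]
```

Under these assumptions, `AB000123-AB000125` and `AA99-AB01` each expand to three accessions, while `AA99-AAA01` is returned unchanged.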

Custom Fields

Additionally, users may specify custom fields that are not directly supported by the platform. These fields will be imported and included in a list of custom properties of a taxon. The field's header is used as the property key and the field's value as the property value. In the Web front-end, the properties will be displayed as a simple list of key-value pairs on the Properties tab of the taxon (meta)data display.

As for supported fields, the order of custom fields is not important and supported fields and custom fields may be mixed.
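
The separation into named and custom fields can be sketched as follows. The set of names is taken from the list of supported fields above. The helper `split_record` and the exact-match comparison of header names are assumptions of this sketch.

```python
from typing import Dict, Tuple

# Header names of the supported (named) fields listed in this document.
NAMED_FIELDS = {
    "taxon", "accessions", "alternative names", "author", "candidate",
    "description", "envo terms", "locations", "notes", "rank", "status",
    "synonyms", "type",
}


def split_record(record: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split one parsed line into (named fields, custom properties)."""
    named, custom = {}, {}
    for key, value in record.items():
        # Assumption: headers are matched exactly after trimming whitespace.
        target = named if key.strip() in NAMED_FIELDS else custom
        target[key.strip()] = value
    return named, custom


named, custom = split_record(
    {"taxon": "A;B", "my field": "another value", "rank": "phylum"}
)
```

In this example `taxon` and `rank` end up in `named`, while `my field` becomes a custom property, regardless of the order in which the columns appear.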
