CSV annotations export - metaspace2020/metaspace GitHub Wiki

Annotations CSV Export

Columns

group, datasetName, datasetId - metadata about the dataset this annotation belongs to.
formula - The formula for the base molecule of the annotated ion, e.g. H2O for water.
adduct - The adduct applied to the formula for this annotation, e.g. M+H for protonation.
chemMod - The formula for the chemical modification (if any). Normally this will be empty unless chemical modifications were added in the annotation settings.
ion - A combination of the formula, adduct, chemMod, and charge of this annotation that can be used to uniquely identify this annotation within the dataset.
mz - The theoretical m/z of the first peak of the ion (including the adduct mass and with an electron mass added/removed). This mass may differ very slightly from the exact theoretical mass, as a centroiding simulation is done on the theoretical peaks, which introduces a small amount of numerical error (generally <0.1ppm).
fdr - Global False Discovery Rate for this annotation. "Global" here means e.g. out of the set of all annotations with fdr <= 0.1, 10% of them are expected to be false discoveries.
msm, rhoSpatial, rhoSpectral, rhoChaos - Scores used to evaluate the annotation's FDR. See this figure for more details.
moleculeNames, moleculeIds - names and database IDs for the putative molecules that match this annotation's formula. See the note below for how to programmatically split these lists if needed.
minIntensity, maxIntensity - the lowest and highest intensity in the first ion image.
totalIntensity - sum of intensities in the first ion image.
colocalizationCoeff - if a "colocalized with" filter is applied, this column contains the colocalization coefficient to the molecule used for comparison.
offSample - Result from running the OffsampleAI image classification model to check whether the ion image looks like it is off-sample. true means that it looks off-sample, false means that it looks on-sample.
rawOffSampleProb - The predicted probability that the image is off-sample. Values higher than 0.5 are considered off-sample.
isomerIons - A comma-separated list of the ion values of other annotations that are isomeric (i.e. identical isotopic m/zs).
isobarIons - A comma-separated list of the ion values of other annotations that are isobaric (i.e. different but overlapping isotopic m/zs).
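
As an illustration of how the fdr and offSample columns can be combined, the sketch below keeps only annotations at or below an FDR threshold that are not flagged as off-sample. The sample rows are made up, the helper name confident_on_sample is hypothetical, and the code assumes that offSample is written to the CSV as the text "true" or "false".

```python
import csv
import io

# Made-up sample with only the columns used here; real exports have more.
SAMPLE = """formula,fdr,msm,offSample
C6H12O6,0.05,0.91,false
C5H9NO4,0.2,0.40,false
H2O,0.1,0.75,true
"""


def confident_on_sample(rows, max_fdr=0.1):
    """Keep rows with fdr <= max_fdr that are not flagged as off-sample."""
    kept = []
    for row in rows:
        # Assumption: offSample is exported as the text "true" or "false".
        is_off_sample = row["offSample"].strip().lower() == "true"
        if float(row["fdr"]) <= max_fdr and not is_off_sample:
            kept.append(row)
    return kept


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
selected = confident_on_sample(rows)
print([row["formula"] for row in selected])
```

Values read with the csv module are strings, so numeric columns such as fdr need an explicit float() conversion before comparison.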

Loading the .csv file

Comment lines

The CSV files exported by METASPACE contain a timestamp and a link to the source data in the first two lines of the file. This can cause issues with some CSV loaders.

To load a METASPACE CSV export with pandas, pass the skiprows=2 argument so that the two comment lines are skipped:

import pandas as pd
annotations = pd.read_csv('metaspace_annotations.csv', skiprows=2)

In plain Python:

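from csv import DictReader
# newline='' lets the csv module handle line endings; the file is closed automatically
with open('metaspace_annotations.csv', newline='', encoding='utf-8') as f:
    next(f)  # skip the first comment line
    next(f)  # skip the second comment line
    annotations = list(DictReader(f))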
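
If the two leading lines are worth keeping, for example to record when and from where the data was exported, they can be returned alongside the parsed rows. The following is a minimal sketch: load_metaspace_csv is a hypothetical helper, and the demo file contents are placeholders, not the real format of the leading lines.

```python
import csv
import os
import tempfile


def load_metaspace_csv(path, comment_lines=2):
    """Return (comments, rows) for an annotations export.

    comments holds the leading non-CSV lines; rows is a list of dicts.
    """
    with open(path, newline="", encoding="utf-8") as f:
        comments = [next(f).rstrip("\r\n") for _ in range(comment_lines)]
        rows = list(csv.DictReader(f))
    return comments, rows


# Demo on a made-up file; the two leading lines are placeholders only.
demo = (
    "placeholder timestamp line\n"
    "placeholder link line\n"
    "formula,fdr\n"
    "H2O,0.05\n"
)
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8", newline=""
) as tmp:
    tmp.write(demo)

comments, rows = load_metaspace_csv(tmp.name)
os.remove(tmp.name)
print(comments)
print(rows)
```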

Character Encoding

Some spreadsheet programs occasionally misdetect the character encoding of the CSV files, causing names such as 8-hydroxy-2-phenyl-1λ⁴-chromen-1-ylium to appear mangled, e.g. as "8-hydroxy-2-phenyl-1Î»â´-chromen-1-ylium". To fix this, close the file, reopen it, and select Unicode (UTF-8) or UTF-8 as the character encoding when prompted.
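
The mangling can be reproduced in Python, which may help when diagnosing a file that has already been re-saved incorrectly. This sketch assumes the bytes were misread as Latin-1; a spreadsheet program may use a different legacy encoding, such as Windows-1252, in which case the mangled characters can differ slightly.

```python
name = "8-hydroxy-2-phenyl-1λ⁴-chromen-1-ylium"

# Reading UTF-8 bytes as a single-byte encoding yields mojibake:
# each non-ASCII character turns into two or three unrelated characters.
mangled = name.encode("utf-8").decode("latin-1")

# Reversing the wrong decoding step recovers the original text,
# provided no bytes were lost along the way.
repaired = mangled.encode("latin-1").decode("utf-8")

print(repaired == name)
```

When reading the export from code, passing encoding='utf-8' explicitly to open() or pd.read_csv() avoids relying on a platform default.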

Information about specific columns

Molecule IDs

In the Annotations CSV file, each row may specify multiple values in the moleculeNames and moleculeIds columns. The items in these lists are delimited by ", " (a comma followed by a space). If a molecule name itself contains a comma followed by one or more spaces, as the name for HMDB0032389 does, those spaces are removed so that the list can be parsed unambiguously, e.g. "Methyl acrylate-divinylbenzene, completely hydrolyzed, copolymer" would become "Methyl acrylate-divinylbenzene,completely hydrolyzed,copolymer".
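
Given that rule, splitting on the two-character delimiter ", " is enough to recover the individual items. The sketch below uses a hypothetical helper split_list and a made-up second entry ("Example molecule" with the placeholder ID EXAMPLE0000001). Pairing names with IDs by position assumes that both columns list the molecules in the same order, which is not stated above.

```python
def split_list(cell):
    """Split a moleculeNames or moleculeIds cell on the ', ' delimiter."""
    # An empty cell means no items, not one empty item.
    return cell.split(", ") if cell else []


names_cell = (
    "Methyl acrylate-divinylbenzene,completely hydrolyzed,copolymer, "
    "Example molecule"
)
ids_cell = "HMDB0032389, EXAMPLE0000001"  # second ID is a placeholder

names = split_list(names_cell)
ids = split_list(ids_cell)

# Assumption: the two lists are aligned by position.
pairs = list(zip(ids, names))
for molecule_id, molecule_name in pairs:
    print(molecule_id, "->", molecule_name)
```

Splitting on a bare comma instead would break the first name into three pieces, which is why the space is part of the delimiter.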

Off-sample

The rawOffSampleProb column contains the unmodified raw output of the off-sample prediction model as a number from 0.0 to 1.0. For most datasets this number is close to the probability that the ion image is off-sample, but it is not guaranteed to be accurate. Predictions are less accurate for datasets that are too dissimilar to the datasets the model was trained on. Furthermore, the model tends to be overconfident toward the extremes of the scale: for example, a prediction with a rawOffSampleProb value of 0.0000001 may still have a 1-10% chance of being off-sample.
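
Because the raw value is not a calibrated probability, one option is to treat predictions near the 0.5 cut-off as undecided and review those images manually. The following sketch shows that idea; the function name off_sample_label and the 0.1 and 0.9 thresholds are hypothetical choices for illustration, not values recommended by METASPACE.

```python
def off_sample_label(raw_prob, low=0.1, high=0.9):
    """Bucket a rawOffSampleProb value into a coarse label.

    low and high are illustrative thresholds. Even values outside
    them are not guaranteed to be correct, since the model can be
    overconfident at the extremes.
    """
    if not 0.0 <= raw_prob <= 1.0:
        raise ValueError("rawOffSampleProb must be between 0.0 and 1.0")
    if raw_prob >= high:
        return "off-sample"
    if raw_prob <= low:
        return "on-sample"
    return "uncertain"


for value in (0.97, 0.55, 0.02):
    print(value, off_sample_label(value))
```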