Annotation Guidelines - BgeeDB/expression-annotations GitHub Wiki

Expression data annotation

Introduction

Bgee is a database for retrieval and comparison of gene expression patterns across multiple animal species. It provides an intuitive answer to the question "where is a gene expressed?" and supports research in cancer and agriculture as well as evolutionary biology. Bgee is based exclusively on curated "normal", healthy wild-type expression data (e.g., no gene knock-out, no treatment, no disease), to provide a comparable reference of normal gene expression.

In order to be included in Bgee, expression data needs to be annotated with a developmental stage and an anatomical structure.

This public wiki describes how expression data annotation is done in Bgee. It is also meant to make the process easier for new annotators, and may contain GitLab links restricted to Bgee annotators.

This wiki can also be used by the annotators as a place to record pragmatic choices they have made in the annotated data that are likely to come up again in new data (for example: "do we annotate fasted mice as normal condition?"). For any issue regarding the use of the Uberon ontology (for example: "do we annotate human mammary gland with "thoracic mammary gland" (UBERON:0005200) or with "mammary gland" (UBERON:0001911)?"), please check and report on the GitLab: https://gitlab.sib.swiss/Bgee/expression-annotations/-/issues

The repository for annotation files is on the Gitlab:

https://gitlab.sib.swiss/Bgee/expression-annotations/-/tree/develop

Annotation process

Expression datasets are called "Experiments". Experiments are identified by an SRPid (alternatively ERPid or DRPid) in the SRA (e.g. SRP308378). Each experiment has one or more libraries, identified by their unique SRXid (e.g. SRX10176227). Caution: In the SRA, experiments are referred to as "studies" and libraries as "experiments".
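The naming convention above can be sketched in a few lines of Python. This is only an illustration, not one of the Bgee scripts: the function name `classify_accession` is hypothetical, and the pattern assumes that accessions consist of one of the three prefixes mentioned above followed by digits only.

```python
import re

# Prefixes from the text above: SRP/ERP/DRP identify a Bgee experiment
# (an SRA "study"); SRX/ERX/DRX identify a Bgee library (an SRA "experiment").
# Assumption: the identifier is the prefix followed by digits only.
_ACCESSION = re.compile(r"^[SED]R([PX])\d+$")


def classify_accession(accession: str) -> str:
    """Return 'experiment', 'library', or 'unknown' in Bgee terminology."""
    match = _ACCESSION.match(accession.strip())
    if match is None:
        return "unknown"
    return "experiment" if match.group(1) == "P" else "library"


print(classify_accession("SRP308378"))    # experiment
print(classify_accession("SRX10176227"))  # library
```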

The following steps are followed during annotation:

  1. Find datasets (experiments) to annotate
  2. Check that a given experiment and its libraries can be accepted in Bgee (normality criteria and technologies)
  3. Annotate the experiment and the libraries
  4. Commit your annotations to GitLab

We are currently focusing on the annotation of bulk RNA-Seq (here below 'RNA-seq') and single cell RNA-Seq (here below 'scRNA-seq') expression data.

Finding datasets to annotate

To help retrieve and keep up to date the set of experiments to annotate in Bgee, we have developed Python scripts. These scripts retrieve datasets and/or automatically pre-fill the annotation files. You can find the scripts in https://gitlab.sib.swiss/Bgee/scRNA-Seq/-/tree/main/scripts. See the script README files for documentation, instructions and examples.

There is also an in-house developed R script to retrieve experiments for bulk RNA-Seq in https://gitlab.sib.swiss/Bgee/expression-annotations/-/tree/develop/RNA_Seq/utils.

Papers with potential datasets to annotate can also be found on the Slack channels #datasets_to_annotate and #evolexpress (restricted access), or listed in our GitLab issue 90 for bulk RNA-seq and issue 68 for scRNA-seq (restricted access).

Control normality criteria and technologies

Before annotating, it is important to check for each experiment that there are samples with "normal" conditions (no knock-out or mutation, no cancer, no treatment...). The 'Normality criteria and accepted technologies' section below lists what we currently consider as 'normality' and the accepted technologies. The scripts that pre-fill the annotation files include a filtering step for the accepted technologies.

Always try to find the paper that corresponds to each experiment. Often important information is missing from the SRA that can be found in the "Methods" section of the article. Sometimes it is only possible to tell that an experiment does not fulfill the normality criteria by reading the detailed protocol and checking the supplemental files. This is especially important when annotating scRNA-Seq experiments. See the section below: Retrieve papers with expression data.

BE CAREFUL: what is called control does not always correspond to "normal" conditions (heterozygotes, floxed mice, vehicle (PBS, saline, oil...) treated, untreated mutant, sham operated, GFP-expressing...).

Manual annotation of expression data

In this section we give an overview of the files used for annotation and the way annotation fields are filled when annotating an experiment.

There are scripts that help pre-fill the annotation files automatically, see https://gitlab.sib.swiss/Bgee/scRNA-Seq/-/tree/main/scripts. See the script README files for documentation, instructions and examples.

Detailed instructions and tips for the annotation of anatomical structures and developmental stages are given in the sections below.

Please remember to check the page on Bgee accepted technologies before starting your annotations: https://github.com/BgeeDB/expression-annotations/wiki/Annotation-Guidelines#accepted-technologies

RNA-seq data

Datasets generated from pooled cell populations or tissue sections, known as 'bulk' RNA-seq. The source repository by default is SRA, and the accepted libraries are described in this issue:
https://gitlab.sib.swiss/Bgee/expression-annotations/-/issues/29.

To understand the fields available in SRA library description, please look at https://www.ncbi.nlm.nih.gov/snp/docs/submission/hts_submission_formatting_intro_meta_formatting/

RNA-seq data can have different accession prefixes, see http://www.ncbi.nlm.nih.gov/Traces/study/?go=home

Some papers report 'SRAid' instead of SRPid, here the link to retrieve SRX libraries from SRAid experiment

To retrieve information about experiment (= SRA study) and libraries (= SRA experiment), see the example below:

http://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP000001

From this page, you can go to the 'SRA Run Selector' page by clicking on 'Runs' on the right-hand side:

https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP000001
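As a convenience, the two URL patterns shown above can be built from an experiment ID. This is a minimal sketch: the helper names are hypothetical, and it assumes that the patterns shown for SRP000001 apply unchanged to other accessions.

```python
def sra_study_url(experiment_id: str) -> str:
    """URL of the SRA study page, following the example above."""
    return f"http://trace.ncbi.nlm.nih.gov/Traces/sra/?study={experiment_id}"


def sra_run_selector_url(experiment_id: str) -> str:
    """URL of the 'SRA Run Selector' page, following the example above."""
    return f"https://trace.ncbi.nlm.nih.gov/Traces/study/?acc={experiment_id}"


print(sra_study_url("SRP000001"))
print(sra_run_selector_url("SRP000001"))
```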

RNA-seq files used in annotation:

For a detailed description of the columns, see the section below.

RNASeqExperiment.tsv

This file lists the experiments (= SRA study) annotated in Bgee annotation files for all species but C.elegans (worm), see 'RNASeqExperiment_worm.tsv' instead.

RNASeqLibrary.tsv

This file reports the annotation of the libraries (= SRA experiments) annotated in Bgee, with mapping to Uberon and to the developmental stage ontologies, for all species but C.elegans (worm), see 'RNASeqLibrary_worm.tsv' instead.

RNASeqLibrary_not_included.tsv

We use 'RNASeqLibrary_not_included.tsv' to list experiments (for all species but C.elegans) that should not be included in Bgee. The experiments listed in this file have not been annotated, because they are a priori not compatible (see our RNA-seq filters procedure). If an annotated experiment has to be discarded because it was found to be incompatible AFTER annotation, it is commented out (#) directly in the annotation files.

RNASeqLibrary_worm_exclusion.tsv

For C.elegans ('worm') data, we do not discard or comment out libraries in the annotation file, but report them in 'RNASeqLibrary_worm_exclusion.tsv'. We do this because we expect to receive annotations done by WormBase, and we are more stringent than they are: the experiments rejected by Bgee could reappear in later WormBase releases. Recording them avoids redoing this stringent Bgee selection.

('exclusion' means 'already annotated but finally rejected', whereas 'not_included' means 'not to annotate')

scRNA-seq data

Datasets generated from single cells, either as full-length (FL) sequencing or as droplet-based/target-based (TB) sequencing. The annotation of single-cell RNA-seq expression data is done in a similar way to 'bulk' RNA-seq, but involves additional files and a protocol description to report.

See the dedicated page for single-cell RNA-seq here

scRNA-seq files used in annotation:

For a general description of the columns, see the section below.

scRNASeqExperiment.tsv

scRNASeqFLLibrary.tsv or scRNASeqTBLibrary.tsv

For full-length scRNA-Seq and target-based scRNA-Seq respectively.

Detailed description of scRNASeqFLLibrary.tsv available here

scRNASeq_barcode_ExpID.tsv

This file reports barcode information for 10X Genomics single cell RNA sequencing experiments, here some further information. As this file is usually very big, we keep one file per experiment.

scRNASeq_markers_10X.tsv

This file was initially created to help validating the cell type annotation after clustering.

scRNASeqLibrary_not_included.tsv

The file listing the libraries not included (so far), with a comment explaining the reason.

scRNASeqExperiment_worm.tsv

The file reporting the annotated scRNA-seq datasets from C.elegans.

scRNASeqLibrary_worm.tsv

The file where the scRNA-seq libraries from C.elegans are annotated.

Affymetrix data

Since Bgee 12, we decided to retrieve Affymetrix expression data mostly from GEO

For help retrieving data from GEO, see here

General guidelines for expression data annotation

TIP: when you work on an annotation file (opened with OpenOffice Calc (Windows) or LibreOffice (Mac)), also opening the dedicated annotation file (RNASeqLibrary.tsv or FL/TB-scRNASeqLibrary.tsv) in 'read-only mode' helps you see, thanks to the filter options, how previous similar annotations were done.

Table format for expression data annotation

Below is the description of each annotation file for RNA-seq data (scRNA-seq files follow a similar format, which is still in progress).

Important: Annotation files should be opened in OpenOffice calc or LibreOffice and not Microsoft Excel. Using Excel may lead to incorrect data formatting and data loss.

RNASeqExperiment.tsv

Annotation of an experiment, known as a 'STUDY' with accessions in the form of SRP#, ERP#, or DRP# in SRA, and known as a 'Series' (GSE#) in GEO; add new lines for new experiments

Column name:

  • experimentId
  • experimentName
  • experimentDescription
  • experimentSource mostly SRA, but can be GEO as well
  • experimentStatus can be "total" (all the samples were annotated) or "partial" (only some samples were annotated)
  • numberOfAnnotatedLibraries number of libraries reported in the annotation file
  • protocol see 'bulk_kits.csv' for a full information (restricted access)
  • protocolType
  • GSE alias to GEO project
  • Bioproject
  • PMID or DOID, reference to publication
  • PMID_url the url to easily access the publication
  • comment any information you consider useful
  • projectTags for tagging a big project or consortium, such as FAANG, PhyloFish, etc.

RNASeqLibrary.tsv

Annotation of the libraries; add new lines for new libraries

Column name:

  • libraryId is filled with the SRXid (or the ERXid or DRXid, depending on the source)
  • experimentId with the experiment ID (first column of RNASeqExperiment.tsv file)
  • platform
  • SRSId reporting the SRSId allows us to check for technical replicates: if the same SRSid appears associated with 2 different libraryIds, these libraries are technical replicates. In such a case, you have to report a tag in the column replicate so that these libraries can be merged later in the pipeline (for example, put '1' in the column replicate for each of these 2 libraries). Biological replicates are the normal status of Bgee libraries: a 'normal' Bgee library is generated from one biological sample, and is not the result of the same prepared sample (SRSid) sequenced more than once.
    If the SRSid does not appear in the 'SraRun Table' for any reason, you can retrieve it by searching directly for the SRXid, here an example
    https://www.ncbi.nlm.nih.gov/sra/SRX7060169
  • anatId and anatName with the Uberon id and the Uberon name for the anatomical structure (see following section on anatomical structure annotation)
  • stageId and stageName with the stage ID and the stage name from the developmental stage ontologies (see below the section about developmental stage annotation). Be sure that the organ ID exists at the annotated stage. When a species does not have a dedicated developmental stage ontology, we annotate the developmental stages directly with the Uberon metastages used by Bgee, see the top of this page:
    https://github.com/obophenotype/developmental-stage-ontologies/blob/master/external/bgee/report.md
    This page only reports the species currently present in Bgee. If a species is not yet in Bgee, please annotate with an Uberon metastage; when this species is integrated into Bgee, its taxon id will appear on this page. To be sure a species can have a particular stage, for example 'UBERON:0000070 pupal stage', you can check the closest taxon id ('dog' will not have a pupal stage, while 'Chlorops oryzae' will)
  • url_GSM is a useful direct link to the GEO page with sample information (provided in the output file of the script 'create_bulk_RNAseq_tables.py')
  • infoOrgan describe how the organ info is reported from the source (simply report the original info available, can be a copy/paste)
  • infoStage describe how the stage info is reported from the source (simply report the original info available, can be a copy/paste)
  • anatAnnotationStatus and stageAnnotationStatus describe how well the organ and stage annotation could be done considering the ontologies. These columns were devised to keep track of annotations that can potentially be modified whenever the ontologies are updated.

It can be:

    • perfect match - the annotation cannot be improved.

      Examples: the source reports the sample as striatum, the sample is annotated with UBERON:0002435 striatum.

      Examples: GEO doesn't provide information for the stage (unknown), the sample is annotated with the root of the ontology.

    • missing child term - the organ or stage info reported from the source is more precise than the ontology so it had to be annotated with a more general term, but the annotation could be improved if the ontology changed.

      Examples: the source reports the sample as lateral substantia nigra, the sample is annotated with UBERON:0002038 substantia nigra. The source reports the sample stage as 32 days (mouse), the sample is annotated with MmusDv:0000048 4 weeks.

    • other - any other case. For example, the sample is a mixture of organs or stages and we had to move the annotation up to a higher term in the ontology (a 'parent' term grouping just these terms is missing)

      Examples: the source reports the sample stage as 10-12 weeks (mouse), the sample is annotated with MmusDv:0000061 early adult stage.

  • sex is for the sex of the animal the sample was taken from. It can be M (male), F (female), NA (not available, unknown), mixed (both male and female).

  • strain is for the strain of the animal the sample was taken from (see below, strain annotation and harmonization section procedure).

  • genotype

  • speciesID for the taxon such as reported from NCBI Taxonomy database

  • protocol list of currently reported protocols: TruSeq RNA sample Preparation Kit; TruSeq Stranded mRNA; TruSeq Stranded Total RNA

  • protocolType list of currently reported protocol type: full_length

  • RNASelection list of currently reported RNA selection: polyA, ribo-minus, miRNA, lncRNA, circRNA

The information for RNASelection is sometimes only available in the publication, especially for ribo-minus.

Note that 'protocol=TruSeq Stranded Total RNA' is by default 'RNASelection=ribo-minus', because it includes a Ribo-Zero kit, see here the Illumina product sheet

Other kits that imply ribosomal RNA depletion are: RiboMinus kit, RiboErase kit
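The rule of thumb above can be written as a small helper that suggests an RNASelection value from the free-text protocol description. This is only a sketch: the function name and the keyword matching are assumptions, and a missing match only means that the publication still has to be checked by the annotator.

```python
from typing import Optional

# Kit names mentioned in the text above as implying ribosomal RNA depletion.
# Matching on lowercase substrings is an assumption made for this sketch.
RIBO_DEPLETION_KEYWORDS = (
    "truseq stranded total rna",
    "ribo-zero",
    "ribominus",
    "riboerase",
)


def suggest_rna_selection(protocol_text: str) -> Optional[str]:
    """Return 'ribo-minus' when a known depletion kit is named, else None."""
    text = protocol_text.lower()
    if any(keyword in text for keyword in RIBO_DEPLETION_KEYWORDS):
        return "ribo-minus"
    return None  # no hint found: check the publication


print(suggest_rna_selection("TruSeq Stranded Total RNA"))
print(suggest_rna_selection("TruSeq Stranded mRNA"))
```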

  • globin_reduction is used to report information about blood samples where a globin depletion was applied before sequencing, see https://gitlab.sib.swiss/Bgee/expression-annotations/-/issues/74
  • replicate is to report technical replicates by tagging the identical SRSid (see above, SRSId column)
  • sampleTitle
  • PATOid
  • PATOname
  • comment is for annotator's comments (free text). Anything special about the annotation of the experiment should be written there.
  • condition is to report any information about the sample condition
  • annotatorID is for the first two letters of the first name and the first letter of the surname of the person that annotated the sample. Don't forget to change this if you modify the annotation!
  • lastModificationDate is for the most recent date (format year-month-day) the annotation was modified. Don't forget to change this if you modify the annotation!
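The technical-replicate rule described for the SRSId and replicate columns can be checked automatically. The sketch below is not part of the Bgee scripts: the function name is hypothetical, the example identifiers are made up, and numbering additional replicate groups '2', '3', ... is an assumption (the text only gives the example tag '1').

```python
import csv
import io
from collections import defaultdict


def tag_technical_replicates(rows):
    """Fill 'replicate' for libraries that share an SRSId (technical replicates)."""
    by_sample = defaultdict(list)
    for row in rows:
        srs_id = row.get("SRSId", "").strip()
        if srs_id and srs_id != "NA":
            by_sample[srs_id].append(row)
    group = 0
    for libraries in by_sample.values():
        # Same sample, more than one library: technical replicates.
        if len({lib["libraryId"] for lib in libraries}) > 1:
            group += 1
            for lib in libraries:
                lib["replicate"] = str(group)
    return rows


# Made-up identifiers, tab-separated like the annotation files.
EXAMPLE_TSV = (
    "libraryId\texperimentId\tSRSId\treplicate\n"
    "SRX0000001\tSRP0000001\tSRS0000001\t\n"
    "SRX0000002\tSRP0000001\tSRS0000001\t\n"
    "SRX0000003\tSRP0000001\tSRS0000002\t\n"
)

rows = list(csv.DictReader(io.StringIO(EXAMPLE_TSV), delimiter="\t"))
tag_technical_replicates(rows)
for row in rows:
    print(row["libraryId"], row["SRSId"], repr(row["replicate"]))
```

Here the first two libraries share one SRSId and receive the same tag, while the third library is left untagged.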

Automated pre-filling of annotation files

We have Python scripts that retrieve information from SRA and pre-fill some of the columns of the "Experiment" and "Library" annotation files: https://gitlab.sib.swiss/Bgee/scRNA-Seq/-/tree/main/scripts. These scripts allow us to considerably speed up the annotation process. For single-cell RNA-Seq experiments, a summary of the workflow is shown in the image below:

[Image: workflow]

In short, first the "create_scRNAseq_tables" script is used to fill the "Experiment" and appropriate "Library" annotation files (depending on the type of scRNA-Seq protocol). Then, depending on the protocol, you can use:

  • For full-length scRNA-Seq experiments, the "FullLengthlib_reclustering" script to complete the "Library" table with additional information such as cell types after clustering, if available.
  • For target-based scRNA-Seq experiments, the "create_scRNAseq_barcode" script to generate a scRNAseqBarcode table and complete it with additional information, if available.

The script "h5ad_to_tsv" can additionally be used to convert supplementary tables in h5ad format to tsv format.

See the script documentation for more information.

Anatomical structures annotation

We annotate using the anatomical structures from Uberon composite-metazoan ontology: http://purl.obolibrary.org/obo/uberon/composite-metazoan.obo

TIP: During annotation, use the search function to find the UBERON id code that corresponds to the anatomical structure sampled in the study. You can also open the annotation file in read-only mode and search if this structure has already been annotated in a previous experiment.

UBERON

Introduction:

Since Bgee v.12 we are moving our annotation of expression data (this section) and homology assertions (see similarity annotation) to Uberon, an integrated cross-species ontology covering anatomical structures in animals. This chapter emphasizes the special rules for annotating expression data to Uberon.

A description of the file format for Uberon annotation (to use for all expression data annotation) is available here


It is really important, when annotating an experiment, that the annotated anatomical structure DOES exist at the annotated developmental stage.

When part of an organ was not included in the sample (brain without cerebellum for example), annotate with the organ (brain) and put 'other/partial sampling' for the annotation and biological status respectively (see organAnnotationStatus and organBiologicalStatus below). We might find a better way to convey this information later.

Special Cases

We annotate unfertilized egg with CL:0000025 egg cell.

Do not annotate kidney samples with UBERON:0002113 kidney (unless there is no indication of developmental stage and so there is no way to know which type of kidney it is). Annotate with UBERON:0002120 pronephros, UBERON:0000080 mesonephros or UBERON:0000081 metanephros. This is also valid for all kidney substructures (do NOT annotate with UBERON:0000074 renal glomerulus but with UBERON:0005325 mesonephric glomerulus unless there is no indication of developmental stage). Annotate a mammalian adult kidney with UBERON:0000082 adult mammalian kidney.

Muscle should be annotated with UBERON:0002385 muscle tissue. Skeletal muscle should be annotated with UBERON:0001134 skeletal muscle tissue. Uberon has several closely related terms: UBERON:0003663 hindlimb muscle, UBERON:0001383 muscle of leg, UBERON:0003270 skeletal muscle of leg... and there might be more. Be careful to choose the term that fits best.

Breast in human should be annotated as UBERON:0005200 thoracic mammary gland.

A sample of skin should be annotated with UBERON:0000014 zone of skin.

If infoOrgan=NA (missing info for the organ):

  • and infoStage=NA (missing info for the stage), annotate the organ with UBERON:0000465 material anatomical entity (see https://gitlab.sib.swiss/Bgee/expression-annotations/-/issues/25)
  • and infoStage=adult, annotate the organ with UBERON:0007023 adult organism
  • and infoStage=embryo, annotate the organ with UBERON:0000922 embryo
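The three fallback rules above can be summarized as a lookup table. This is a minimal sketch with a hypothetical function name; it assumes that infoOrgan and infoStage contain exactly the values 'NA', 'adult' or 'embryo', and it returns None for every other combination, which is then left to the annotator.

```python
from typing import Optional, Tuple

# Fallback organ annotation when infoOrgan is NA, keyed by infoStage
# (IDs and names taken from the rules above).
FALLBACK_ORGAN = {
    "NA": ("UBERON:0000465", "material anatomical entity"),
    "adult": ("UBERON:0007023", "adult organism"),
    "embryo": ("UBERON:0000922", "embryo"),
}


def fallback_organ(info_organ: str, info_stage: str) -> Optional[Tuple[str, str]]:
    """Return (anatId, anatName) when the organ info is missing, else None."""
    if info_organ != "NA":
        return None  # organ info is available: annotate it normally
    return FALLBACK_ORGAN.get(info_stage)


print(fallback_organ("NA", "adult"))
print(fallback_organ("NA", "NA"))
```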

Note that Uberon contains classes specific to insects. Please check them before annotating with FBbt anatomical structures; for example, Drosophila adult head is annotated with UBERON:6003007 insect adult head.

Whole organism adult or whole body adult should be annotated with UBERON:0007023 adult organism. Follow same logic for embryo, larva...

ISSUE: some Drosophila annotations (RNA-seq) are done on Carcass, while actually all sampled anatomical structures are de facto dead... (The organ info 'Carcass' is so far (August 2017) only retrieved for Drosophila RNA-seq data)

--> TO REANNOTATE ?

[Term]
id: UBERON:0008979
name: carcass
namespace: uberon
def: "A body of a multi-cellular organism that is no longer living." [UBERON:cjm]
synonym: "cadaver" RELATED [BTO:0001965]
synonym: "dead body" RELATED [BTO:0001965]
xref: BTO:0001965
xref: C113674
xref: http://www.snomedbrowser.com/Codes/Details/127853004
is_a: UBERON:0000468 ! multi-cellular organism
relationship: has_quality PATO:0001422 ! dead

→ see issue https://gitlab.sib.swiss/Bgee/expression-annotations/-/issues/16

Broca's area annotated with UBERON:0001870 frontal cortex

Hippocampus annotated with UBERON:0001954 Ammon's horn


ANiknejad 09:15, 24 January 2013 (UTC): how to annotate snout epidermis? With UBERON:0001003 epidermis or with UBERON:1000015 skin of snout? In each case it will be + missing child term

ANiknejad 13:53, 5 February 2013 (UTC): cf MRR's mail of 5th Feb, caution about mammalian blastocyst (UBERON:0000358) and metazoan blastula (UBERON:0000307)

"The point is that the blastula and the blastocyst have different names, not because of some accident of history of naming these structures, but because they are fundamentally different embryological structures. The blastocyst is a mammalian innovation, which consists of an early separation of embryoblast and trophoblast (part of extraembryonic tissue). Zebrafish does not have extraembryonic tissues at all (except for the yolk sack) since it is not an amniote."

Developmental stages annotation

We annotate using the species specific developmental ontologies that are hosted at https://github.com/obophenotype/developmental-stage-ontologies/tree/master/src

Have a look at the wiki documentation (by Chris Mungall)

 https://github.com/obophenotype/developmental-stage-ontologies/wiki

This is a repository for the collection of species-specific stage ontologies, primarily for the developers of these ontologies. Uberon (Chris Mungall) and Bgee team are the main contributors.

Not all links are necessarily well registered:

use this link for the last official Zebrafish anatomy and development ontology.

use this link for the last Xenopus anatomy and development ontology.

For Worm developmental stage ontology see here the ontology or here the related wiki

Where to find developmental ontologies information

ZFS developmental staging series here

Miscellaneous

When there is no indication of a stage, write NA for infoStage and annotate with the root of the developmental ontologies (perfect match/partial sampling)

Note that 'full sampling' is not applied to 'stageBiologicalStatus'. 'Partial sampling' is used to highlight 'other' or 'missing child term', otherwise 'not documented' is applied to 'stageBiologicalStatus' by default.

If the experiment is a mix of different stages, it must be annotated with the stage that is higher in the ontology and includes the different stages. For example: a mouse experiment that mixes Theiler Stages 11 (neurula) and 15 (Embryonic age 9.5 dpc, during organogenesis) must be annotated with embryonic mouse stage (MmusDv:0000002).
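Finding "the stage that is higher in the ontology and includes the different stages" amounts to finding the closest common ancestor of the mixed stages. The sketch below illustrates the idea on a toy hierarchy: the term names and the parent links are entirely hypothetical and are NOT taken from any real stage ontology, and it assumes a single parent per term, whereas the real ontologies are graphs with several relation types.

```python
# Hypothetical toy hierarchy (child -> parent), for illustration only.
TOY_PARENT = {
    "TOY:life_cycle": None,
    "TOY:embryonic": "TOY:life_cycle",
    "TOY:postnatal": "TOY:life_cycle",
    "TOY:neurula": "TOY:embryonic",
    "TOY:organogenesis": "TOY:embryonic",
    "TOY:adult": "TOY:postnatal",
}


def lineage(term, parent):
    """Return the term followed by all its ancestors, up to the root."""
    path = []
    while term is not None:
        path.append(term)
        term = parent[term]
    return path


def common_stage(terms, parent):
    """Return the most specific term that includes all the given stages."""
    terms = list(terms)
    shared = set(lineage(terms[0], parent))
    for term in terms[1:]:
        shared &= set(lineage(term, parent))
    # Walk up from the first term: the first shared term is the most specific.
    for candidate in lineage(terms[0], parent):
        if candidate in shared:
            return candidate
    raise ValueError("the terms have no common ancestor")


print(common_stage(["TOY:neurula", "TOY:organogenesis"], TOY_PARENT))
print(common_stage(["TOY:neurula", "TOY:adult"], TOY_PARENT))
```

In practice the result still has to be checked by the annotator in the real ontology.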

A link to Carnegie Stages and species comparison, cf http://php.med.unsw.edu.au/embryology/index.php?title=Carnegie_Stage_Comparison

Be careful: the same name in different ontologies does not necessarily cover the same period of time. For instance, 'adult' is the mature stage in human, but in other ontologies it must not be confused with 'adulthood', which has a time interval (it in fact means 'breeding time'). So 'adult' without time information has to be mapped to 'mature stage' in ontologies other than human.

Special cases

Note that the roots of the species-specific developmental stage ontologies (named 'xxx life cycle') are currently mapped to UBERON:0000104 life cycle. WBls is an exception: WBls:0000002 (name: all stages Ce) is mapped to UBERON:0000105 life cycle stage. So when there is no information about C.elegans stages, please annotate directly with UBERON:0000104 life cycle.

How to deal with age continuity: there is some overlap between developmental stages (e.g. 4 weeks old = over 28 and under 35 days; 5 weeks old = over 35 and under 42 days). In these scenarios we round up (e.g. if your sample is 35 days old, we annotate it as 5 weeks old). Generally, we treat the lower bound as inclusive and the upper bound as exclusive (the 4-week-old stage applies when 28 days <= age < 35 days).
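The inclusive lower bound and exclusive upper bound correspond to an integer division by 7. A minimal sketch, with a hypothetical function name:

```python
def week_stage(age_in_days: float) -> int:
    """Return N such that the 'N weeks' stage applies: 7*N <= age < 7*(N+1)."""
    if age_in_days < 0:
        raise ValueError("age must not be negative")
    return int(age_in_days // 7)


for days in (28, 32, 34, 35):
    print(days, "days ->", week_stage(days), "weeks")
```

With this rule, 28, 32 and 34 days all map to 4 weeks, and 35 days maps to 5 weeks.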

28 years since birth annotated with HsapDv:0000122 (28 years) is tagged perfect match (stageAnnotationStatus) and not documented (stageBiologicalStatus)

21 and 29 and 51 and 55 years since birth annotated with HsapDv:0000087 (adult stage) is tagged other (stageAnnotationStatus) and partial sampling (stageBiologicalStatus) (other means here missing parent)

How to annotate human prenatal stages

It is sometimes difficult to interpret human 'weeks' for embryo and fetus: weeks can be weeks of gestation (counted from the LMP, last menstrual period) or weeks of development (counted from fertilization). Embryos are usually described by 'week of development', following the Carnegie stage descriptions, while fetuses are mostly described by 'week of gestation', the usual medical term for human pregnancy. So, without any other information, a 'fetus, 22 wks' could be either 22 weeks post-fertilization or 22 weeks LMP (that is, 20 weeks post-fertilization). Check the human stages ontology to see that HsapDv:0000200 sixth LMP month includes both possibilities (19-23 weeks post-fertilization = 20th to 24th week post-fertilization). A 'fetus, 24 wks', on the other hand, is mapped to HsapDv:0000037 fetal stage, because it overlaps HsapDv:0000200 sixth LMP month and HsapDv:0000201 seventh LMP month.

Carnegie stages are given in days or weeks post-fertilization (precise developmental time), but the literature usually provides gestational time. There is a difference of 2 weeks. Here is an example of conversion from gestational weeks to 'wpf' (weeks post-fertilization), with the resulting mapping:

gestational age 9.6 (7.6 wpf = 53.2 days) -> HsapDv:0000028 Carnegie stage 21 (property_value: start_dpf "53.0" xsd:float)
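The conversion in the example above can be reproduced as follows. This is a sketch with a hypothetical function name; it assumes, as in the example, that the gestational age is given in decimal weeks and that the offset is exactly 2 weeks.

```python
LMP_OFFSET_WEEKS = 2.0  # difference between gestational and post-fertilization time


def gestational_to_post_fertilization(gestational_weeks: float):
    """Return (weeks post-fertilization, days post-fertilization)."""
    wpf = gestational_weeks - LMP_OFFSET_WEEKS
    return round(wpf, 2), round(wpf * 7, 1)


wpf, dpf = gestational_to_post_fertilization(9.6)
print(wpf, dpf)     # 7.6 53.2
print(dpf >= 53.0)  # True: at or after the start_dpf given above for Carnegie stage 21
```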

Questions to discuss and answer in the future

For now we have reannotated with Uberon using embryonic stages and not adult as before, and the sex of the embryo if known.

ANiknejad 13:14, 27 August 2013 (UTC) sounds like it is not the case for all samples in the 'affymetrixChip OK' sheet, when no information is given about the gestational time, the organ (placenta) is reported on the mother stage

  • How do we annotate when there's no information given about the stage but it's human volunteers or pregnant mice? Should it be automatically adult for human samples? Sexually mature for mice when pregnancy is mentioned? YES, sounds logical for pregnant organisms.

For now these samples have been annotated as unknown (root of the developmental stage ontologies and perfect match/partial sampling) → TO CHECK

How to annotate terms absent from the ontologies

If an anatomical structure does not exist in Uberon ontology, we can request a new term via Uberon issue tracker

 https://github.com/obophenotype/uberon/issues

If a developmental stage does not exist in the current ontologies:

  • human (hsapdv.obo)
  • mouse (mmusdv.obo)
  • rat (rnordv.obo)
  • cow (btaudv.obo)
  • lizard (acardv.obo)
  • chicken (ggaldv.obo)
  • gorilla (ggordv.obo)
  • opossum (mdomdv.obo)
  • rhesus macaque (mmuldv.obo)
  • platypus (oanadv.obo)
  • bonobo (ppandv.obo)
  • orangutan (ppygdv.obo)
  • chimpanzee (ptrodv.obo)
  • pig (sscrdv.obo)
  • medaka (olatdv.obo)
  • platynereis (pdumdv.obo)

edit the corresponding .obo file:

 https://github.com/obophenotype/developmental-stage-ontologies/tree/develop/src

For a new term in ZFS, create a new issue here:

 https://github.com/obophenotype/developmental-stage-ontologies/issues

and cc following people: [email protected], [email protected], [email protected],

For a new term in XAO, ask here:

 https://github.com/xenopus-anatomy/xao/issues

For a new term in FBdv, use:

 https://github.com/FlyBase/drosophila-anatomy-developmental-ontology/issues

For requests on WBls (C.elegans), you can try contacting here:

 https://wormbase.org//#012-34-5

Normality criteria and accepted technologies

Data "normality"

In some cases we had to decide whether or not to include an experiment. Here are some of the choices we have made, to serve as a guide if the same cases appear again:

Note that depending on the data type (Affy, RNA-seq, scRNA-seq), these criteria are more or less stringently considered (Yes/No), due to data availability (RNA-seq data are rare on some model organisms).

For GTEx RNAseq data, see this document for the cleaning process (for example, we excluded human libraries with a BMI greater than 35, which means that we excluded morbidly obese people but included 'normal' obese people)

| Experiment | Include into Bgee? | Comments | Example |
| --- | --- | --- | --- |
| BMI (Body Mass Index) from 18.5 to 25 (or less than 30) (or less than 35) | Yes | In the normal range or overweight (overweight is common); see https://en.wikipedia.org/wiki/Body_mass_index to exclude other ranges (Sept. 2019 update: see the comment above about GTEx samples; we have to consider that BMI values are continuously rising in the human population). Weight differences between individuals may actually be part of the natural variability | SRP046752 |
| Cell lines (3T3-L1, HeLa, MCF-7) and cell cultures | No | | |
| Fasted animals | Yes | If the fasting time is reasonable: a mouse fasting duration of 5–6 h might offer a better comparison to humans overnight (16–18 h), see PMID:24025567 | E-GEOD-7137 |
| Dark/light circadian rhythms and temperature variation | Yes/No | Depends on the animal; yes if reasonable for its physiology, as in the example | GSE23528 |
| Low or high fat diet for a short time | Yes | If the diet time is short (3 days), it can be considered part of the variability found in wild animals | E-GEOD-8524 |
| Mammary glands from virgin, pregnant and lactating females | Yes | From all types of females | E-TABM-199 |
| Oocytes at different stages of maturation | Yes | Without including the info on the stage (except for Drosophila, where the different maturation stages are present in the ontology) | E-GEOD-3351 |
| Placenta and extraembryonic components during development | Yes | Be careful whether to put it in the adult or the embryo! | E-GEOD-7674 |
| Injury | No | | |
| Animals selected for their behaviour (e.g. fear) | Yes | Part of the natural variability | E-GEOD-4035 |
| Animals from different strains (e.g. C57BL/6, BALB/c...) | Yes | Part of the natural variability, see also the section below 'Strain annotation and strain harmonization' | |
| Intestinal germ-free animals | No | Normal animals have an intestinal flora | E-GEOD-5156 |
| Removal of the eye (monocular enucleation) or cochlea on one side; the eye, visual cortex or cochlea on the other side were analysed | No | | E-GEOD-4265 |
| Cell types (T-cells, stem cells...) | Yes | Should be included only if the ontologies are precise enough (e.g. T-cell in zebrafish). If the ontology is not precise enough, store the experiment ID in the "not_included_for_now" file | |
| Polysomal RNA only hybridized | No | Method that pellets the polyribosomes while leaving the mono and non-polysomal mRNA fractions in the supernatant | E-GEOD-3962 |
| Anesthesia | No/Yes | Similar to drug treatment (but sometimes anesthesia is simply not described, and anyway has to be accepted for human tissue samples) | |
| Human post-mortem tissues | Yes | Often the simplest way to get human tissues | |
| Light impulse to stress the animals | Yes | Stress is probably usual for lab animals | |
| Killed by cervical dislocation or decapitation | Yes | Common for mouse | |
| Killed by inhalants (CO2) | Yes | Common for mouse | |
| Killed by exsanguination under CO2 anesthesia | No | Used for mouse lung retrieval | |
| Killed by intravenous anesthetic | No | Method of killing is sometimes simply not described | |
| Killed by intracardiac or intraperitoneal injection | No | Method of killing is sometimes simply not described | |
| Mock-treated (plasmid, surgery...) | No/Yes | Typical 'control' that is not a normal condition, to be checked. Can be acceptable when this is, for example, 'mock inoculated in a similar manner with Minimal Essential Medium' (SRP061418), but rejected when involving experimental surgery (SRP039511) | SRP061418, SRP039511 |
| Normal adjacent tissues from tumor | No/Yes | The information is sometimes difficult to get; be careful if both tumor and normal samples come from the same patient (paired samples). We sometimes also consider the data because no other data are available for these tissues (described in comments) | |
| Treatment with Alkaline Hypochlorite Solution ("Bleaching") of C. elegans worms: larvae are collected | Yes | Used for synchronizing C. elegans cultures, see here | |
| Transgenic strains | No/(Yes) | Not natural strains. We exclude all induction lines where an injection is needed (e.g. tamoxifen), but depending on the data type, we now (June 2022) accept constitutive reporter genes (e.g. GFP, YFP) used in interesting/big experiments (see the dedicated wiki for scRNA-seq, and see the previous Slack discussion here). Accepted scRNA-seq experiments: SRP109266, SRP108034; rejected bulk RNA-seq experiment: SRP016132 | SRP109266, SRP108034, SRP016132 and PMID:23284293 |

Accepted technologies

As curators, we need to know which protocols and treatments are acceptable for the Bgee pipeline at the time we perform the annotation, both for bulk RNA-Seq and scRNA-Seq.

  • A list of current bulk RNA-Seq technologies is available here

  • The list of biotypes (as defined by Ensembl) currently in the Bgee pipeline is available in this file

  • Criteria to accept/reject SRA libraries are reported in this Google spreadsheet, valid for both bulk RNA-Seq and scRNA-Seq.

  • Single-cell protocols summary is available here

Strain annotation and strain harmonization

We report the original information in our annotation files, but the original data often describe the same strain in inconsistent ways. We therefore use a controlled vocabulary to report strains, and we follow the UniProt process for strain annotation (referee at Swiss-Prot is [email protected]):

Procedure: during the annotation process, each strain in its original presentation is compared to the 'strains.tx' file. We aim to report the preferred term, the one not between square brackets. Be aware that the same strain name is sometimes used for different species: check the species name first. If there is no match between the strain name you have to report and the 'strains.tx' file, simply report the information as it is. We have generated a mapping file to help harmonize the strain annotation, see 'StrainMappingFile.xlsx'.

(It may occur that some genotypes are actually reported in the strain column for old annotated expression data, because we introduced the genotype column only when starting annotation of single-cell RNA-seq, see the dedicated wiki for scRNA-seq. The 'StrainMappingFile.xlsx' may process this information too.)

Note that a 'subspecies' (as defined by NCBI Taxonomy (https://www.ncbi.nlm.nih.gov/taxonomy)) is reported as strain, with the parent species being the 'speciesId'
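The lookup described in the procedure can be sketched as follows. The dictionary is a hypothetical in-memory stand-in for the controlled vocabulary (the actual layout of 'strains.tx' and 'StrainMappingFile.xlsx' is not described here), and the synonym entries are made up for illustration.

```python
# Hypothetical stand-in for the controlled vocabulary: maps
# (species, lower-cased original strain name) to the preferred term.
# Both the structure and the entries are assumptions.
PREFERRED = {
    ("Mus musculus", "c57bl/6"): "C57BL/6",
    ("Mus musculus", "c57bl6"): "C57BL/6",
    ("Mus musculus", "balb/c"): "BALB/c",
}


def harmonize_strain(species, original):
    """Return the preferred strain term, or the original value if no match.

    The species is part of the key because the same strain name can be
    used for different species.
    """
    key = (species, original.strip().lower())
    return PREFERRED.get(key, original)


print(harmonize_strain("Mus musculus", "c57bl6"))
print(harmonize_strain("Danio rerio", "c57bl6"))
```

In this sketch an unmatched name is returned unchanged, mirroring the rule to report the information as it is when there is no match.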

We also used the literature to help define species-specific lists of strains. The resources we used as references are listed below:

For Human:

FDA suggested minimal set of standard terms

AMERICAN INDIAN OR ALASKA NATIVE, ASIAN, BLACK OR AFRICAN AMERICAN, NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER, and WHITE

https://www.pharmasug.org/proceedings/2015/SS/PharmaSUG-2015-SS06.pdf

For Mouse:

https://www.ncbi.nlm.nih.gov/pubmed/18432639

For Drosophila:

https://bdsc.indiana.edu/stocks/wt/wild-type.html

https://wiki.flybase.org/wiki/FlyBase:Stocks

For Zebrafish (7955):

https://zfin.org/action/feature/wildtype-list

For Macaca mulatta:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5131817/

Fishes: see https://www.fishsource.org/ or https://www.fishbase.se/summary/3228

Poecilia reticulata (guppy):

https://onlinelibrary.wiley.com/doi/full/10.1111/j.1558-5646.2011.01495.x

Natural guppy populations can be divided into two ecotypes (Endler 1995; Reznick et al. 1996; Magurran 2005). High-predation populations are usually found in the downstream reaches of rivers, in which they coexist with predatory fishes that have strong effects on guppy demographics (Reznick et al. 1996, Rodd and Reznick 1997). Low-predation populations are typically found in upstream tributaries above barrier waterfalls, in which most predatory fishes are absent. Guppy coloration is subject to strong natural and sexual selection in Trinidadian streams.

Medaka (Oryzias latipes):

https://www.ebi.ac.uk/birney-srv/medaka-ref-panel/about_medaka.html

https://shigen.nig.ac.jp/medaka/

see also

https://www.sciencedirect.com/science/article/pii/S0925477304000954

Dog (Canis lupus familiaris):

https://www.akc.org/dog-breeds/

Horse:

The first horse genome sequenced was a Thoroughbred [1], but recently several other breeds have been sequenced, including the Arabian, Duelmener, Hanoverian, Icelandic, Norwegian Fjord, Przewalski, Quarter Horse, Sorraia, and Standardbred [3–5]

http://www.bigrunwolfranch.org/horses.html

https://www.cotswoldfarmpark.co.uk/farming/animal-breeds/

Farm animals, see:

http://dagris.ilri.cgiar.org/fr/about

Rabbit:

https://arba.net/

Miscellaneous

Annotating other data types

EST data

see also the Tips from the expert

  • Consider UniGene EST library for the given species

Keyword: uncharacterized histology: DO NOT CONSIDER the library. This keyword is never co-present with Keyword: normal, so ONLY consider libraries with Keyword: normal.

  • Check that the condition is normal. To get more information, you may need to refer to the file library.report, where all details on the UniGene library are displayed. Be careful: in this file, the library ID is not the number just after the ">", but the number after "dbEST lib id: ". Note that the UniGene description is reported in Bgee, so if library.report and UniGene conflict, follow the UniGene information for consistency (e.g. dbEST lib id: 12373, Keyword: embryonic tissue, Description: Non-normalized full-length enriched library from pooled mouse embryonic limb, maxilla and mandible: this is mapped on embryo and NOT on oral region, even though the Tissue description in library.report only mentions maxilla and mandible)
  • Fill the file "annotation_libs__species_.txt":
    • First column is the library ID.
    • Second is the stage ID.
    • Third is the organ ID.
    • Fourth is the library name.
  • If the library can't be included ("abnormal"), add its ID to the "not_included" file, with a short description of why it can't be included. If it could be included, but for any reason the insertion is not possible, store the library ID and a description of the problem in the "not_included_for_now" file.
  • The file "not_annotated_libs__species_.txt" contains the library ID and number of sequences for unannotated libraries. Most of them were not annotated because they have a small number of sequences.
 Human EST libraries were only roughly mapped on the new dev Human ontology for Bgee v.11 (the EST re-mapping was missed...); the mapping was then done after v.11 and included in v.12
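The two mechanical points above (picking the right ID in library.report, and the four-column annotation line) can be sketched as below. The sample entry text, the tab separator and the `STAGE_ID`/`ORGAN_ID` placeholders are assumptions made for illustration; only the "dbEST lib id: " label and the column order come from the text.

```python
import re

# The library ID is the number after "dbEST lib id: ", not the one after ">".
DBEST_ID = re.compile(r"dbEST lib id:\s*(\d+)")


def dbest_library_id(entry):
    """Extract the dbEST library ID from one library.report entry, or None."""
    match = DBEST_ID.search(entry)
    return match.group(1) if match else None


def annotation_row(library_id, stage_id, organ_id, library_name):
    """Build one line: library ID, stage ID, organ ID, library name.

    A tab separator is assumed here.
    """
    return "\t".join([library_id, stage_id, organ_id, library_name])


# Made-up entry, for illustration only.
entry = ">401\nName: example library\ndbEST lib id: 12373\nKeyword: embryonic tissue\n"

lib_id = dbest_library_id(entry)
print(annotation_row(lib_id, "STAGE_ID", "ORGAN_ID", "example library"))
```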

smiRNAdb

  • extra/pipeline/curation/miRNA/mapping_libs_smiRNAdb.xls
  • data file to download and check (information (if any) is in column C)
 http://www.mirz.unibas.ch/cloningprofiles/resources/S.xls 

  • organ list on the site (click on 'sample composition')

http://www.mirz.unibas.ch/cloningprofiles/

Tips from the expert

Mouse

  • NOD mouse = Non-Obese Diabetic mouse, SCID mouse (immunodeficient), Satin, beige and SB/LeJ are mutant mouse strains.
  • Mouse developmental stages can be given as, for example, E15.5, which means 15.5 dpc (days post-coitum, i.e. days after fertilization). You have to use the corresponding Theiler stage (in this example, E15.5 still corresponds to TheilerStage23 (15 dpc)).
  • Mouse developmental stages and number of cells, see here


Human

  • For human EST data, it may be useful to have a look at this site here
  • For annotation of human prenatal stages, please see the specific section
  • It is sometimes difficult to interpret human 'weeks' for embryo and fetus: weeks can be weeks of gestation (counted from the LMP, last menstrual period) or weeks of development (counted from fertilization). Embryos are usually described by 'week of development', following the Carnegie stage descriptions, and fetuses are mostly described by 'week of gestation', the usual medical term for human pregnancy. So, without any other information, a 'fetus, 22 wks' could be either 22 weeks post-fertilization or 22 weeks LMP (i.e. 20 weeks post-fertilization). Check the human stages ontology: HsapDv:0000200 sixth LMP month includes both possibilities (19-23 weeks post-fertilization = 20th to 24th week post-fertilization). A 'fetus, 24 wks', on the other hand, is mapped on HsapDv:0000037 fetal stage, because it overlaps HsapDv:0000200 sixth LMP month and HsapDv:0000201 seventh LMP month
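The reasoning about ambiguous fetal weeks can be checked with a small sketch. It assumes the usual two-week offset between weeks of gestation (LMP) and weeks post-fertilization, and uses only the 19-23 weeks post-fertilization range quoted above for the sixth LMP month; the function names are hypothetical.

```python
# 19-23 weeks post-fertilization, the range quoted above for
# HsapDv:0000200 (sixth LMP month).
SIXTH_LMP_MONTH = range(19, 24)


def post_fertilization_candidates(weeks):
    """An unqualified 'N wks' may be N weeks of development, or
    N weeks of gestation, i.e. N - 2 weeks post-fertilization (assumed offset)."""
    return (weeks, weeks - 2)


def fits_sixth_lmp_month(weeks):
    """True only if both readings fall inside the sixth LMP month."""
    return all(w in SIXTH_LMP_MONTH for w in post_fertilization_candidates(weeks))


for weeks in (22, 24):
    print(weeks, post_fertilization_candidates(weeks), fits_sixth_lmp_month(weeks))
```

For 22 weeks both readings (22 and 20) fall in the range, so the precise term can be used; for 24 weeks one reading falls outside it, which is why the broader fetal stage is used instead.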

Drosophila

  • Drosophila is a holometabolous insect with 4 developmental stages: egg (24 h) -> larva (5 days) -> pupa (5 days) -> adult. The larval stage is subdivided into two steps: larva + 24h and larva + 48h. Hatching (from the egg) leads to a larva; eclosion (from the pupa) leads to an adult.
  • Unfertilised eggs are annotated with development stage FBdv:00005287 (unfertilized egg stage)
  • Stage 14 oocytes are annotated with development stage FBdv:00005369 (adult stage) and the female reproductive system structure FBbt:00005283: stage S14 oocyte E-MEXP-2746
  • Drosophila cycles and stages, see here
  • We do not consider GFP-labeled clones E-GEOD-23344

Zebrafish

  • Follow this link for listed wild-types
  • Follow this link for matching hpf (=hours post fertilization) and development stages

C.elegans

Mapping BDGP terms to FBbt terms

http://insitu.fruitfly.org/cgi-bin/ex/insitu.pl

New BDGP terms have to be mapped to FBbt terms at each Bgee release.

See extra/pipeline/curation/BDGP/BDGP_terms_to_FBbt_terms.xls

Column G (match_source)

  • "bgee" means new term automatically mapped based on name (to check anyway)
  • "nomatch" means new term to be mapped manually on FBbt term
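As a rough illustration of how the match_source column drives the work, the sketch below splits rows into those to double-check and those to map by hand. The row structure and the term names are hypothetical; only the two match_source values come from the list above.

```python
def split_by_match_source(rows):
    """Separate automatically mapped rows (to check) from unmapped rows (to map manually)."""
    to_check = [r for r in rows if r["match_source"] == "bgee"]
    to_map = [r for r in rows if r["match_source"] == "nomatch"]
    return to_check, to_map


# Made-up rows, for illustration only.
rows = [
    {"bdgp_term": "term_a", "match_source": "bgee"},
    {"bdgp_term": "term_b", "match_source": "nomatch"},
    {"bdgp_term": "term_c", "match_source": "nomatch"},
]

to_check, to_map = split_by_match_source(rows)
print(len(to_check), "to check,", len(to_map), "to map manually")
```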

How to map?

Use OLS - Ontology Lookup Service to easily search for the BDGP term in the FBbt ontology (automatic completion in the 'Term Name' field). Note that the FBbt ontology does not have stages associated with anatomical structures, but developmental information is sometimes available in the "definition" of the structure. This BDGP page also explains how to distinguish the different structures according to stage range. Looking at the parents of a FBbt structure allows you to check whether the stage is correct (an 'anlage' should be the parent of a 'primordium', for instance).

Note that not all BDGP terms can be mapped on FBbt terms. Some terms only give information on the staining (e.g. 'ubiquitous'). It is possible to access the images corresponding to the BDGP_id (see example here), but it is still tricky to perform the mapping based on these images.

Export the xls file into txt file

At each new release of Bgee, you need to export the xls file into a txt file, for the developers to integrate your work into the database.

Please use OpenOffice instead of Excel to do it. Excel introduces too many unwanted changes during the export, such as adding quotes around every text field.

  • Open your xls file in OpenOffice
  • File > save as
  • file type: choose "text CSV (.csv)"
  • Click "save as"
  • Confirm that you want to export in text file
  • Choose the appropriate chars encoding (UTF-8 for instance)
  • Choose {tab} as field separator
  • leave empty the text separator (no quotes nor single quotes)
  • save the file
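After exporting, a quick sanity check of the text file can catch the problems mentioned above. The sketch below is a heuristic, not part of the Bgee pipeline: it assumes a tab-separated file and simply flags double quotes and rows with differing column counts.

```python
def check_export(text):
    """Return a list of problems found in tab-separated export text."""
    problems = []
    lines = [line for line in text.splitlines() if line]
    # Every non-empty row should have the same number of tab-separated fields.
    widths = {len(line.split("\t")) for line in lines}
    if len(widths) > 1:
        problems.append("inconsistent column count")
    # The text separator should be empty, so stray quotes are suspicious.
    if any('"' in line for line in lines):
        problems.append("double quotes present")
    return problems


print(check_export("id\tname\n1\tterm one\n"))
print(check_export('"id"\t"name"\n1\n'))
```

A legitimate cell value may of course contain a double quote, so a flagged file should be inspected rather than rejected outright.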

Retrieve new experiments from ArrayExpress

  • Go to: bgee/extra/pipeline/curation/Affymetrix/get_new_experiments/
  • Retrieve xml files with all experiments for the different species

wget http://www.ebi.ac.uk/microarray-as/ae/xml/experiments?species=Homo+sapiens

  • Rename the file to "Homo_sapiens.xml"
  • Repeat the same procedure for the other species in Bgee
  • Check the xml files, they should be larger than the previous ones
  • svn update to get the last version of the file annotation.xls
  • From annotation.xls, copy and replace each file (present on the svn) "microarrayExperiment", "not_included" and "not_included_for_now" with the updated content. These 3 files should contain all experiments already included into Bgee
  • ./change_end_of_line.sh to change \r to \n in the files
  • perl parse_experiments_xml_new.pl Homo_sapiens.xml to create a file Homo_sapiens.out with new experiments. Some warnings can appear ("... element has non-unique value in 'name' key ..."), this is not important for us (as we can store only one name in Bgee).
  • Append this file to the tab "Homo sapiens" (where all experiments are listed) in annotation.xls
  • Repeat this for other species.
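The per-species repetition of the download and rename steps can be sketched as below. The base URL is the one from the wget command above; the species list is only an example, the helper names are hypothetical, and nothing is downloaded here. The line-ending helper is a stand-in for what change_end_of_line.sh is described as doing, with the added assumption that a "\r\n" pair becomes a single "\n".

```python
from urllib.parse import urlencode

# Base URL taken from the wget command above.
BASE_URL = "http://www.ebi.ac.uk/microarray-as/ae/xml/experiments"


def download_plan(species):
    """Return the (URL, target file name) pair for one species."""
    url = BASE_URL + "?" + urlencode({"species": species})
    filename = species.replace(" ", "_") + ".xml"
    return url, filename


def normalize_line_endings(text):
    """Turn \\r\\n and lone \\r into \\n (treating \\r\\n as one break is an assumption)."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


# Example species list, for illustration only.
for name in ["Homo sapiens", "Mus musculus"]:
    print(download_plan(name))
```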

Retrieve papers with expression data

TIP: Make sure that you have access to the UNIL VPN with Pulse Secure to have access to papers when not on campus.

What to search?

  • Start by searching the exact title of the study in SRA or GEO
  • Search for the species name AND its common name. Most searches use quotes to find exact phrases. Use quotes around the species name (e.g. "Oryzias latipes") to avoid finding papers on close species
  • Use RNA-Seq or other equivalent keywords
  • Search for the name of the institution which submitted the data to the SRA or the author if available
  • Search for the grant number of the study if available
  • Studies with similar protocols that were submitted in a relatively short time span may have been published by the same authors
  • Use logical operators (AND, OR) to link conditions with quotes

Looking for papers with in situ expression data: Search for hybridiZation AND hybridiSation. Use quotes to search the exact phrase "in situ hybridization" and avoid retrieving papers containing "in situ" in another context. Search for "in situ" placed before AND after hybridization.
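The search advice above (quoted species and common names, both spellings, "in situ" before and after the noun, linked with logical operators) can be sketched as a small query builder. The function names are hypothetical and the exact operator syntax accepted varies between search engines, so treat the output as a starting point.

```python
def in_situ_phrases():
    """Quoted phrase variants: US/UK spelling, 'in situ' before and after."""
    phrases = []
    for word in ["hybridization", "hybridisation"]:
        phrases.append(f'"in situ {word}"')
        phrases.append(f'"{word} in situ"')
    return phrases


def build_query(species_names):
    """Combine quoted species names and technique phrases with AND/OR."""
    species = " OR ".join(f'"{name}"' for name in species_names)
    technique = " OR ".join(in_situ_phrases())
    return f"({species}) AND ({technique})"


print(build_query(["Oryzias latipes", "medaka"]))
```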

Where to search?

"PubMed is a database of citations and abstracts for biomedical literature from MEDLINE and additional life science journals." PubMed only searches titles and abstracts, it does not allow to search articles full text. In situ hybridization is not always mentioned in the title or abstract, hence PubMed misses some relevant articles.

Possibility to export citations in summary (text), MEDLINE and csv formats.

Searches can be saved and e-mail alerts can be set up for saved searches.

PubMed search specificities: Terms in the search box are matched against a MeSH (Medical Subject Headings) translation table. If a match is found in this translation table, the term will be searched as MeSH (that includes the MeSH term and any specific terms indented under that term in the MeSH hierarchy), and in all fields.

Always check the MeSH terms that were matched to the searched term as it's not always a correct match for your search (e.g. stickleback matches the MeSH term Smegmamorpha [2], which is too general; hybridization matches the MeSH term "hybridization, genetic" so PubMed will search for the MeSH term as well as the word "genetic" and the exact phrase "genetic hybridization" in all fields, which is not relevant in the case of a search for in situ hybridization).

Always check the Search details box on a PubMed results page to verify how the searched terms were translated and what search exactly was run.

If the MeSH term is a correct match, search as a MeSH term (e.g. "In Situ Hybridization"[MeSH Terms] ) AND in all fields (e.g. "In Situ Hybridization"[All Fields]). Some papers have not been annotated with the MeSH terms but the title/abstract contains the searched term; some papers have been annotated with the MeSH term while the title/abstract does not contain the searched term.

By default PubMed also searches the terms indented under a term in the MeSH hierarchy. In Situ Hybridization, Fluorescence and Primed In Situ Labeling are found below the MeSH term In Situ Hybridization in the MeSH hierarchy but these terms are not relevant to our search. To search as a MeSH term without including the indented terms use the following syntax: "In Situ Hybridization"[Mesh:NoExp]
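Searching a term both as a MeSH term and in all fields can be written as one combined expression. The sketch below only assembles the query string from the field tags quoted above ([MeSH Terms], [Mesh:NoExp], [All Fields]); the helper name and the choice to join the two parts with OR are assumptions, and the result should still be verified in the Search details box.

```python
def pubmed_term(term, expand=False):
    """Build a PubMed expression matching a term as MeSH or in all fields.

    With expand=False the MeSH part uses [Mesh:NoExp], so terms indented
    under it in the MeSH hierarchy are not included.
    """
    mesh_tag = "MeSH Terms" if expand else "Mesh:NoExp"
    return f'("{term}"[{mesh_tag}] OR "{term}"[All Fields])'


print(pubmed_term("In Situ Hybridization"))
```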

Several archives of journals offer the option to search articles full text and can help retrieve papers that PubMed misses. The following archives have little overlap in the journals they search.

  • PubMed Central [3]

"PubMed Central (PMC) is the U.S. National Library of Medicine's digital archive of life sciences journal literature. PMC contains full-text manuscripts deposited by authors or articles provided by the publisher." PubMed Central has an option to search articles full text (Body -All words).

Possibility to search Figure/Table caption and Methods - Key terms but searching "in situ hybridization"[Methods - Key terms] retrieves no results.

Possibility to export citations in summary (text), MEDLINE and csv formats.

Searches can be saved and e-mail alerts can be set up for saved searches.

  • ScienceDirect [4]

Possibility to search full text (excluding references) with the following syntax in expert search: Full-Text (search term)

Possibility to search images (however it's impossible to export the results citations of an image search)

Possibility to export citations in RIS format and plain text. There is currently a bug in their system which only allows a maximum of 20 citations being exported, even when more citations are selected. This issue is scheduled to be fixed in the first release of this year, which will take place in March.

Searches can be saved and e-mail alerts can be set up for saved searches.

  • Springerlink [5]

Possibility to search full text

Possibility to export citations in csv format

Searches can be saved and e-mail alerts can be set up for saved searches.

  • Wiley Online Library [6]

Possibility to search full text

Possibility to export citations in RIS format or plain text. But at the moment the Select All box does not seem to work so it's impossible to export all citations unless they're checked one by one. I've contacted their support team about it.

Searches can be saved and e-mail alerts can be set up for saved searches.

Wiley itself converts between UK and American English: searches for hybridization and hybridisation retrieve the same number of results.

  • Highwire Press Stanford University [7]

Possibility to search full text (Anywhere in text)

Possibility to export citations in RIS format. However, there's no option to export all citations in one go, the only possibility is to check and export all citations displayed on the page.

It's not possible to save searches but e-mail alerts with the same options as searches can be set up.

  • Google Scholar [8]

Searches full text; orders papers by citations, but possibility to restrict to recent papers; not limited to real peer-reviewed articles, so some caution should be exercised, but very complete. Since it's Google, corrects for spelling mistakes and such (including US/UK spelling).

To know how the searches for each species on each archive of journals were done check the file "in_situ_paper_searches_for_bgee_species.docx" on the SVN.
