scRNA seq - BgeeDB/expression-annotations GitHub Wiki

Dedicated page for scRNA-seq expression data annotation.

Follow this link for general annotation guidelines in Bgee.

Introduction

scRNA-seq gives access to transcriptome heterogeneity at the level of individual cells, which makes it possible to describe cell types.

https://www.nature.com/articles/s41368-021-00146-0

scRNA-seq methods can be low throughput (also known as plate-based methods) or high throughput (also known as droplet-based methods), and may target only nuclei (single-nucleus RNA-seq) instead of whole cells. See this review for detailed information.

There are many protocols to generate scRNA-seq libraries (see here for a guide with essential information on 30 single-cell protocols), but all follow the same workflow:

  • cell isolation

  • library preparation

  • sequencing

scRNA-seq annotation

We annotate full-length (FL) and target-based (TB) scRNA-seq datasets in a common library annotation file, 'scRNASeqLibrary_merged.tsv'.

Full-length scRNA-seq libraries are single-cell libraries that each contain a single cell type. The cell type is either known a priori or can be defined more precisely a posteriori by clustering. Libraries from full-length experiments are annotated with their cell type mapping in the common annotation file 'scRNASeqLibrary_merged.tsv'.

Target-based scRNA-seq libraries are single-cell libraries that each contain more than one cell type. To report and further annotate each cell type, barcodes/UMIs linked to each individual cell are required. See here for detailed information about barcodes. The same barcode may report different cell types within the same experiment: this is normal, as the number of available barcodes is limited. Within a library, however, each barcode must correspond to a single cell type. Clustering information may be associated with a target-based experiment and help define precise cell types. Because barcodes/UMIs are linked to individual cells, target-based (TB) experiments also have a separate 'barcode' file, in which the cell type information is reported and mapped.
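
The uniqueness rule for barcodes can be checked mechanically. The following is a minimal sketch, assuming barcode rows have already been loaded as dictionaries keyed by the 'library', 'barcode' and 'cell_type' columns of the barcode file; the function name and the sample rows are hypothetical and only serve as illustration.

```python
from collections import defaultdict


def find_ambiguous_barcodes(rows):
    """Return {(library, barcode): sorted cell types} for every barcode
    that maps to more than one cell type inside the same library."""
    seen = defaultdict(set)
    for row in rows:
        seen[(row["library"], row["barcode"])].add(row["cell_type"])
    return {key: sorted(types) for key, types in seen.items() if len(types) > 1}


# Hypothetical rows, for illustration only.
rows = [
    {"library": "LIB1", "barcode": "AAAC", "cell_type": "neuron"},
    # Same barcode in another library of the experiment: allowed.
    {"library": "LIB2", "barcode": "AAAC", "cell_type": "hepatocyte"},
    # Same barcode, same library, different cell type: to be reviewed.
    {"library": "LIB2", "barcode": "AAAC", "cell_type": "T cell"},
]

for (library, barcode), cell_types in find_ambiguous_barcodes(rows).items():
    print(library, barcode, cell_types)
```

Keying on the (library, barcode) pair rather than on the barcode alone is what lets the same barcode appear with different cell types across libraries of one experiment.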

Identifying single-cell datasets to annotate

Cell isolation is the first step in the workflow and is crucial for determining whether a dataset actually matches the general Bgee rules of normality. As explained below, however, we had to reconsider our normality rules for scRNA-seq annotation, because many protocols use transgenic strains to facilitate cell isolation (even though cellular side effects, resulting from GFP expression for example, have been reported, as described in this paper).

  • The ideal isolation protocol uses antibody staining followed by FACS or MACS, which isolate cells from tissues in a rather mechanical way.
  • We may accept transgenic cells with constitutive reporter genes (e.g. GFP, YFP, tdTomato) if there is a large gain of interesting samples (see below on integrating data from large consortia).

Using DNA recombinant technology, scientists combine the Gfp gene to another gene that produces a protein that they want to study, and then they insert the complex into a cell. If the cell produces the green fluorescence, scientists infer that the cell expresses the target gene as well.

source: https://embryo.asu.edu/pages/green-fluorescent-protein

  • We aim to follow the protocols used by dedicated cell atlas/single-cell expression data consortia, such as the STAR protocol 'Isolation and RNA sequencing of single nuclei from Drosophila tissues' used in the Fly Cell Atlas (FCA). We nevertheless remain focused on our normality rules, so we report genotypes in order to allow further filtering of genotypes that are clearly far from the 'wild type' genotype.
  • We may therefore accept transgenic strains with driver lines such as the GAL4/UAS system; see this picture of the usual Drosophila cell staining protocol for cell isolation.
  • We do not accept induction lines: Cre-inducible protocols that involve injecting inducers such as tamoxifen to activate reporter genes are rejected.
  • We do not accept cells from culture or cell lines.
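
The acceptance rules above can be summarised as a small lookup table. This is only a sketch: the category labels ('antibody_sorting', 'driver_line', and so on) are hypothetical names a curator might assign while reading a paper, not values used in the Bgee annotation files, and the 'case by case' outcome stands for the "we may accept" rules.

```python
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"
    CASE_BY_CASE = "case by case"
    REJECT = "reject"


# Hypothetical curator-assigned labels mapped to the rules listed above.
ISOLATION_RULES = {
    "antibody_sorting": Decision.ACCEPT,             # antibody staining, then FACS/MACS
    "constitutive_reporter": Decision.CASE_BY_CASE,  # e.g. GFP, YFP
    "driver_line": Decision.CASE_BY_CASE,            # e.g. GAL4/UAS
    "inducible_line": Decision.REJECT,               # e.g. tamoxifen-induced Cre
    "cell_culture": Decision.REJECT,                 # cultured cells or cell lines
}


def triage(isolation_method):
    """Return the decision for a curator-assigned isolation label."""
    try:
        return ISOLATION_RULES[isolation_method]
    except KeyError:
        raise ValueError(f"unknown isolation method: {isolation_method!r}")


print(triage("driver_line").value)
```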

As part of the improvement of our annotation process, we report 'genotype' in the FL and TB annotation files in addition to the strain information.

Format of the annotation files

Table format for the scRNASeqLibrary_merged.tsv file

Column names:

  • libraryId
  • experimentId
  • platform
  • SRSId
  • anatId, the identifier of the anatomical structure used in the mapping, usually an UBERON ID
  • anatName, the name associated with the anatId
  • cellTypeId, the cell type ID to insert into the Bgee database. If infoCellType_inferred is filled, cellTypeId is the ID of the term in that column; otherwise it is the ID of the term in the infoCellType_abInitio column.
  • cellTypeName, the name of the cell type identified by cellTypeId
  • stageId
  • stageName
  • url_GSM
  • infoOrgan
  • infoCellType_abInitio, ab initio information from SRA: the cell type information provided by the authors before any original analysis that may refine or define the cell type (for instance clustering in full-length experiments, or barcode processing in target-based experiments)
  • infoCellType_inferred, the inferred (clustered) annotation from the paper, or sometimes from SRA
  • clusterId
  • clusterName
  • infoStage
  • anatAnnotationStatus
  • cellTypeAnnotationStatus
  • stageAnnotationStatus
  • sex
  • strain
  • genotype
  • speciesId
  • RNAseqTags
  • protocol
  • protocolType
  • lib_name
  • sampleTitle
  • comment
  • condition
  • annotatorId
  • lastModificationDate
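
As an illustration of how this column list can be used, the sketch below checks that a TSV header contains every listed column and applies the cellTypeId rule (inferred term if present, ab initio term otherwise). It assumes the file is tab-separated with a header row; the column order in the real file is not assumed, and the function names are hypothetical.

```python
import csv
import io

# Column names as listed above; their order in the real file is not assumed.
LIBRARY_COLUMNS = [
    "libraryId", "experimentId", "platform", "SRSId", "anatId", "anatName",
    "cellTypeId", "cellTypeName", "stageId", "stageName", "url_GSM",
    "infoOrgan", "infoCellType_abInitio", "infoCellType_inferred",
    "clusterId", "clusterName", "infoStage", "anatAnnotationStatus",
    "cellTypeAnnotationStatus", "stageAnnotationStatus", "sex", "strain",
    "genotype", "speciesId", "RNAseqTags", "protocol", "protocolType",
    "lib_name", "sampleTitle", "comment", "condition", "annotatorId",
    "lastModificationDate",
]


def missing_columns(tsv_text):
    """Return the expected column names that are absent from the TSV header."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    header = reader.fieldnames or []
    return [name for name in LIBRARY_COLUMNS if name not in header]


def cell_type_source(row):
    """Name the column whose term should be mapped to cellTypeId:
    the inferred annotation if it is filled, the ab initio one otherwise."""
    if (row.get("infoCellType_inferred") or "").strip():
        return "infoCellType_inferred"
    return "infoCellType_abInitio"


print(missing_columns("libraryId\texperimentId\n")[:3])
```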

Table format for the barcode file

Each target-based experiment has its library information reported in the scRNASeqLibrary_merged.tsv file, and also has a dedicated barcode file, 'scRNASeq_barcode_EXPID.tsv', where EXPID is the experiment identifier. This file is the final annotation file for a target-based experiment; it describes the cell type of the different cells present in each library.

  • barcode
  • cluster
  • library
  • experiment
  • tissue
  • cell_type
  • anatId_a_posteriori, the ID of the anatomical entity implied by the cell type, when it differs from the tissue reported in the library file
  • anatName_a_posteriori, the name of the anatomical entity identified by anatId_a_posteriori, if any
  • anat_a_posteriori_annotationStatus
  • cellTypeId
  • cellTypeName
  • cellTypeAnnotationStatus
  • name_Library
  • comments
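
A short sketch of how the barcode file can be located and summarised: it builds the file name from an experiment identifier following the 'scRNASeq_barcode_EXPID.tsv' pattern, and groups the mapped cell types by library. The experiment ID 'EXP1', the library names and the helper names are hypothetical; rows are assumed to be dictionaries keyed by the column names listed above.

```python
from collections import defaultdict


def barcode_file_name(experiment_id):
    """Build the barcode file name for one target-based experiment."""
    return f"scRNASeq_barcode_{experiment_id}.tsv"


def cell_types_per_library(rows):
    """Group the mapped cell type IDs of a barcode file by library."""
    per_library = defaultdict(set)
    for row in rows:
        per_library[row["library"]].add(row["cellTypeId"])
    return {library: sorted(ids) for library, ids in per_library.items()}


# Hypothetical rows, for illustration only.
rows = [
    {"barcode": "AAAC", "library": "LIB1", "cellTypeId": "CL:0002319"},
    {"barcode": "AAAG", "library": "LIB1", "cellTypeId": "CL:0002371"},
    {"barcode": "AAAC", "library": "LIB2", "cellTypeId": "CL:0002371"},
]

print(barcode_file_name("EXP1"))
print(cell_types_per_library(rows))
```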

How to map the cell description to the Cell Ontology

The annotation work itself consists of mapping the cell type reported by the authors to the classes available in the Cell Ontology, or possibly in the Provisional Cell Ontology.

Sometimes the final annotation reported by the authors is not a cell type but a tissue. In such a case, we can either consider the cell type as a 'somatic cell' and map it to CL:0002371 ! somatic cell (or as a 'germ cell', mapped to CL:0000586 ! germ cell, or even to CL:0000015 ! male germ cell/CL:0000021 ! female germ cell), or carefully specify the cell type a little further depending on the tissue. For example, 'liver' must still be reported as 'somatic cell', because liver tissue contains cell types other than 'hepatocyte'. For 'brain', however, we can map to CL:0002319 ! neural cell because of its broad definition: "A cell that is part of the nervous system."
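
The fallback described above can be sketched as a lookup with a conservative default. Only the 'brain' entry and the CL terms come from the text; the function name, the 'germline' flag and the idea of keeping the exceptions in a dictionary are assumptions made for illustration, and any new entry would need the same careful check as the 'liver' example.

```python
# Broad Cell Ontology terms named in the text: (ID, label).
SOMATIC_CELL = ("CL:0002371", "somatic cell")
GERM_CELL = ("CL:0000586", "germ cell")
NEURAL_CELL = ("CL:0002319", "neural cell")

# Tissues for which a more specific broad term is considered safe.
# Only 'brain' is taken from the text; 'liver' is deliberately absent
# because liver tissue contains cell types other than hepatocytes.
TISSUE_TO_BROAD_TERM = {"brain": NEURAL_CELL}


def fallback_cell_type(tissue, germline=False):
    """Return the (ID, label) to use when the authors report a tissue
    instead of a cell type. 'germline' is a hypothetical curator flag."""
    if germline:
        return GERM_CELL
    return TISSUE_TO_BROAD_TERM.get(tissue.strip().lower(), SOMATIC_CELL)


print(fallback_cell_type("liver"))
print(fallback_cell_type("brain"))
```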

CELLxGENE has different rules for annotating unknown cell types; see https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/4.0.0/schema.md#integration-metadata

Special cases for cell type annotation in barcode file:

  1. It may happen that the cell type information provided by the authors through barcodes refers to a cell type known so far to belong to a tissue different from the one reported in the annotation file: it could be a contaminant cell or a misdetected cell, the cell type may not have been detected in this anatomical structure before, or the cell type relationships in the CL ontology may need updating.

In such a case, we reannotate the tissue by filling in the columns 'anatId_a_posteriori' and 'anatName_a_posteriori' in the barcode file. Another solution is to report a new tag in the 'cellTypeAnnotationStatus' column, for example 'dubious cell type'.

  2. The cell types provided by the authors through barcodes can sometimes imply a more precise organ than the library annotation. In such a case, the 'anatId_a_posteriori' is simply a sub-part of the anatomical entity reported in the library annotation file. We reannotate the tissue by filling in the columns 'anatId_a_posteriori' and 'anatName_a_posteriori' in the barcode file.
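
One way a downstream script might read these columns is sketched below. It assumes, and this is not stated above, that a filled 'anatId_a_posteriori' takes precedence over the anatId of the library annotation file for that cell. The function name is hypothetical and the UBERON IDs (brain, cerebellum) are illustrative values only.

```python
def effective_anat_id(barcode_row, library_anat_id):
    """Return the anatomical entity ID to use for one cell.

    Assumption: a filled 'anatId_a_posteriori' overrides the anatId of
    the library annotation file; an empty or missing value falls back to it.
    """
    a_posteriori = (barcode_row.get("anatId_a_posteriori") or "").strip()
    return a_posteriori or library_anat_id


# Illustrative values: library annotated to brain, cell reannotated to cerebellum.
library_anat_id = "UBERON:0000955"
reannotated = {"barcode": "AAAC", "anatId_a_posteriori": "UBERON:0002037"}
unchanged = {"barcode": "AAAG", "anatId_a_posteriori": ""}

print(effective_anat_id(reannotated, library_anat_id))
print(effective_anat_id(unchanged, library_anat_id))
```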

Note that the Bgee pipeline only considers the cell types mapped in the barcode file when this file exists for an experiment (target-based experiments).
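
This precedence rule can be expressed in a few lines. The sketch below is not the Bgee pipeline code; it only illustrates the rule, assuming library and barcode annotations are available as lists of dictionaries with a 'cellTypeId' key, and that a missing barcode file is represented by None.

```python
def cell_types_for_experiment(library_rows, barcode_rows=None):
    """Return the sorted cell type IDs retained for one experiment.

    'barcode_rows' is None when the experiment has no barcode file
    (full-length experiment); otherwise only the barcode file is used.
    """
    source = library_rows if barcode_rows is None else barcode_rows
    return sorted({row["cellTypeId"] for row in source if row.get("cellTypeId")})


# Hypothetical annotations, for illustration only.
library_rows = [{"libraryId": "LIB1", "cellTypeId": "CL:0002371"}]
barcode_rows = [
    {"barcode": "AAAC", "library": "LIB1", "cellTypeId": "CL:0002319"},
    {"barcode": "AAAG", "library": "LIB1", "cellTypeId": "CL:0002319"},
]

print(cell_types_for_experiment(library_rows))
print(cell_types_for_experiment(library_rows, barcode_rows))
```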