Practical tips for annotation - BgeeDB/expression-annotations GitHub Wiki

This is a brief overview of the curation job at Bgee by a student assistant, reporting the issues she faced when starting biocuration and the tips she offers to beginners. Consider this page a quick-start guide to Bgee annotation. For full information, please refer to the Annotation Guideline and to the dedicated page for single-cell RNA-Seq annotation.

Practical tips

Before annotation

Do a pull on GitLab/SourceTree!

Where to find experiments to annotate?

  • From PubMed or Google Scholar, using appropriate keywords for the method/organism
  • From large portals such as Human Cell Atlas, or other databases
  • From the Slack channel "datasets_to_annotate"

Can I annotate this experiment?

  • Does it follow Bgee's normality criteria? Were tissues from healthy or control organisms sequenced in this experiment? If the disease did not concern the tissue sampled, could it still affect it (for example cancer)? If mutations were introduced in the organism to facilitate the study, are those acceptable (no induced Cre mutations or temperature-sensitive mutations for Drosophila)?
  • Is the sequencing protocol used acceptable? Check the list of experiments already annotated and confirm with the team before annotating experiments based on new protocols or sequencing platforms. For example, for scRNA-Seq, Bgee focuses on Smart-Seq and C1 full-length technologies and 10X Genomics target-based technologies. Some protocols such as VDJ libraries or tag-seq are also excluded.
  • For target-based scRNA-Seq experiments, can you find information on where the barcode annotations are located?

Reading the paper related to the experiment is always a good idea to confirm that it can be annotated, as the protocols of the studies are sometimes not accurately or fully reported on SRA.

Looking for barcodes?

Unfortunately, barcode - cell type annotations can be surprisingly hard to find for a lot of experiments. They can be provided in the form of tables, H5AD files, or R data (.rds). Here are some options when looking for them:

  • On GEO: If the experiment libraries are also in GEO, supplementary files are provided at the bottom of the page, often as a tar archive, and you can download them to check for barcodes. MTX files and the .tsv tables named genes.tsv and barcodes.tsv usually do not contain this information, but in rare cases the entries of the barcodes.tsv table may also contain the cell type.
  • In the paper's supplementary materials: Look through the tables provided in the paper's supplementary materials to see if one of them has the barcode annotations.
  • On a dedicated scRNA-Seq portal: If the experiment is on HCA, you may find a table with annotations on their website. Alternatively, if the paper mentions that the data can be visualized on cellxgene or UCSC Cell Browser, these platforms also provide barcode - cell type annotations to download for each experiment.
  • On GitHub or the authors' website: If the authors link a GitHub repository of the project in their paper, tables may be provided there. The same applies to any dedicated website.
  • On EBI: If the experiment is also on EBI, check the supplementary files that are provided.

Looking for barcodes can take a lot of time, but checking all the above options is worth it to maximize the number of experiments in the database. If you cannot find the barcodes of a scRNA-Seq experiment, remember to add it in the scRNASeqExperiment_not_included.tsv table along with information on where you looked.

During annotation

  • (Check if you have the most updated version of UBERON)
  • Annotate the experiment's libraries with all the information you can find
  • Do not limit yourself to information retrieved by the scripts
  • Cross-reference information from SRA and the paper to have as accurate annotations as possible
  • Use the paper to understand what author abbreviations for cell types mean
  • Leave as comments any information that could potentially influence the RNA-Seq contents of the libraries (for example how were the tissues obtained, biopsy, postmortem, how many hours?)
  • Remember to fill the condition column if needed and the whiteList column for scRNA-Seq target-based experiments
  • Use strategies such as filtering and sorting the annotation tables (see next section)
  • Ask for help if you are unsure about any step of the annotation process!

For scRNA-Seq experiments with barcodes or reclustering:

  • If necessary convert the barcode/reclustering information to table format (Use the H5AD script or R for rds/RData)
  • Check that the library names in your annotation table (column lib_names) are similar to those in the barcode/reclustering table
  • Check that there is the same number of library names in the barcode/reclustering table and the FL/TB annotation table
  • If not, either be very careful when using the scripts or (recommended) remove all lines in the barcode/reclustering tables concerning libraries not in the FL/TB annotation table.
  • If you need to reformat tables, remove lines, etc., record your process in a Jupyter Notebook or similar so that your annotations are reproducible
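
The comparison of library names described above can be sketched in Python. This is a minimal illustration, not one of the Bgee scripts: `lib_names` is the annotation column mentioned above, while the barcode table column `sample`, the `comment` column, and the in-memory tables are hypothetical placeholders to adapt to your own files.

```python
import csv
import io


def read_column(handle, column):
    """Return the set of non-empty values found in one column of a TSV table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[column].strip() for row in reader if row[column].strip()}


def compare_libraries(annotation_handle, barcode_handle,
                      annotation_col="lib_names", barcode_col="sample"):
    """Return (only_in_annotation, only_in_barcode) as sorted lists."""
    annotated = read_column(annotation_handle, annotation_col)
    barcoded = read_column(barcode_handle, barcode_col)
    return sorted(annotated - barcoded), sorted(barcoded - annotated)


# Tiny in-memory tables standing in for the real files (hypothetical content).
annotation_tsv = "lib_names\tcomment\nlibA\t\nlibB\t\n"
barcode_tsv = (
    "barcode\tsample\tcellTypeName\n"
    "AAAC\tlibA\tT cell\n"
    "AAAG\tlibC\tB cell\n"
)

missing, extra = compare_libraries(io.StringIO(annotation_tsv),
                                   io.StringIO(barcode_tsv))
print("In annotation table only:", missing)
print("In barcode table only:", extra)
```

With real files, replace the `io.StringIO` objects with `open(path, newline="", encoding="utf-8")`. Libraries listed under "In barcode table only" are the ones whose lines you may want to remove before running the scripts.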

After annotation

  • Check again that annotations are correct and nothing is missing
  • Use standard filters (see below) to confirm there are no spaces before or after ontology terms, as these can cause errors in the pipeline
  • Transfer library and experiment annotations to the respective cumulative tables
  • Copy the barcode tables to the GitLab repository folder on your PC
  • Commit and Push your annotations to GitLab!
  • Keep a back-up of the files you used during annotation in case you need to come back to them later

Technical tips

Opening annotation files in .tsv format

Choosing an application

  • LibreOffice on Mac or Linux
  • OpenOffice calc on Windows

Settings

  • Select and open your file
  • Encoding - character set: UTF-8/Unicode
  • Choose {tab} as field separator
  • The text separator should be "
  • Check the box: quoted field as text
  • Check the preview to confirm the file is opening correctly

How do I open a txt or tsv file as a spreadsheet in OpenOffice calc?

  • Open the OpenOffice app
  • Select the spreadsheet option
  • File > Open
  • Navigate to the appropriate folder
  • Select your files
  • Open the drop-down menu on the right, next to the file name
  • Choose the "Text CSV" option (your files are still selected even if they do not appear in the viewer window)
  • Select Open and choose the settings described above

Working with annotation files

Sort, filter and format

During annotation, to rapidly annotate hundreds or thousands of lines with the same ontology term, use the spreadsheet's sort options or Data > Filter > Autofilter on the columns of interest. After annotation, use the standard filter functionality to check, for each column containing ontology terms, that no spaces have been accidentally added before or after the term (this may happen in some columns when using the scRNA-Seq scripts).
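
As a complement to the spreadsheet filter, the whitespace check can also be scripted. The sketch below is only an illustration under assumptions: the column names `libraryId` and `anatId` and the sample rows are hypothetical, so substitute the ontology columns of your own table.

```python
import csv
import io


def find_padded_cells(handle, columns):
    """Yield (line_number, column, value) for cells with leading/trailing whitespace."""
    reader = csv.DictReader(handle, delimiter="\t", quotechar='"')
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        for column in columns:
            value = row.get(column) or ""
            if value != value.strip():
                yield line_number, column, value


# Hypothetical table: two of the three terms carry a stray space.
sample = (
    "libraryId\tanatId\n"
    "SRX1\tUBERON:0000955\n"
    "SRX2\t UBERON:0002107\n"
    "SRX3\tUBERON:0000948 \n"
)

problems = list(find_padded_cells(io.StringIO(sample), ["anatId"]))
for line_number, column, value in problems:
    print(f"line {line_number}, column {column}: {value!r}")
```

Printing with `!r` shows the quotes around each value, which makes the stray spaces visible. Line numbers assume one table row per line, that is, no quoted fields spanning several lines.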

Auto-formatting in OpenOffice may cause some unwanted changes to ontology terms (for example switching FBbt terms to Fbbt). You can disable Auto-formatting from Format > AutoFormat.

OpenOffice calc functions

Here are some useful functions you may need during annotation:

  • CONCATENATE: to combine the contents of two columns, create a new column and then use CONCATENATE(cell_of_column1_1; cell_of_column2_1). Also to quickly comment unwanted libraries: CONCATENATE("#"; cell_of_column_1)
  • SUBSTITUTE: to quickly uncomment libraries: SUBSTITUTE(cell_with_commented_SRX; "#"; "")
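
If you prefer to do the same outside the spreadsheet, for instance in the notebook where you record your process, the two operations can be mirrored in Python. This is a small sketch; the library IDs are placeholders, not real accessions.

```python
def comment_out(library_id):
    """Prefix a library ID with '#', like CONCATENATE("#"; cell)."""
    return "#" + library_id


def uncomment(library_id):
    """Remove every '#' from a library ID, like SUBSTITUTE(cell; "#"; "")."""
    return library_id.replace("#", "")


libraries = ["SRX0000001", "SRX0000002"]  # placeholder IDs
commented = [comment_out(lib) for lib in libraries]
restored = [uncomment(lib) for lib in commented]
print(commented)
print(restored)
```

Like SUBSTITUTE, `str.replace` removes every "#" in the value, not only a leading one.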

Using the terminal

These tips relate mostly to the Linux/bash terminal. Use Tab to auto-fill commands or file names.

Useful commands

  • cd: to change directory (if you are in the folder /home/user and want to go to the sub-folder my_folder: cd /home/user/my_folder, or cd my_folder)
  • ls: to see what is in your directory (ls -l to see the contents as a list)
  • pwd: to see where you are located
  • gunzip: to unzip .gz files (gunzip FILE_NAME)
  • tar -xf: to open tar archives (tar -xf FILE_NAME)
  • rm: to permanently remove files (rm FILE_NAME, also rm -r FOLDER_NAME to permanently remove a directory containing files; use with care)
  • mkdir: to create a new directory (mkdir FOLDER)
  • man: to see the help of a command (for example man pwd)
  • rmdir: to permanently remove empty directories (rmdir FOLDER_NAME)

Script shortcuts

The commands below assume that you keep your scripts in a "scripts" folder within a dedicated folder for annotations, and that you create experiment-specific annotation folders inside a "scRNA-Seq" folder.

  • Create annotation tables: python3 /home/user/my_annotations/scripts/create_scRNAseq_tables_v3.py SRXID /home/user/my_annotations/scRNA-Seq/SRXID

  • Create barcode tables: python3 /home/user/my_annotations/scripts/create_scRNASeq_barcode.py SRXID /home/user/my_annotations/scRNA-Seq/SRXID --scRNASeqLibrary /home/user/my_annotations/scRNA-Seq/SRXID/TB_scRNASeqLibrary_output.tsv --barcode_file /home/user/my_annotations/scRNA-Seq/SRXID/BARCODE_TABLE_NAME --colname_barcode BARCODE_COLUMN_NAME --colname_sample_name LIBRARY_COLUMN_NAME

(Remember: the column with the cell types must be named: cellTypeName in the barcode table)

  • Create barcode tables with reclustering: python3 /home/user/my_annotations/scripts/create_scRNASeq_barcode.py SRXID /home/user/my_annotations/scRNA-Seq/SRXID --scRNASeqLibrary /home/user/my_annotations/scRNA-Seq/SRXID/TB_scRNASeqLibrary_output.tsv --barcode_file /home/user/my_annotations/scRNA-Seq/SRXID/BARCODE_TABLE_NAME --colname_barcode BARCODE_COLUMN_NAME --colname_sample_name LIBRARY_COLUMN_NAME --cluster_cellType /home/user/my_annotations/scRNA-Seq/SRXID/CLUSTER_FILE_NAME

(In the cluster file, use 'clusterId' or 'clusterName' for the cluster column and 'cell_type' for the cell types)

  • Recluster full-length files: python3 /home/user/my_annotations/scripts/FullLengthLib_reclustering.py /home/user/my_annotations/scRNA-Seq/SRXID/FL_scRNASeqLibrary_output.tsv /home/user/my_annotations/scRNA-Seq/SRXID --fulllength_clustering /home/user/my_annotations/scRNA-Seq/SRXID/CLUSTER_FILE_NAME --colname_sample_name LIBRARY_COLUMN_NAME

(The column with the cell types must be named: cellTypeName in the reclustering table)

  • Create H5AD tables: python3 /home/user/my_annotations/scripts/h5ad_to_tsv.py /home/user/my_annotations/FOLDER_WITH_H5AD_FILES /home/user/my_annotations/FOLDER_TO_PUT_H5AD_TABLES
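
Because these commands repeat the same long paths, it can help to assemble them programmatically. The sketch below builds the barcode-table command shown above from its parts. It uses only the arguments listed on this page and assumes the same folder layout; the helper name `barcode_command` is hypothetical, and the placeholder values (SRXID, BARCODE_TABLE_NAME, etc.) must be replaced with your own.

```python
import shlex
from pathlib import PurePosixPath


def barcode_command(base_dir, experiment_id, barcode_table, barcode_col, library_col):
    """Build the argument list for create_scRNASeq_barcode.py for one experiment."""
    base = PurePosixPath(base_dir)
    experiment_dir = base / "scRNA-Seq" / experiment_id
    return [
        "python3",
        str(base / "scripts" / "create_scRNASeq_barcode.py"),
        experiment_id,
        str(experiment_dir),
        "--scRNASeqLibrary", str(experiment_dir / "TB_scRNASeqLibrary_output.tsv"),
        "--barcode_file", str(experiment_dir / barcode_table),
        "--colname_barcode", barcode_col,
        "--colname_sample_name", library_col,
    ]


command = barcode_command("/home/user/my_annotations", "SRXID",
                          "BARCODE_TABLE_NAME", "BARCODE_COLUMN_NAME",
                          "LIBRARY_COLUMN_NAME")
# Print the command so it can be checked or pasted into the terminal.
print(shlex.join(command))
```

The resulting list can also be passed directly to `subprocess.run(command, check=True)` to launch the script from Python.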

Solutions for Windows users

The scripts we currently use for annotations do not work in the Windows command prompt. The script that creates the annotation tables also does not work well in emulators because it relies on EDirect to get SRA information. As a Windows user, you have some alternatives to get a functional terminal:

  • Install a version of Linux on a virtual machine in your computer
  • Use conda

For the first option there are different tutorials: 1, 2. You will need to also download the VirtualBox Guest Additions and create a shared folder between your computer and the virtual machine. In any case, ask for the help of the "Centre Informatique" if it seems too complicated.