14. Genome Annotation with BRAKER - davidaray/Genomes-and-Genome-Evolution GitHub Wiki

Genome Annotation with BRAKER

For this work, you will use a different Singularity container that was built by the people who wrote the software you'll use.

Installation

To keep things organized, go to /lustre/scratch/[eraider]/gge2024/container/ and download the container with:

singularity build braker3.sif docker://teambraker/braker3:latest

This will take a while.

Lastly, the files you will use in this exercise are available from this directory. Do this.

mkdir -p /lustre/scratch/[eraider]/gge2024/braker
cd /lustre/scratch/[eraider]/gge2024/braker
cp -r /lustre/scratch/frcastel/gge2024/annotation_files/* .

If you ls into the directory you just copied, you will notice there are two fasta files and two directories.

GRCh83_chromosome17.fasta is a 81 Mb file of the chromosome 17 human reference assembly, whereas filtered_mammalia.fasta is a filtered version of the largest database of the Vertebrata dataset from OrthoDB. This file contains 3,835,124 proteins that we'll use to train the gene models.

We are using Chromosome 17 because:

Chromosome 17 is rich in protein-coding genes, having the second highest gene density in the genome. (Read more about it here).
The 2.8 Mb locus responsible for Miller-Dieker syndrome you learned about in exome sequencing, is located in this autosome. This paper has everything you want to know about this locus.
Abnormalities of chromosome 17 are important molecular genetic events in human breast cancers. (Read more about it here).
This chromosome is not so large that it will take many hours to annotate, and not too small so that you will train a good number of gene models to understand the complexities of analyzing this data.

What does 'train the gene models' mean? As discussed in class, we use information from genes we already know about to figure out what to look for when searching for genes in a new genome assembly. In fancier language, training gene models using BRAKER involves building statistical representations of gene structures based on known gene data, such as RNA-Seq alignments, protein alignments, or pre-annotated reference genomes. These statistical models are then used to predict genes in the genome of interest.

Usage

BRAKER3 runs in different modes. It can make use of the genome alone, genome + proteins, genome + RNA-Seq, or all of them. You can find the details of how each mode works here.

Annotating a genome assembly

As discussed in class, annotating a genome is a complex process that invokes several pieces of software, each one of them with specific tasks. BRAKER3 is a pipeline that automates this process so the code looks as simple as any other you have used during the semester. The script you will use is braker_human17.sh, and looks like this:

#!/bin/bash
#SBATCH --job-name=braker17
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --partition=nocona
#SBATCH --nodes=1
#SBATCH --ntasks=10
#SBATCH --mem=0

# Get the SIF file into the path
BRAKER_SIF=/lustre/scratch/[eraider]/gge2024/container/braker3.sif

# Create the working directory or delete it if it exists
WORKDIR=/lustre/scratch/[eraider]/gge2024/braker
BRAKER_OUT=chr_17

# What do you think the next four lines do and why would they be useful?
if [ -d $BRAKER_OUT ]; then
    rm -r $BRAKER_OUT; else
    mkdir $BRAKER_OUT
fi

# Run singularity
singularity exec -B "$WORKDIR":"$WORKDIR" "$BRAKER_SIF" braker.pl \
    --genome="$WORKDIR"/annotation_files/GRCh38_chromosome17.fasta \
    --prot_seq="$WORKDIR"/annotation_files/filtered_mammalia.fasta \
    --softmasking \
    --workingdir="$WORKDIR"/"$BRAKER_OUT" \
    --threads $SLURM_NTASKS \
    --gff3

Simple, right? Let's break it down. It is crucial that you understand what each of the arguments do, and are aware of other available options. You can find a large list of those arguments by running singularity exec -B "$WORKDIR":"$WORKDIR" "$BRAKER_SIF" braker.pl --help.

The INPUT FILE OPTIONS section contains all arguments related to the data you will give BRAKER for genome annotation. The first option is, of course, the genome assembly, which is specified with the --genome argument. The --bam option can be used if you had BAM files of RNA-Seq alignments, while --prot_seq if you have some proteins. The choice of proteins will make a huge impact on annotation. Although BRAKER does a fairly good job when using protein evidence of distantly related species, using proteins from closely related ones is generally preferred. The --hints argument is utilized when you have preliminary information about the intron structure of the species (or closely related ones), or if a gene model was trained with another software beforehand. The last arguments for data input are for raw RNA-Seq which are automatically downloaded from the SRA. I encountered some problems with those flags when I tried to use them, so we will not cover it in class.

The other sections of the help page contain arguments related to how you want the algorithm to behave. I reccommend you have a close look at all of them, especially if you are performing a genome annotation in non-model organisms. The arguments we are using from all of those options are --softmasking which indicates to the gene finders that we have marked repetitive regions in the DNA with lowercase letters (a,t,c,g); --threads to speed up the process using all cores available. I use $SLURM_NTASKS there because this environment variable takes the number that we provide in #SBATCH --ntasks=36 by default. Finally, --gff3 is a personal preference to get a GFF3 besides the GTF file that BRAKER gives as default. For me, it is more convenient to transform GFF3 files into various formats and extract relevant information.

This example took around 7 hours to complete, so make sure you start working on it soon! Imagine how long it would take for a full mammalian genome if it takes this long for a single, relatively small chromosome.

If you are wondering whether or not BRAKER3 completed its job succesfully, besides the braker.gtf, braker.aa and braker.codingseq files you will find in your directory, you can also open the err file and will notice in the last lines the following:

[Wed Nov 22 05:57:01 2023] Finished spliced alignment
[Wed Nov 22 05:57:02 2023] Flagging top chains
[Wed Nov 22 05:57:03 2023] Processing the output
[Wed Nov 22 05:57:07 2023] Output processed
[Wed Nov 22 05:57:07 2023] ProtHint finished.

Obtaining annotation stats

You should now have some files in the chr_17 directory for the annotation of this chromosome: coding sequences .codingseq, proteins .aa, and the gene structures prediction information as .gff3 and .gtf. How can you know whether or not the annotation is good? We will use some files in the scripts directory to answer this question.

Let's do a quick exploration of the annotation with:

python3 scripts/analyze_exons.py -f chr_17/braker.gtf

You will see something like:

Number of transcripts: 3901
Largest number of exons in all transcripts: 64
Monoexonic transcripts: 839
Multiexonic transcripts: 3062
Mono:Mult Ratio: 0.27
Boxplot of number of exons per transcript:
Min: 1
25%: 2
50%: 3
75%: 5
Max: 64

This script gives a brief outline of the exon stats of the predicted genes. There are 839 transcripts with only one exon and 3062 with multiple, and per every multiexonic gene, 0.27 are monoexonic. Also, there is one gene with 64 exons!

To get a more detailed report, run:

python3 scripts/predictionReport.py chr_17/braker.gtf chr_17/hintsfile.gff chr17_prediction.pdf

Now, with the help of this report and your knowledge on the UNIX terminal, answer the questions below.

For you to do

Once the genome annotation has completed, use the skills you have acquired in this course to answer the questions below. All these questions can be answered with the commands you have learned in this course. If you need a refresher, go to Tutorial 1, Logging In to HPCC and an Intro to Bash and Linux Navigation.

Answer the questions below and upload your results to Blackboard under 14. Gene Annotation.

How many genes were annotated in the human reference assembly of chromosome 17?
Based on your answer to the previous question, what is the gene density of this scaffold? That is, gene number/genome size. In this case, this scaffold is 83,257,441 bp long. Use some googling to figure out what the gene density of human chromosome 1 is. How do the two chromosomes compare in this regard?
Are all of them protein coding genes? You can compare the braker.aa and braker.gff3 files to answer this question. Explain the rationale behind your yes or no answer.
Because of time limits we can't work on the complex tasks associated with functional annotation of the genes, but you can still get an idea of what the annotated genes may do by comparing to known genes. Use the braker.aa file and randomly pick a protein sequence (DO NOT use any of the first few first genes listed, explore the file a bit more. Maybe go in a few hundred or a few thousand lines). You can use BLASTp to search for homologous proteins in human and other organisms. Use the default parameters for the BLAST search, so just copy the protein sequence and let the web-tool do its work. From our experience, though, it's better to use the InterProScan tool to search for protein domains. Use the same protein you used for the BLAST search and compare both results. If you were to report this gene in a paper, what would you say about it? How many exons and introns does it have? What's the length of that gene? What's its most likely function? Please attach a screenshot of the NCBI and InterPro searches, and anything else you used to come up with your answer, including the name of the gene and the position of the genome it is located in.
The tasks you completed using BRAKER trained the software only on protein data. In an earlier run, we used a novel software called TIBERIUS designed exclusively for mammalian genomes. It uses a Deep Learning approach for gene prediction and has been benchmarked as superior to BRAKER3 in its quality of annotation. The results of that analysis are included in the directory you copied in the beginning of this tutorial inside the folder tiberius. Are the annotations derived from TIBERIUS comparable to those resulting from the combination of genome, and proteins that your used with BRAKER? Which of both would you trust more? Why? Write a detailed response putting together the number of genes recovered with each method, gene density and everything you have answered in questions 1-4. Is the gene you functionally annotated with BRAKER found in tiberius.gtf? HINT: There is no gff3 file for the Tiberius annotation so, to get an estimate of gene counts in the gtf file make sure you use grep -w "gene" to obtain the right gene counts.

Sources and further reading

Understanding what is happening behind the scenes is fundamental to get the big picture of genome annotation. I highly recommend you read the BRAKER3 paper to get a better grasp of the process. You should also read the BRAKER3 GitHub repository to familiarize a lot more with its utilities. Even better, here's a great lecture by one the authors of BRAKER3, in which most of the details of their software are covered. Additionally, this page has tons of information and details about how to prepare protein evidence, how to evaluate the annotation, and much more that we couldn't cover in class.

Eventhough the gene training is done automatically by BRAKER3, you can get a lot more control of the process if you run it using Augustus first, you can venture yourself in trying to get some evidence a priori, to incorporate into BRAKER3 later on. This tutorial is straightforward, although it won't show you how to install it, and that can be a long process most of the times. You can read the AUGUSTUS paper and the GitHub repository.