Genome Annotation - Lavadav/EPP531_AGA GitHub Wiki

RepeatModeler and RepeatMasker

Dependencies Needed

Prerequisites: Perl 5.8.0 or higher, Python 3.
Databases: Dfam, Repbase
Softwares: Tandem Repeat Finder (TRF), RECON, RMBlast, GenomeTools, LTR_retriever, MAFFT, CD-HIT, NINJA
Search Engine: We will use BLAST.
Core Softwares: RepeatMasker, RepeatModeler

Input Data: Redbud Genome

ln -s /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/Syri/Redbud_Genome_Hap2.fasta .

1. Building Database

# Load the right Perl
spack load /ajwoixl

/pickett_shared/software/RepeatModeler-2.0.3/BuildDatabase -name Redbud -engine ncbi Redbud_Genome_Hap2.fasta

2. RepeatModeler

/pickett_shared/software/RepeatModeler-2.0.3/RepeatModeler -pa 3 -engine ncbi -database Redbud 2>&1 | tee 00_Redbud_repeatmodeler.log

3. Merge All the repeat libraries

cat /pickett_shared/software/RepeatMasker/Libraries/eudicotyledons-rm.fa /pickett_shared/software/RepeatMasker/Libraries/RMRB.fasta Path_to/Redbud-families.fa > Redbud_totalRepeatLib.fa

4. RepeatMasker

#Mask our genome

/pickett_shared/software/RepeatMasker/RepeatMasker \
-lib Redbud_totalRepeatLib.fa \
-e rmblast \
-pa 3 \
-nolow \
-xsmall \
-gff \
Redbud_Genome_Hap2.fasta \
>& Redbud_1.0.0_RMasker.out

5. Download RNAseq Data from NCBI

Make a SRR-accession list .txt file

SRR10584713
SRR21411472
SRR21411471

Now lets download the data from NCBI

spack load sratoolkit

for i in $(cat srr_accessions.txt);
do
    prefetch $i && fasterq-dump $i
done

Merge all the fastq files in their respective pairs and compress them.

6. STAR index the masked genome

Copy/softlink your masked genome to current directory

spack load star

STAR \
        --runMode genomeGenerate \
        --genomeDir Hap1 \
        --genomeSAindexNbases 13 \
        --genomeFastaFiles Redbud_Ragtag_Salsa_Hap2.masked.fasta \
        --runThreadN 3

7. STAR Mapping RNAseq Data

STAR \
        --genomeDir Hap1 \
        --readFilesIn Redbud_rnaseq_1.fastq.gz Redbud_rnaseq_2.fastq.gz \
        --readFilesCommand zcat \
        --outFileNamePrefix Redbud_Hap1-rna_ \
        --outSAMtype BAM SortedByCoordinate \
        --outSAMstrandField intronMotif \
        --limitBAMsortRAM 107374182400 \
        --runThreadN 10 \
>& star_hap1.out

8. BRAKER

Input files for BRAKER3

cp /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/Syri/Redbud_Ragtag_Salsa_Hap2.masked.fasta .
cp /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/Syri/Redbud_Hap1-rna_Aligned.sortedByCoord.out.bam .

Download the orthoDB protein database for plants.

wget https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/Viridiplantae.fa.gz

#gunzip
gunzip Viridiplantae.fa.gz

Set the path for BRAKER and AUGUSTUS config files

export BRAKER_SIF=/sphinx_local/images/braker3_latest.sif
export AUGUSTUS_CONFIG_PATH=~/miniconda3/envs/busco/config
echo $AUGUSTUS_CONFIG_PATH

Set path for AUGUSTUS config file in singularity interactive shell

singularity shell -B $PWD $BRAKER_SIF
export AUGUSTUS_CONFIG_PATH=~/miniconda3/envs/busco/config
echo $AUGUSTUS_CONFIG_PATH

#Exit the interactive shell
Ctrl + D

Make a new directory

mkdir braker_hap1

Script for running BBRAKER

singularity exec -B $PWD /sphinx_local/images/braker3_latest.sif braker.pl --genome=Redbud_Ragtag_Salsa_Hap2.masked.fasta \
--bam=Redbud_Hap1-rna_Aligned.sortedByCoord.out.bam \
--prot_seq=Viridiplantae.fa \
--workingdir=braker_hap1 \
--threads 5 \
--useexisting \
--gff3 \
--AUGUSTUS_CONFIG_PATH $AUGUSTUS_CONFIG_PATH \
--species=Ccanadensis

Check the stats on gff3 file

cat braker.gff3 | awk '{a[$3]++}END{for(k in a){print k,a[k]}}'

Homework

Rub BUSCO on the protein fasta file.

9. EnTAP

The Eukaryotic Non-Model Transcriptome Annotation Pipeline (EnTAP) is designed to improve the accuracy, speed, and flexibility of functional gene annotation for de novo assembled transcriptomes in non-model eukaryotes. EnTAP Documentation

Rename the BRAKER protein file

mv braker.aa Ccanadensis_protein_hap1.fasta

Softlink the protein file to EnTAP directory

ln -s path_to/Ccanadensis_protein_hap1.fasta .

Load the required dependencies

spack load diamond
spack load rsem
spack load interproscan
spack load transdecoder

Run EnTAP

/sphinx_local/software/EnTAP-1.0.0/bin/EnTAP \
--runP \
-i Ccanadensis_protein_hap1.fasta \
--ini /sphinx_local/software/EnTAP-1.0.0/entap_config_Oct2023.ini \
-d /sphinx_local/software/EnTAP-1.0.0/bin/uniprot_sprot.dmnd \
-t 5