Genome Annotation - Lavadav/EPP531_AGA GitHub Wiki

RepeatModeler and RepeatMasker

Dependencies Needed

  1. Prerequisites: Perl 5.8.0 or higher, Python 3.
  2. Databases: Dfam, Repbase
  3. Softwares: Tandem Repeat Finder (TRF), RECON, RMBlast, GenomeTools, LTR_retriever, MAFFT, CD-HIT, NINJA
  4. Search Engine: We will use BLAST.
  5. Core Softwares: RepeatMasker, RepeatModeler

Input Data: Redbud Genome

ln -s /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/Syri/Redbud_Genome_Hap2.fasta .

1. Building Database

# Load the right Perl
spack load /ajwoixl

/pickett_shared/software/RepeatModeler-2.0.3/BuildDatabase -name Redbud -engine ncbi Redbud_Genome_Hap2.fasta

2. RepeatModeler

/pickett_shared/software/RepeatModeler-2.0.3/RepeatModeler -pa 3 -engine ncbi -database Redbud 2>&1 | tee 00_Redbud_repeatmodeler.log

3. Merge All the repeat libraries

cat /pickett_shared/software/RepeatMasker/Libraries/eudicotyledons-rm.fa /pickett_shared/software/RepeatMasker/Libraries/RMRB.fasta Path_to/Redbud-families.fa > Redbud_totalRepeatLib.fa

4. RepeatMasker

#Mask our genome

/pickett_shared/software/RepeatMasker/RepeatMasker \
-lib Redbud_totalRepeatLib.fa \
-e rmblast \
-pa 3 \
-nolow \
-xsmall \
-gff \
Redbud_Genome_Hap2.fasta \
>& Redbud_1.0.0_RMasker.out

5. Download RNAseq Data from NCBI

Make a SRR-accession list .txt file

SRR10584713
SRR21411472
SRR21411471

Now lets download the data from NCBI

spack load sratoolkit

for i in $(cat srr_accessions.txt);
do
    prefetch $i && fasterq-dump $i
done

Merge all the fastq files in their respective pairs and compress them.

6. STAR index the masked genome

Copy/softlink your masked genome to current directory

spack load star

STAR \
        --runMode genomeGenerate \
        --genomeDir Hap1 \
        --genomeSAindexNbases 13 \
        --genomeFastaFiles Redbud_Ragtag_Salsa_Hap2.masked.fasta \
        --runThreadN 3

7. STAR Mapping RNAseq Data

STAR \
        --genomeDir Hap1 \
        --readFilesIn Redbud_rnaseq_1.fastq.gz Redbud_rnaseq_2.fastq.gz \
        --readFilesCommand zcat \
        --outFileNamePrefix Redbud_Hap1-rna_ \
        --outSAMtype BAM SortedByCoordinate \
        --outSAMstrandField intronMotif \
        --limitBAMsortRAM 107374182400 \
        --runThreadN 10 \
>& star_hap1.out

8. BRAKER

  • Input files for BRAKER3
cp /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/Syri/Redbud_Ragtag_Salsa_Hap2.masked.fasta .
cp /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/Syri/Redbud_Hap1-rna_Aligned.sortedByCoord.out.bam .
  • Download the orthoDB protein database for plants.
wget https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/Viridiplantae.fa.gz

#gunzip
gunzip Viridiplantae.fa.gz
  • Set the path for BRAKER and AUGUSTUS config files
export BRAKER_SIF=/sphinx_local/images/braker3_latest.sif
export AUGUSTUS_CONFIG_PATH=~/miniconda3/envs/busco/config
echo $AUGUSTUS_CONFIG_PATH
  • Set path for AUGUSTUS config file in singularity interactive shell
singularity shell -B $PWD $BRAKER_SIF
export AUGUSTUS_CONFIG_PATH=~/miniconda3/envs/busco/config
echo $AUGUSTUS_CONFIG_PATH

#Exit the interactive shell
Ctrl + D
  • Make a new directory
mkdir braker_hap1
  • Script for running BBRAKER
singularity exec -B $PWD /sphinx_local/images/braker3_latest.sif braker.pl --genome=Redbud_Ragtag_Salsa_Hap2.masked.fasta \
--bam=Redbud_Hap1-rna_Aligned.sortedByCoord.out.bam \
--prot_seq=Viridiplantae.fa \
--workingdir=braker_hap1 \
--threads 5 \
--useexisting \
--gff3 \
--AUGUSTUS_CONFIG_PATH $AUGUSTUS_CONFIG_PATH \
--species=Ccanadensis
  • Check the stats on gff3 file
cat braker.gff3 | awk '{a[$3]++}END{for(k in a){print k,a[k]}}'

Homework

  • Rub BUSCO on the protein fasta file.

9. EnTAP

The Eukaryotic Non-Model Transcriptome Annotation Pipeline (EnTAP) is designed to improve the accuracy, speed, and flexibility of functional gene annotation for de novo assembled transcriptomes in non-model eukaryotes. EnTAP Documentation

  • Rename the BRAKER protein file
mv braker.aa Ccanadensis_protein_hap1.fasta
  • Softlink the protein file to EnTAP directory
ln -s path_to/Ccanadensis_protein_hap1.fasta .
  • Load the required dependencies
spack load diamond
spack load rsem
spack load interproscan
spack load transdecoder
  • Run EnTAP
/sphinx_local/software/EnTAP-1.0.0/bin/EnTAP \
--runP \
-i Ccanadensis_protein_hap1.fasta \
--ini /sphinx_local/software/EnTAP-1.0.0/entap_config_Oct2023.ini \
-d /sphinx_local/software/EnTAP-1.0.0/bin/uniprot_sprot.dmnd \
-t 5