Genome Annotation - heelsplitter/Grootmyers_EPP_531_Applied_Genome_Analytics GitHub Wiki

RepeatModeler and RepeatMasker

Dependencies Needed

Prerequisites: Perl 5.8.0 or higher, Python 3.

Databases: Dfam, Repbase

Softwares: Tandem Repeat Finder (TRF), RECON, RMBlast, GenomeTools, LTR_retriever, MAFFT, CD-HIT, NINJA

Search Engine: We will use BLAST.

Core Softwares: RepeatMasker, RepeatModeler

Input Data: Redbud Genome

cd /pickett_sphinx/projects/EPP531_AGA/dgrootmy
mkdir repeatmodeler
cd repeatmodeler
source ~/.bashrc
ln -s /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/Syri/Redbud_Genome_Hap2.fasta .
  1. Building Database
# Load the right Perl
spack load /ajwoixl
#Did not run this portion personally.
/pickett_shared/software/RepeatModeler-2.0.3/BuildDatabase -name Redbud -engine ncbi Redbud_Genome_Hap2.fasta
  1. RepeatModeler
/pickett_shared/software/RepeatModeler-2.0.3/RepeatModeler -pa 3 -engine ncbi -database Redbud 2>&1 | tee 00_Redbud_re
#Did not run this portion personally.
  1. Merge All the repeat libraries
screen -S repeatmasker
cd /pickett_sphinx/projects/EPP531_AGA/dgrootmy
mkdir repeatmasker
cd repeatmasker
cp -r /pickett_sphinx/projects/EPP531_AGA/hkaur3/genome_annotation-redbud .
cd genome_annotation-redbud
nano Redbud_totalRepeatLib.fa
cat /pickett_shared/software/RepeatMasker/Libraries/eudicotyledons-rm.fa /pickett_shared/software/RepeatMasker/Libraries/RMRB.fasta Path_to/Redbud-families.fa > Redbud_totalRepeatLib.fa
#I am running this portion personally.
  1. RepeatMasker
#Mask our genome
/pickett_shared/software/RepeatMasker/RepeatMasker \
-lib Redbud_totalRepeatLib.fa \
-e rmblast \
-pa 3 \
-nolow \
-xsmall \
-gff \
Redbud_Genome_Hap2.fasta \
>& Redbud_1.0.0_RMasker.out
#Ran this part personally

#chmod 777 /pickett_sphinx/projects/EPP531_AGA/dgrootmy/repeatmasker/Repeatmasker_out

  1. Download RNAseq Data from NCBI

Make an SRR-accession list .txt file

screen -S mapping
cd /pickett_sphinx/projects/EPP531_AGA/dgrootmy
mkdir mapping
cd mapping
source ~/.bashrc
nano SRA.txt
#Entered the following without hashes:
#ERR706845
#SRR1909127
#SRR1909126
#SRR957672

Now lets download the data from NCBI

spack load sratoolkit
vdb-config --interactive
for i in $(cat SRA.txt);
do
    prefetch $i && fasterq-dump $i
done

Merge all the fastq files in their respective pairs and compress them.

#skip merging
gzip ERR706845_1.fastq
gzip ERR706845_2.fastq
gzip SRR1909127*
gzip SRR1909126*
gzip SRR957672*
  1. STAR index the masked genome

Copy/softlink your masked genome to current directory

screen -S STAR
cd /pickett_sphinx/projects/EPP531_AGA/dgrootmy/mapping
source ~/.bashrc
cp /pickett_sphinx/projects/EPP531_AGA/dgrootmy/repeatmasker/Repeatmasker_out/Redbud_Genome_Hap2.fasta.masked .
spack load star

STAR \
        --runMode genomeGenerate \
        --genomeDir Hap1 \
        --genomeSAindexNbases 13 \
        --genomeFastaFiles Redbud_Genome_Hap2.fasta.masked \
        --runThreadN 3
  1. STAR Mapping RNAseq Data

#AS GROUP

STAR \
        --genomeDir Hap1 \
        --readFilesIn Redbud_rnaseq_1.fastq.gz Redbud_rnaseq_2.fastq.gz \
        --readFilesCommand zcat \
        --outFileNamePrefix Redbud_Hap1-rna_ \
        --outSAMtype BAM SortedByCoordinate \
        --outSAMstrandField intronMotif \
        --limitBAMsortRAM 107374182400 \
        --runThreadN 10 \
>& star_hap1.out

8. BRAKER

Input files for BRAKER3

##/sphinx_local/images/braker3_latest.sif
cd /pickett_sphinx/projects/EPP531_AGA/dgrootmy
mkdir braker_04_22
cd braker_04_22
screen -S braker_04_22
cp /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/Syri/Redbud_Ragtag_Salsa_Hap2.masked.fasta .
cp /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/Syri/Redbud_Hap1-rna_Aligned.sortedByCoord.out.bam .
cp /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/Syri/Viridiplantae.fa .

#export BRAKER_SIF=/sphinx_local/images/braker3_latest.sif
#export AUGUSTUS_CONFIG_PATH=~/augustus_data
#mkdir braker_hap1
#cd braker_hap1

Download the orthoDB protein database for plants.

screen -r -d braker_04_22
wget https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/Viridiplantae.fa.gz

#gunzip
gunzip Viridiplantae.fa.gz

Set the path for BRAKER and AUGUSTUS config files

export BRAKER_SIF=/sphinx_local/images/braker3_latest.sif
export AUGUSTUS_CONFIG_PATH=/home/dgrootmy/miniconda3/envs/busco/config
echo $AUGUSTUS_CONFIG_PATH

Set path for AUGUSTUS config file in singularity interactive shell

singularity shell -B $PWD $BRAKER_SIF
export AUGUSTUS_CONFIG_PATH=/home/dgrootmy/miniconda3/envs/busco/config
echo $AUGUSTUS_CONFIG_PATH

#Exit the interactive shell
Ctrl + D

Make a new directory

cd braker_04_22/
mkdir braker_hap1
cd braker_hap1

Script for running BBRAKER

I did not run this portion myself

singularity exec -B $PWD /sphinx_local/images/braker3_latest.sif braker.pl --genome=Redbud_Ragtag_Salsa_Hap2.masked.fasta \
--bam=Redbud_Hap1-rna_Aligned.sortedByCoord.out.bam \
--prot_seq=Viridiplantae.fa \
--workingdir=braker_hap1 \
--threads 5 \
--useexisting \
--gff3 \
--AUGUSTUS_CONFIG_PATH $AUGUSTUS_CONFIG_PATH \
--species=Ccanadensis

Check the stats on gff3 file

cat braker.gff3 | awk '{a[$3]++}END{for(k in a){print k,a[k]}}'

Homework

Run BUSCO on the protein fasta file.

...

Earlier step that gave an error message

#singularity exec -B $PWD /sphinx_local/images/braker3_latest.sif braker.pl --genome=Redbud_Ragtag_Salsa_Hap2.masked.fasta \
#--bam=Redbud_Hap1-rna_Aligned.sortedByCoord.out.bam \
#--prot_seq=Viridiplantae.fa \
#--workingdir=braker_hap1 \
#--threads 5 \
#--useexisting \
#--gff3 \
#--AUGUSTUS_CONFIG_PATH $AUGUSTUS_CONFIG_PATH \
#--species=Ccanadensis

9. EnTAP

The Eukaryotic Non-Model Transcriptome Annotation Pipeline (EnTAP) is designed to improve the accuracy, speed, and flexibility of functional gene annotation for de novo assembled transcriptomes in non-model eukaryotes. EnTAP Documentation

Rename the BRAKER protein file

screen -S entap
cd /pickett_sphinx/projects/EPP531_AGA/dgrootmy
mkdir entap
cd entap
source ~/.bashrc
cp /pickett_sphinx/projects/EPP531_AGA/hkaur3/genome_annotation-redbud/braker/braker_hap1/braker.aa .
mv braker.aa Ccanadensis_protein_hap1.fasta
#spack load diamond
spack load /ndn4ntw
spack load rsem
spack load interproscan
spack load transdecoder

Run EnTAP

/sphinx_local/software/EnTAP-1.0.0/bin/EnTAP \
--runP \
-i Ccanadensis_protein_hap1.fasta \
--ini /sphinx_local/software/EnTAP-1.0.0/entap_config_Oct2023.ini \
-d /sphinx_local/software/EnTAP-1.0.0/bin/uniprot_sprot.dmnd \
-t 5