Genome Annotation - heelsplitter/Grootmyers_EPP_531_Applied_Genome_Analytics GitHub Wiki
RepeatModeler and RepeatMasker
Dependencies Needed
Prerequisites: Perl 5.8.0 or higher, Python 3.
Softwares: Tandem Repeat Finder (TRF), RECON, RMBlast, GenomeTools, LTR_retriever, MAFFT, CD-HIT, NINJA
Search Engine: We will use BLAST.
Core Softwares: RepeatMasker, RepeatModeler
Input Data: Redbud Genome
cd /pickett_sphinx/projects/EPP531_AGA/dgrootmy
mkdir repeatmodeler
cd repeatmodeler
source ~/.bashrc
ln -s /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/Syri/Redbud_Genome_Hap2.fasta .
- Building Database
# Load the right Perl
spack load /ajwoixl
#Did not run this portion personally.
/pickett_shared/software/RepeatModeler-2.0.3/BuildDatabase -name Redbud -engine ncbi Redbud_Genome_Hap2.fasta
- RepeatModeler
/pickett_shared/software/RepeatModeler-2.0.3/RepeatModeler -pa 3 -engine ncbi -database Redbud 2>&1 | tee 00_Redbud_re
#Did not run this portion personally.
- Merge All the repeat libraries
screen -S repeatmasker
cd /pickett_sphinx/projects/EPP531_AGA/dgrootmy
mkdir repeatmasker
cd repeatmasker
cp -r /pickett_sphinx/projects/EPP531_AGA/hkaur3/genome_annotation-redbud .
cd genome_annotation-redbud
nano Redbud_totalRepeatLib.fa
cat /pickett_shared/software/RepeatMasker/Libraries/eudicotyledons-rm.fa /pickett_shared/software/RepeatMasker/Libraries/RMRB.fasta Path_to/Redbud-families.fa > Redbud_totalRepeatLib.fa
#I am running this portion personally.
- RepeatMasker
#Mask our genome
/pickett_shared/software/RepeatMasker/RepeatMasker \
-lib Redbud_totalRepeatLib.fa \
-e rmblast \
-pa 3 \
-nolow \
-xsmall \
-gff \
Redbud_Genome_Hap2.fasta \
>& Redbud_1.0.0_RMasker.out
#Ran this part personally
#chmod 777 /pickett_sphinx/projects/EPP531_AGA/dgrootmy/repeatmasker/Repeatmasker_out
- Download RNAseq Data from NCBI
Make an SRR-accession list .txt file
screen -S mapping
cd /pickett_sphinx/projects/EPP531_AGA/dgrootmy
mkdir mapping
cd mapping
source ~/.bashrc
nano SRA.txt
#Entered the following without hashes:
#ERR706845
#SRR1909127
#SRR1909126
#SRR957672
Now lets download the data from NCBI
spack load sratoolkit
vdb-config --interactive
for i in $(cat SRA.txt);
do
prefetch $i && fasterq-dump $i
done
Merge all the fastq files in their respective pairs and compress them.
#skip merging
gzip ERR706845_1.fastq
gzip ERR706845_2.fastq
gzip SRR1909127*
gzip SRR1909126*
gzip SRR957672*
- STAR index the masked genome
Copy/softlink your masked genome to current directory
screen -S STAR
cd /pickett_sphinx/projects/EPP531_AGA/dgrootmy/mapping
source ~/.bashrc
cp /pickett_sphinx/projects/EPP531_AGA/dgrootmy/repeatmasker/Repeatmasker_out/Redbud_Genome_Hap2.fasta.masked .
spack load star
STAR \
--runMode genomeGenerate \
--genomeDir Hap1 \
--genomeSAindexNbases 13 \
--genomeFastaFiles Redbud_Genome_Hap2.fasta.masked \
--runThreadN 3
- STAR Mapping RNAseq Data
#AS GROUP
STAR \
--genomeDir Hap1 \
--readFilesIn Redbud_rnaseq_1.fastq.gz Redbud_rnaseq_2.fastq.gz \
--readFilesCommand zcat \
--outFileNamePrefix Redbud_Hap1-rna_ \
--outSAMtype BAM SortedByCoordinate \
--outSAMstrandField intronMotif \
--limitBAMsortRAM 107374182400 \
--runThreadN 10 \
>& star_hap1.out
8. BRAKER
Input files for BRAKER3
##/sphinx_local/images/braker3_latest.sif
cd /pickett_sphinx/projects/EPP531_AGA/dgrootmy
mkdir braker_04_22
cd braker_04_22
screen -S braker_04_22
cp /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/Syri/Redbud_Ragtag_Salsa_Hap2.masked.fasta .
cp /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/Syri/Redbud_Hap1-rna_Aligned.sortedByCoord.out.bam .
cp /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/Syri/Viridiplantae.fa .
#export BRAKER_SIF=/sphinx_local/images/braker3_latest.sif
#export AUGUSTUS_CONFIG_PATH=~/augustus_data
#mkdir braker_hap1
#cd braker_hap1
Download the orthoDB protein database for plants.
screen -r -d braker_04_22
wget https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/Viridiplantae.fa.gz
#gunzip
gunzip Viridiplantae.fa.gz
Set the path for BRAKER and AUGUSTUS config files
export BRAKER_SIF=/sphinx_local/images/braker3_latest.sif
export AUGUSTUS_CONFIG_PATH=/home/dgrootmy/miniconda3/envs/busco/config
echo $AUGUSTUS_CONFIG_PATH
Set path for AUGUSTUS config file in singularity interactive shell
singularity shell -B $PWD $BRAKER_SIF
export AUGUSTUS_CONFIG_PATH=/home/dgrootmy/miniconda3/envs/busco/config
echo $AUGUSTUS_CONFIG_PATH
#Exit the interactive shell
Ctrl + D
Make a new directory
cd braker_04_22/
mkdir braker_hap1
cd braker_hap1
Script for running BBRAKER
I did not run this portion myself
singularity exec -B $PWD /sphinx_local/images/braker3_latest.sif braker.pl --genome=Redbud_Ragtag_Salsa_Hap2.masked.fasta \
--bam=Redbud_Hap1-rna_Aligned.sortedByCoord.out.bam \
--prot_seq=Viridiplantae.fa \
--workingdir=braker_hap1 \
--threads 5 \
--useexisting \
--gff3 \
--AUGUSTUS_CONFIG_PATH $AUGUSTUS_CONFIG_PATH \
--species=Ccanadensis
Check the stats on gff3 file
cat braker.gff3 | awk '{a[$3]++}END{for(k in a){print k,a[k]}}'
Homework
Run BUSCO on the protein fasta file.
...
Earlier step that gave an error message
#singularity exec -B $PWD /sphinx_local/images/braker3_latest.sif braker.pl --genome=Redbud_Ragtag_Salsa_Hap2.masked.fasta \
#--bam=Redbud_Hap1-rna_Aligned.sortedByCoord.out.bam \
#--prot_seq=Viridiplantae.fa \
#--workingdir=braker_hap1 \
#--threads 5 \
#--useexisting \
#--gff3 \
#--AUGUSTUS_CONFIG_PATH $AUGUSTUS_CONFIG_PATH \
#--species=Ccanadensis
9. EnTAP
The Eukaryotic Non-Model Transcriptome Annotation Pipeline (EnTAP) is designed to improve the accuracy, speed, and flexibility of functional gene annotation for de novo assembled transcriptomes in non-model eukaryotes. EnTAP Documentation
Rename the BRAKER protein file
screen -S entap
cd /pickett_sphinx/projects/EPP531_AGA/dgrootmy
mkdir entap
cd entap
source ~/.bashrc
cp /pickett_sphinx/projects/EPP531_AGA/hkaur3/genome_annotation-redbud/braker/braker_hap1/braker.aa .
mv braker.aa Ccanadensis_protein_hap1.fasta
#spack load diamond
spack load /ndn4ntw
spack load rsem
spack load interproscan
spack load transdecoder
Run EnTAP
/sphinx_local/software/EnTAP-1.0.0/bin/EnTAP \
--runP \
-i Ccanadensis_protein_hap1.fasta \
--ini /sphinx_local/software/EnTAP-1.0.0/entap_config_Oct2023.ini \
-d /sphinx_local/software/EnTAP-1.0.0/bin/uniprot_sprot.dmnd \
-t 5