Genome Annotation - avince10/vincent_EPP531 GitHub Wiki

Input Data: Redbud Genome

ln -s /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/Syri/Redbud_Genome_Hap2.fasta .

  1. Building Database

Load the right Perl

spack load /ajwoixl

/pickett_shared/software/RepeatModeler-2.0.3/BuildDatabase -name Redbud -engine ncbi Redbud_Genome_Hap2.fasta

  2. RepeatModeler

/pickett_shared/software/RepeatModeler-2.0.3/RepeatModeler -pa 3 -engine ncbi -database Redbud 2>&1 | tee 00_Redbud_repeatmodeler.log

  3. Merge all the repeat libraries

cat /pickett_shared/software/RepeatMasker/Libraries/eudicotyledons-rm.fa /pickett_shared/software/RepeatMasker/Libraries/RMRB.fasta Path_to/Redbud-families.fa > Redbud_totalRepeatLib.fa

  4. RepeatMasker

# Mask our genome

/pickett_shared/software/RepeatMasker/RepeatMasker \
-lib Redbud_totalRepeatLib.fa \
-e rmblast \
-pa 3 \
-nolow \
-xsmall \
-gff \
Redbud_Genome_Hap2.fasta \
&> Redbud_1.0.0_RMasker.out
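Because `-xsmall` soft-masks repeats as lowercase rather than replacing them with N, a quick sanity check is to measure how much of the masked FASTA is lowercase. The following is a minimal sketch, not part of the original workflow; the function name `softmasked_percent` is hypothetical.

```shell
# Hypothetical helper: percent of lowercase (soft-masked) bases in a FASTA file.
softmasked_percent() {
    awk '!/^>/ {
             total += length($0)              # count every base on sequence lines
             masked += gsub(/[acgtn]/, "")    # gsub returns how many lowercase bases it removed
         }
         END {
             if (total > 0) printf "%.2f\n", 100 * masked / total
             else print "0.00"
         }' "$1"
}
```

Usage, assuming RepeatMasker wrote its masked sequence next to the input with a `.masked` suffix: `softmasked_percent Redbud_Genome_Hap2.fasta.masked`.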

  5. Download RNAseq Data from NCBI

Make an SRR accession list .txt file

nano srr_accessions.txt

SRR957672
SRR1909126
SRR1909127

Save and exit with Ctrl+X, then Y, then Enter.
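As a non-interactive alternative to the editor, the same accession list can be written with a here-document. This is only a sketch of an equivalent way to create the file; the accessions are the three listed above.

```shell
# Write one accession per line; the quoted 'EOF' prevents any shell expansion.
cat > srr_accessions.txt <<'EOF'
SRR957672
SRR1909126
SRR1909127
EOF
```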

Now let's download the data from NCBI

spack load sratoolkit

for i in $(cat srr_accessions.txt); do prefetch $i && fasterq-dump $i; done

Merge all the fastq files in their respective pairs and compress them.

cat SRR1909126.fastq SRR1909127.fastq SRR957672.fastq > redbudmerged.fastq

gzip redbudmerged.fastq
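The merge command above concatenates one file per run, while the STAR mapping step expects two files, Redbud_rnaseq_1.fastq.gz and Redbud_rnaseq_2.fastq.gz. If the runs are paired-end, fasterq-dump typically writes `<accession>_1.fastq` and `<accession>_2.fastq`. Under that assumption, a per-mate merge could be sketched as follows; `merge_pairs` is a hypothetical helper, and whether these particular runs are paired is not stated in this page.

```shell
# Hypothetical helper: concatenate mate 1 and mate 2 files separately,
# in the same accession order for both mates, then gzip each result.
merge_pairs() {
    local list="$1" prefix="$2" acc mate
    for mate in 1 2; do
        : > "${prefix}_${mate}.fastq"                  # start from an empty output file
        while read -r acc || [ -n "$acc" ]; do         # also handles a last line without newline
            [ -z "$acc" ] && continue                  # skip blank lines
            cat "${acc}_${mate}.fastq" >> "${prefix}_${mate}.fastq" || return 1
        done < "$list"
        gzip -f "${prefix}_${mate}.fastq"
    done
}
```

Usage: `merge_pairs srr_accessions.txt Redbud_rnaseq` would produce Redbud_rnaseq_1.fastq.gz and Redbud_rnaseq_2.fastq.gz. Keeping the accession order identical for both mates matters because STAR pairs reads by their position in the two files.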

  6. STAR index the masked genome

Copy/softlink your masked genome to the current directory

spack load star

STAR \
--runMode genomeGenerate \
--genomeDir Hap1 \
--genomeSAindexNbases 13 \
--genomeFastaFiles Redbud_Ragtag_Salsa_Hap2.masked.fasta \
--runThreadN 3
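The STAR manual recommends scaling `--genomeSAindexNbases` down for smaller genomes to min(14, log2(GenomeLength)/2 - 1). The sketch below computes that value from a FASTA file so the setting can be checked against the assembly; the function name `star_sa_index_nbases` is hypothetical.

```shell
# Hypothetical helper: suggested --genomeSAindexNbases = min(14, log2(genome length)/2 - 1).
star_sa_index_nbases() {
    awk '!/^>/ { len += length($0) }               # total bases across all sequences
         END {
             if (len == 0) { print 0; exit }
             n = int(log(len) / log(2) / 2 - 1)    # int() truncates toward zero
             if (n > 14) n = 14
             print n
         }' "$1"
}
```

Usage: `star_sa_index_nbases Redbud_Ragtag_Salsa_Hap2.masked.fasta`. By this formula, a value of 13 corresponds to assemblies between roughly 2^28 and 2^30 bases.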

  7. STAR Mapping RNAseq Data

STAR \
--genomeDir Hap1 \
--readFilesIn Redbud_rnaseq_1.fastq.gz Redbud_rnaseq_2.fastq.gz \
--readFilesCommand zcat \
--outFileNamePrefix Redbud_Hap1-rna_ \
--outSAMtype BAM SortedByCoordinate \
--outSAMstrandField intronMotif \
--limitBAMsortRAM 107374182400 \
--runThreadN 10 \
&> star_hap1.out
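After mapping, STAR writes a summary named with the output prefix followed by `Log.final.out`. Checking the uniquely mapped percentage there is a quick way to judge whether the alignment is usable as BRAKER evidence. The sketch below assumes the summary contains a line of the form `Uniquely mapped reads % | 89.50%`; the function name `star_unique_rate` is hypothetical.

```shell
# Hypothetical helper: print the uniquely mapped read percentage from a STAR Log.final.out file.
star_unique_rate() {
    awk -F'|' '/Uniquely mapped reads %/ {
                   gsub(/[ \t%]/, "", $2)   # strip padding and the percent sign from the value
                   print $2
               }' "$1"
}
```

Usage: `star_unique_rate Redbud_Hap1-rna_Log.final.out`.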

  8. BRAKER

Input files for BRAKER3

cp /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/Syri/Redbud_Ragtag_Salsa_Hap2.masked.fasta .
cp /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/Syri/Redbud_Hap1-rna_Aligned.sortedByCoord.out.bam .

Download the orthoDB protein database for plants.

wget https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/Viridiplantae.fa.gz

# gunzip
gunzip -d Viridiplantae.fa.gz

Set the path for BRAKER and AUGUSTUS config files

export BRAKER_SIF=/sphinx_local/images/braker3_latest.sif
export AUGUSTUS_CONFIG_PATH=/home/avince10/miniconda3/envs/busco/config
echo $AUGUSTUS_CONFIG_PATH

Set the path for the AUGUSTUS config file in a Singularity interactive shell

singularity shell -B $PWD $BRAKER_SIF
export AUGUSTUS_CONFIG_PATH=/home/avince10/miniconda3/envs/busco/config
echo $AUGUSTUS_CONFIG_PATH

#Exit the interactive shell

Ctrl + D
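BRAKER trains AUGUSTUS and writes the new species parameters under the `species` folder of `AUGUSTUS_CONFIG_PATH`, so that directory needs to exist and be writable before the run. The following is a minimal pre-flight sketch; the function name `check_augustus_config` is hypothetical.

```shell
# Hypothetical helper: confirm the AUGUSTUS config directory has a writable species/ folder.
check_augustus_config() {
    local dir="${1:-$AUGUSTUS_CONFIG_PATH}"    # default to the exported variable
    [ -d "$dir/species" ] || { echo "missing species/ under: $dir" >&2; return 1; }
    [ -w "$dir/species" ] || { echo "species/ is not writable: $dir" >&2; return 1; }
    echo "ok: $dir"
}
```

Usage: `check_augustus_config` with no argument checks the currently exported `AUGUSTUS_CONFIG_PATH`.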

Make a new directory

mkdir braker_hap1

Script for running BRAKER

singularity exec -B $PWD /sphinx_local/images/braker3_latest.sif braker.pl \
--genome=Redbud_Ragtag_Salsa_Hap2.masked.fasta \
--bam=Redbud_Hap1-rna_Aligned.sortedByCoord.out.bam \
--prot_seq=Viridiplantae.fa \
--workingdir=braker_hap1 \
--threads 5 \
--useexisting \
--gff3 \
--AUGUSTUS_CONFIG_PATH $AUGUSTUS_CONFIG_PATH \
--species=Ccanadensis

Check the stats on the gff3 file

cat braker.gff3 | awk '{a[$3]++}END{for(k in a){print k,a[k]}}'
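The one-liner above also counts comment and directive lines, because it splits on any whitespace and does not skip lines starting with `#`. A slightly stricter variant, sketched here with the hypothetical name `gff3_feature_counts`, splits on tabs only, ignores comment lines, and sorts the output so runs are easy to compare.

```shell
# Hypothetical helper: count GFF3 feature types (column 3), ignoring comments and malformed lines.
gff3_feature_counts() {
    awk -F'\t' '!/^#/ && NF >= 9 { n[$3]++ }            # only well-formed 9-column records
                END { for (k in n) print k "\t" n[k] }' "$1" | sort
}
```

Usage: `gff3_feature_counts braker.gff3`.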

Homework: Run BUSCO on the protein FASTA file.

  9. EnTAP

The Eukaryotic Non-Model Transcriptome Annotation Pipeline (EnTAP) is designed to improve the accuracy, speed, and flexibility of functional gene annotation for de novo assembled transcriptomes in non-model eukaryotes. See the EnTAP Documentation.

Rename the BRAKER protein file

mv braker.aa Ccanadensis_protein_hap1.fasta

Softlink the protein file to the EnTAP directory

ln -s path_to/Ccanadensis_protein_hap1.fasta .

ln -s /pickett_sphinx/projects/EPP531_AGA/avince10/braker/braker_hap1/braker_hap1/Ccanadensis_protein_hap1.fasta .
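Before starting EnTAP, it can help to confirm that the linked protein file resolves and holds the expected number of sequences, for example by comparing it with the mRNA count from the gff3 stats. The sketch below is only an illustration; the function name `protein_fasta_summary` is hypothetical.

```shell
# Hypothetical helper: report sequence count and mean length (in characters) of a protein FASTA.
protein_fasta_summary() {
    awk '/^>/ { n++; next }              # header lines mark a new sequence
         { aa += length($0) }            # every other line contributes residue characters
         END { printf "sequences=%d mean_length=%.1f\n", n, (n ? aa / n : 0) }' "$1"
}
```

Usage: `protein_fasta_summary Ccanadensis_protein_hap1.fasta`. Any stop symbols such as `*` are counted as characters in this sketch.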

Load the required dependencies

spack load diamond
spack load diamond @2.0.4
spack load rsem
spack load interproscan
spack load transdecoder

Run EnTAP

/sphinx_local/software/EnTAP-1.0.0/bin/EnTAP \
--runP \
-i Ccanadensis_protein_hap1.fasta \
--ini /sphinx_local/software/EnTAP-1.0.0/entap_config_Oct2023.ini \
-d /sphinx_local/software/EnTAP-1.0.0/bin/uniprot_sprot.dmnd \
-t 5