Module 3 Lab 1: Annotation of Redbud Genome - jacksonhturner/epp_531 GitHub Wiki
This code attempts to describe the annotation of the redbud genome.
Create the analysis directory.
mkdir RepeatModeler
Soft link the redbud genome into the analysis directory.
ln -s /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/Syri/Redbud_Genome_Hap2.fasta .
Use RepeatModeler to build a database for annotation.
spack load /ajwoixl
/pickett_shared/software/RepeatModeler-2.0.3/BuildDatabase -name Redbud -engine ncbi Redbud_Genome_Hap2.fasta
Run RepeatModeler. Since I'm running the -pa
flag as 3, it will use 12 since it is using 4 cores each.
/pickett_shared/software/RepeatModeler-2.0.3/RepeatModeler -pa 3 -engine ncbi -database Redbud 2>&1 | tee 00_Redbud_repeatmodeler.log
Create a repeat library from the RepeatModeler results.
cat /pickett_shared/software/RepeatMasker/Libraries/eudicotyledons-rm.fa /pickett_shared/software/RepeatMasker/Libraries/RMRB.fasta /pickett_sphinx/projects/EPP531_AGA/turner/M3Lab1/RepeatModeler/Redbud_totalRepeatLib.fa > Redbud_totalRepeatLib.fa
Use RepeatMasker to mask the assembly.
/pickett_shared/software/RepeatMasker/RepeatMasker \
-lib Redbud_totalRepeatLib.fa \
-e rmblast \
-pa 3 \
-nolow \
-xsmall \
-gff \
/pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/Syri/Redbud_Genome_Hap2.fasta \
>& Redbud_1.0.0_RMasker.out
If you run nolow, it'll look at low similarity. If masked genome, you can replace masked regions with Ns or replace it with lowercase. If genome has combination of upper case and lower case letters, probably masked.
There's a divergence parameter that you can add that will affect the masker. This affects the degree of which the assembly is masked. If your genome is well studied, then it's better to play with this.
RNA-Seq and STAR
Create accession numbers to download RNA-Seq data. If you need to download more than 500, it's best to use multiple files.
nano rnaseq_data.txt
We'll be using transcriptomic reads for redbud with the following accession numbers: ERR706845, SRR1909127, SRR1909126, SRR957672. It's important to note which accessions contain single-end and paired-end reads. Populate the text file with the following, then save and exit:
ERR706845
SRR1909127
SRR1909126
SRR957672
Merge paired end files into one file (and single end files into one file too) and gzip it.
cat ERR706845_1.fastq >> paired_end_reads_R1.fastq; cat ERR706845_2.fastq >> paired_end_reads_R2.fastq
gzip paired_end_reads*.fastq
cat SRR1909126.fastq >> single_end_reads.fastq; cat SRR1909127.fastq >> single_end_reads.fastq; cat SRR957672.fastq >> single_end_reads.fastq
gzip single_end_reads.fastq
Soft link the masked genome assembly file into this directory and run STAR to index it.
ln -s /pickett_sphinx/projects/EPP531_AGA/turner/M3Lab1/RepeatMasker/Redbud_Genome_Hap2.fasta.masked .
STAR \
--runMode genomeGenerate \
--genomeDir Hap2 \
--genomeFastaFiles Redbud_Genome_Hap2.fasta.masked \
--runThreadN 3
Run STAR to map RNA-Seq data to masked genome assembly.
STAR \
--genomeDir Hap2 \
--readFilesIn paired_end_reads.fastq.gz \
--readFilesCommand zcat \
--outFileNamePrefix Redbud_Hap1-rna_single_end \
--outSAMtype BAM SortedByCoordinate \
--outSAMstrandField intronMotif \
--limitBAMsortRAM 107374182400 \
--runThreadN 10 \
>& star_hap2.out
BRAKER
Use BRAKER, an automated pipeline, to annotate masked genome. First, create analysis directory and navigate to it.
mkdir /pickett_sphinx/projects/EPP531_AGA/turner/M3Lab1/Braker
cd /pickett_sphinx/projects/EPP531_AGA/turner/M3Lab1/Braker
Copy braker input files into directory.
cp /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/Syri/Redbud_Ragtag_Salsa_Hap2.masked.fasta .
cp /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/Syri/Redbud_Hap1-rna_Aligned.sortedByCoord.out.bam .
Download the orthoDB database for plants.
wget https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/Viridiplantae.fa.gz
#gunzip
gunzip Viridiplantae.fa.gz
Set a path for braker and augustus config files per the following:
export BRAKER_SIF=/sphinx_local/images/braker3_latest.sif
export AUGUSTUS_CONFIG_PATH=/home/jturne88/miniconda3/envs/busco/config
echo $AUGUSTUS_CONFIG_PATH
Set a path for augustus config file in interactive singularity shell.
singularity shell -B $PWD $BRAKER_SIF
export AUGUSTUS_CONFIG_PATH=/home/jturne88/miniconda3/envs/busco/config
echo $AUGUSTUS_CONFIG_PATH
#Exit the interactive shell
Ctrl + D
Make directory for braker output.
mkdir braker_hap1
Create a shell script to run braker.
nano run_braker2.sh
#Fill it in with this:
singularity exec -B $PWD /sphinx_local/images/braker3_latest.sif braker.pl --genome=Redbud_Ragtag_Salsa_Hap2.masked.fasta \
--bam=Redbud_Hap1-rna_Aligned.sortedByCoord.out.bam \
--prot_seq=Viridiplantae.fa \
--workingdir=braker_hap1 \
--threads 5 \
--useexisting \
--gff3 \
--AUGUSTUS_CONFIG_PATH $AUGUSTUS_CONFIG_PATH \
--species=Ccanadensis
Run braker.
run_braker2.sh
Braker provided a braker.aa
protein file. We'll run BUSCO on this file to check it for completeness with the following:
cd /pickett_sphinx/projects/EPP531_AGA/turner/M3Lab1/Braker/braker_hap1/
conda activate busco
busco --mode proteins -i braker.aa -o braker_aa_busco -l embryophyta -c 4
Upon running BUSCO, the output is the following:
--------------------------------------------------
|Results from dataset embryophyta_odb10 |
--------------------------------------------------
|C:72.5%[S:65.9%,D:6.6%],F:2.1%,M:25.4%,n:1614 |
|1171 Complete BUSCOs (C) |
|1064 Complete and single-copy BUSCOs (S) |
|107 Complete and duplicated BUSCOs (D) |
|34 Fragmented BUSCOs (F) |
|409 Missing BUSCOs (M) |
|1614 Total BUSCO groups searched |
--------------------------------------------------
EnTAP
Make and enter analysis directory for EnTAP.
mkdir /pickett_sphinx/projects/EPP531_AGA/turner/M3Lab1/EnTAP
cd /pickett_sphinx/projects/EPP531_AGA/turner/M3Lab1/EnTAP
Copy the braker protein file and enter it into the new directory and rename it.
cp /pickett_sphinx/projects/EPP531_AGA/turner/M3Lab1/Braker/braker_hap1/braker_aa_busco .
mv braker.aa Ccanadensis_protein_hap1.fasta
Load necessary dependencies.
spack load h7fm4ww #diamond
spack load rsem
spack load interproscan
spack load transdecoder
Run EnTAP.
/sphinx_local/software/EnTAP-1.0.0/bin/EnTAP \
--runP \
-i Ccanadensis_protein_hap1.fasta \
--ini /sphinx_local/software/EnTAP-1.0.0/entap_config_Oct2023.ini \
-d /sphinx_local/software/EnTAP-1.0.0/bin/uniprot_sprot.dmnd \
-t 5