Module 3 Lab 1: Annotation of Redbud Genome - jacksonhturner/epp_531 GitHub Wiki

This code attempts to describe the annotation of the redbud genome.

Create the analysis directory.

mkdir RepeatModeler

Soft link the redbud genome into the analysis directory.

ln -s /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/Syri/Redbud_Genome_Hap2.fasta .

Use RepeatModeler to build a database for annotation.

spack load /ajwoixl

/pickett_shared/software/RepeatModeler-2.0.3/BuildDatabase -name Redbud -engine ncbi Redbud_Genome_Hap2.fasta

Run RepeatModeler. Since I'm running the -pa flag as 3, it will use 12 since it is using 4 cores each.

/pickett_shared/software/RepeatModeler-2.0.3/RepeatModeler -pa 3 -engine ncbi -database Redbud 2>&1 | tee 00_Redbud_repeatmodeler.log

Create a repeat library from the RepeatModeler results.

cat /pickett_shared/software/RepeatMasker/Libraries/eudicotyledons-rm.fa /pickett_shared/software/RepeatMasker/Libraries/RMRB.fasta /pickett_sphinx/projects/EPP531_AGA/turner/M3Lab1/RepeatModeler/Redbud_totalRepeatLib.fa > Redbud_totalRepeatLib.fa

Use RepeatMasker to mask the assembly.

/pickett_shared/software/RepeatMasker/RepeatMasker \
-lib Redbud_totalRepeatLib.fa \
-e rmblast \
-pa 3 \
-nolow \
-xsmall \
-gff \
/pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/Syri/Redbud_Genome_Hap2.fasta \
>& Redbud_1.0.0_RMasker.out

If you run nolow, it'll look at low similarity. If masked genome, you can replace masked regions with Ns or replace it with lowercase. If genome has combination of upper case and lower case letters, probably masked.

There's a divergence parameter that you can add that will affect the masker. This affects the degree of which the assembly is masked. If your genome is well studied, then it's better to play with this.

RNA-Seq and STAR

Create accession numbers to download RNA-Seq data. If you need to download more than 500, it's best to use multiple files.

nano rnaseq_data.txt

We'll be using transcriptomic reads for redbud with the following accession numbers: ERR706845, SRR1909127, SRR1909126, SRR957672. It's important to note which accessions contain single-end and paired-end reads. Populate the text file with the following, then save and exit:

ERR706845
SRR1909127
SRR1909126
SRR957672

Merge paired end files into one file (and single end files into one file too) and gzip it.

cat ERR706845_1.fastq >> paired_end_reads_R1.fastq; cat ERR706845_2.fastq >> paired_end_reads_R2.fastq
gzip paired_end_reads*.fastq

cat SRR1909126.fastq >> single_end_reads.fastq; cat SRR1909127.fastq >> single_end_reads.fastq; cat SRR957672.fastq >> single_end_reads.fastq
gzip single_end_reads.fastq

Soft link the masked genome assembly file into this directory and run STAR to index it.

ln -s /pickett_sphinx/projects/EPP531_AGA/turner/M3Lab1/RepeatMasker/Redbud_Genome_Hap2.fasta.masked .

STAR \
        --runMode genomeGenerate \
        --genomeDir Hap2 \
        --genomeFastaFiles Redbud_Genome_Hap2.fasta.masked \
        --runThreadN 3

Run STAR to map RNA-Seq data to masked genome assembly.

STAR \
        --genomeDir Hap2 \
        --readFilesIn paired_end_reads.fastq.gz \
        --readFilesCommand zcat \
        --outFileNamePrefix Redbud_Hap1-rna_single_end \
        --outSAMtype BAM SortedByCoordinate \
        --outSAMstrandField intronMotif \
        --limitBAMsortRAM 107374182400 \
        --runThreadN 10 \
>& star_hap2.out

BRAKER

Use BRAKER, an automated pipeline, to annotate masked genome. First, create analysis directory and navigate to it.

mkdir /pickett_sphinx/projects/EPP531_AGA/turner/M3Lab1/Braker
cd /pickett_sphinx/projects/EPP531_AGA/turner/M3Lab1/Braker

Copy braker input files into directory.

cp /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/Syri/Redbud_Ragtag_Salsa_Hap2.masked.fasta .
cp /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/Syri/Redbud_Hap1-rna_Aligned.sortedByCoord.out.bam .

Download the orthoDB database for plants.

wget https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/Viridiplantae.fa.gz

#gunzip
gunzip Viridiplantae.fa.gz

Set a path for braker and augustus config files per the following:

export BRAKER_SIF=/sphinx_local/images/braker3_latest.sif
export AUGUSTUS_CONFIG_PATH=/home/jturne88/miniconda3/envs/busco/config
echo $AUGUSTUS_CONFIG_PATH

Set a path for augustus config file in interactive singularity shell.

singularity shell -B $PWD $BRAKER_SIF
export AUGUSTUS_CONFIG_PATH=/home/jturne88/miniconda3/envs/busco/config
echo $AUGUSTUS_CONFIG_PATH

#Exit the interactive shell
Ctrl + D

Make directory for braker output.

mkdir braker_hap1

Create a shell script to run braker.

nano run_braker2.sh
#Fill it in with this:
singularity exec -B $PWD /sphinx_local/images/braker3_latest.sif braker.pl --genome=Redbud_Ragtag_Salsa_Hap2.masked.fasta \
--bam=Redbud_Hap1-rna_Aligned.sortedByCoord.out.bam \
--prot_seq=Viridiplantae.fa \
--workingdir=braker_hap1 \
--threads 5 \
--useexisting \
--gff3 \
--AUGUSTUS_CONFIG_PATH $AUGUSTUS_CONFIG_PATH \
--species=Ccanadensis

Run braker.

run_braker2.sh

Braker provided a braker.aa protein file. We'll run BUSCO on this file to check it for completeness with the following:

cd /pickett_sphinx/projects/EPP531_AGA/turner/M3Lab1/Braker/braker_hap1/
conda activate busco
busco --mode proteins -i braker.aa -o braker_aa_busco -l embryophyta -c 4

Upon running BUSCO, the output is the following:

        --------------------------------------------------
        |Results from dataset embryophyta_odb10           |
        --------------------------------------------------
        |C:72.5%[S:65.9%,D:6.6%],F:2.1%,M:25.4%,n:1614    |
        |1171   Complete BUSCOs (C)                       |
        |1064   Complete and single-copy BUSCOs (S)       |
        |107    Complete and duplicated BUSCOs (D)        |
        |34     Fragmented BUSCOs (F)                     |
        |409    Missing BUSCOs (M)                        |
        |1614   Total BUSCO groups searched               |
        --------------------------------------------------

EnTAP

Make and enter analysis directory for EnTAP.

mkdir /pickett_sphinx/projects/EPP531_AGA/turner/M3Lab1/EnTAP
cd /pickett_sphinx/projects/EPP531_AGA/turner/M3Lab1/EnTAP

Copy the braker protein file and enter it into the new directory and rename it.

cp /pickett_sphinx/projects/EPP531_AGA/turner/M3Lab1/Braker/braker_hap1/braker_aa_busco .
mv braker.aa Ccanadensis_protein_hap1.fasta

Load necessary dependencies.

spack load h7fm4ww #diamond
spack load rsem
spack load interproscan
spack load transdecoder

Run EnTAP.

/sphinx_local/software/EnTAP-1.0.0/bin/EnTAP \
--runP \
-i Ccanadensis_protein_hap1.fasta \
--ini /sphinx_local/software/EnTAP-1.0.0/entap_config_Oct2023.ini \
-d /sphinx_local/software/EnTAP-1.0.0/bin/uniprot_sprot.dmnd \
-t 5