Methods - HeidiOttesen/GenomeAnalysis GitHub Wiki

A strain of Enterococcus faecium clade-A was collected from a rectal swab of patient committed to a nephrology ward in the Netherlands in 2000, the bacteria was grown in two different environments - human serum and in a medium at 37°C.

Sequencing was done on the instrument illumina HiSeq 2500 to obtain DNA-, RNA- and Transposon-seq data, DNA was also sequenced on the RS II SMRT from Pacific Biosciences and the MinION 2D from Oxford NanoPore. Click here - EBI to see more details about the sequencing methods.

Computational analysis methods

1. Preprocessing

1.1 FastQC

Quality control of the raw sequencing data was done with FastQC version 0.11.9 on all DNA and RNA raw read files from illumina. First it shows a table with basic statistics which include total number of sequences in the file in question, the read length in basepairs and GC content in %. Then it shows a diagram of the Per base sequence quality, the distribution of quality scores across all bases depending on position in the read, see example below. The report also includes information about GC content, duplication levels, adapters etc. There are no signs of adapters. Illumina DNA Raw Read 1 Illumina DNA Raw Read 2

1.2 Trimmomatic

Trimmomatic version 0.36 was used to remove low quality bases and reads. FastQC on the trimmed data did not show a great difference, the "raw" read files seem to have been trimmed before I received them so I will continue with assembly and transcriptomics using the original files received.

module load bioinfo-tools
module load trimmomatic/0.36 
module load java/sun_jdk1.8.0_151


id="illumina"

read1=$1 #forward
read2=$2 #reverse

java -jar $TRIMMOMATIC_HOME/trimmomatic.jar PE -phred33 $read1 $read2 ${id}.trimmomatic_r1_P ${id}.trimmomatic_r1_U ${id}.trimmomatic_r2_P ${id}.trimmomatic_r2_U TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

2. Assembly

2.1. Canu

Assembly of the PacBio DNA reads, all 6 files, was done with canu assembler version 2.0. Canu does a quality control, correction, trimming and de novo assembly.

module load bioinfo-tools
module load canu/2.0

id="E745"
output="/domus/h1/heidio/GenomeAnalysis/03_Assembly/1_PacBio/"

read1=$1
read2=$2
read3=$3
read4=$4
read5=$5
read6=$6

canu -p $id -d $output genomeSize=4.5m -pacbio $read1 $read2 $read3 $read4 $read5 $read6

2.2. Spades

Spades version 3.14.1 was also used as de novo assembler on the different DNA reads. First for only illumina seperately, then with combinations, illumina and NanoPore and lastly illumina with NanoPore and PacBio.

module load bioinfo-tools 
module load spades/3.14.1    #spades/3.14.1 latest version in April 2021

read1="/domus/h1/heidio/GenomeAnalysis/01_RawData/GenomicsData/Illumina/E745-1.L500_SZAXPI015146-56_1_clean.fq.gz"
read2="/domus/h1/heidio/GenomeAnalysis/01_RawData/GenomicsData/Illumina/E745-1.L500_SZAXPI015146-56_2_clean.fq.gz"
#read3="/domus/h1/heidio/GenomeAnalysis/01_RawData/GenomicsData/Nanopore/E745_all.fasta.gz" #Nanopore
read4="/domus/h1/heidio/GenomeAnalysis/01_RawData/GenomicsData/PacBio/m131023_233432_42174_c100519312550000001823081209281335_s1_X0.1.subreads.fastq.gz" #PacBio
read5="/domus/h1/heidio/GenomeAnalysis/01_RawData/GenomicsData/PacBio/m131023_233432_42174_c100519312550000001823081209281335_s1_X0.2.subreads.fastq.gz"
read6="/domus/h1/heidio/GenomeAnalysis/01_RawData/GenomicsData/PacBio/m131023_233432_42174_c100519312550000001823081209281335_s1_X0.3.subreads.fastq.gz"
read7="/domus/h1/heidio/GenomeAnalysis/01_RawData/GenomicsData/PacBio/m131024_200535_42174_c100563672550000001823084212221342_s1_p0.1.subreads.fastq.gz"
read8="/domus/h1/heidio/GenomeAnalysis/01_RawData/GenomicsData/PacBio/m131024_200535_42174_c100563672550000001823084212221342_s1_p0.2.subreads.fastq.gz"
read9="/domus/h1/heidio/GenomeAnalysis/01_RawData/GenomicsData/PacBio/m131024_200535_42174_c100563672550000001823084212221342_s1_p0.3.subreads.fastq.gz"

#id="EfaeciumE745"
outdir="/domus/h1/heidio/GenomeAnalysis/03_Assembly/5_illumina_PacBio/"

spades.py --careful -t 5 -k auto -o $outdir --pe1-1 $read1 --pe1-2 $read2 --pacbio -s1 $read4 -s2 $read5 -s3 $read6 -s4 $read7 -s5 $read8 -s6 $read9

2.3. Pilon

Pilon version 1.24 was used to check for errors and correct based on consensus. First I did BWA-mapping illumina and NanoPore reads onto the canu PacBio assembly and then combining with Pilon.

module load bioinfo-tools
module load samtools/1.10
module load bwa/0.7.17

ref="/domus/h1/heidio/GenomeAnalysis/03_Assembly/1_PacBio/E745.contigs.fasta" #eg pacbio contigs
read1="/domus/h1/heidio/GenomeAnalysis/03_Assembly/3_illumina_Nanopore/contigs.fasta"
#read1="/proj/g2021012/1_Zhang_2017/genomics_data/Illumina/E745-1.L500_SZAXPI015146-56_1_clean.fq.gz"
#read2="/proj/g2021012/1_Zhang_2017/genomics_data/Illumina/E745-1.L500_SZAXPI015146-56_2_clean.fq.gz"

id="BWA_illumina+Nanopore_contigs_on_PacBio"

bwa index $ref

bwa mem -t 5 $ref $read1 > ${id}.sam

samtools view -b ${id}.sam > ${id}.bam
samtools sort -o ${id}_sorted.bam ${id}.bam
samtools index ${id}_sorted.bam
module load bioinfo-tools
module load Pilon/1.24
module load java/sun_jdk1.8.0_151


ref="/domus/h1/heidio/GenomeAnalysis/03_Assembly/contig_links/PacBio.contigs.fasta"
mapped_reads="/domus/h1/heidio/GenomeAnalysis/03_Assembly/4_BWA/2_illumina_Nanopore_contigs_on_PacBio/BWA_illumina+Nanopore_contigs_on_PacBio_sorted.bam"
outdir="/domus/h1/heidio/GenomeAnalysis/03_Assembly/5_Pilon_combined/"
id="Pilon_illumina_Nanopore_PacBio"

java -jar $PILON_HOME/pilon.jar --genome $ref --frags $mapped_reads --outdir $outdir --output $id

3. Assembly Evaluation

3.1. Quast

Quast is a tool which calculates parameters to show the quality of one or more assemblies. Quast version 4.5.4 was run on all assemblies, including reference assembly from article.

module load bioinfo-tools
module load quast/4.5.4 

out="/domus/h1/heidio/GenomeAnalysis/04_Assembly_evaluation/2_Quast/2021.04.15_All"
mkdir $out
#input=$1
indir="/domus/h1/heidio/GenomeAnalysis/03_Assembly/contig_links"

#quast.py -t 2 --scaffolds -o $out $input
quast.py -t 2 --scaffolds -o $out $indir/*.fasta

Quast Report Quast plot

3.2. MUMmerplot

To plot my assembly against the reference E745 from the article I used MUMmerplot

module load bioinfo-tools
module load MUMmer/3.23
module load gnuplot/5.2.7

mummerplot -p mummerplot mummer_Pacbio_vs_illumina_Nanopore.mum
gnuplot mummer_Pacbio_vs_illumina_Nanopore.gp

Gnuplot E745 reference vs my E745

4. Annotation

4.1. PROKKA

Functional and Structural annotation was done using PROKKA

module load bioinfo-tools
module load prokka/1.45-5b58020

ref="/domus/h1/heidio/GenomeAnalysis/03_Assembly/0_ReferenceGenome/Efaecium_ref.fna"
read1=/domus/h1/heidio/GenomeAnalysis/03_Assembly/contig_links/Pilon_illumina_Nanopore_PacBio.fasta
id="PacBio_illumina_NanoPore"
outdir="/domus/h1/heidio/GenomeAnalysis/04_Assembly_evaluation/1_PROKKA/3_PB_illumina_NanoPore"

prokka --outdir $outdir --prefix $id $read1

Artemis - PROKKA annotation

5. Homology and synteny

5.1. blastn

I downloaded reference genomes of three other species (E. faecalis, E. casseliflavus, L. monocytogenes) and a different representative strain of E. faecium. Then I ran blastn to do pair-wise whole genome alignment between my assembly and the downloaded genomes.

module load bioinfo-tools
module load blast/2.11.0+


cat $0
query=/domus/h1/heidio/GenomeAnalysis/05_synteny/genomes_other/L_monocytogenes_GCF_000196035.1_ASM19603v1_genomic.fna
subject=/domus/h1/heidio/GenomeAnalysis/03_Assembly/5_Pilon_combined/Pilon_illumina_Nanopore_PacBio.fasta

blastn -query $query -subject $subject -outfmt '6' > blast_align_EfaeciumE745_L_monocytogenes.fasta

And visualized using Artemis act: E. faecium representative reference genome vs my E745 strain E. faecalis vs E745 E. casseliflavus vs E745 L. monocytogenes vs E745

6. Transcriptomics

6.1. BWA Mapping

Mapping RNA-seq transcripts unto my annotated assembly was done using BWA.

module load bioinfo-tools
module load samtools/1.10
module load bwa/0.7.17

refdir="/domus/h1/heidio/GenomeAnalysis/03_Assembly/contig_links/"
indir=/domus/h1/heidio/GenomeAnalysis/01_RawData/TranscriptomicsData/RNA-Seq_BH
outdir=/domus/h1/heidio/GenomeAnalysis/03_Assembly/6_Transcriptomic/1_BWA/RNA-Seq_BH

#Make temp folders:
tmpdir=$SNIC_TMP/align
mkdir $tmpdir
mkdir $tmpdir/fq
cd $tmpdir

#copy files to temp folders
cp $indir/trim_paired_ERR1797973_pass_1.fastq.gz $tmpdir/fq/
cp $indir/trim_paired_ERR1797973_pass_2.fastq.gz $tmpdir/fq/
cp $refdir/Pilon_illumina_Nanopore_PacBio.fasta $tmpdir/
cd $tmpdir

#unzip fasta files:
for v in fq/*
do
echo "Unzipping file: " $v
gunzip $v
done

ls -ahl
cd $tmpdir/fq/
ls -ahl

id="BWA_RNA_BHI_paired_ERR1797973"
ref=$tmpdir/Pilon_illumina_Nanopore_PacBio.fasta

read1=trim_paired_ERR1797973_pass_1.fastq
read2=trim_paired_ERR1797973_pass_2.fastq

#run BWA
bwa index $ref
bwa mem -t 2 $ref $read1 $read2 > ${id}.sam
samtools sort -o ${id}_sorted.bam ${id}.sam
samtools index ${id}_sorted.bam

cp *.bam *.bai $outdir/

(Example statistics from BWA mapping of RNAseq medium *72)

27538381 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
24629 + 0 supplementary
0 + 0 duplicates
27168853 + 0 mapped (98.66% : N/A)
27513752 + 0 paired in sequencing
13756876 + 0 read1
13756876 + 0 read2
26638208 + 0 properly paired (96.82% : N/A)
27143088 + 0 with itself and mate mapped
1136 + 0 singletons (0.00% : N/A)
2470 + 0 with mate mapped to a different chr
1298 + 0 with mate mapped to a different chr (mapQ>=5)

6.2. Read counting (HTSeq)

Counting the number of transcript reads to every gene was done using HTSeq

module load bioinfo-tools
module load htseq/0.12.4
module load python/3.7.2

indir=/proj/g2021012/nobackup/work/heidio
outdir=/domus/h1/heidio/GenomeAnalysis/03_Assembly/6_Transcriptomic/2_HTSeq

#Make folders
tmpdir=$SNIC_TMP/HTSeq
mkdir $tmpdir
mkdir $tmpdir/bam
cd $tmpdir

#copy files to temp
cp /domus/h1/heidio/GenomeAnalysis/04_Assembly_evaluation/1_PROKKA/3_PB_illumina_NanoPore/PacBio_illumina_NanoPore.gff $tmpdir/
cp $indir/*.bam* $tmpdir/bam/

ls -ahl $tmpdir

gff=$tmpdir/PacBio_illumina_NanoPore.gff

for v in $tmpdir/bam/*.bam
do
echo #Currently iterating: " $v
htseq-count -t CDS -i locus_tag -f bam $v $gff > ${v}_HTS.txt
done

ls -ahl $tmpdir/
ls -ahl $tmpdir/bam/

cp -r $tmpdir/bam/*txt $outdir

6.3. Differential Expression (deseq2)

For evaluating the differential expression statistics I used DeSeq2 in RStudio LFC MA Plot LFC shrink MA Plot After FDR Volcano plot Top 10 genes of interest