Analysis log - nuriagaralon/genome-analysis GitHub Wiki

Analyses

1. Genome assembly

Assembly

30/03/20: Running Canu on the raw PacBio reads for the assembly. Four contigs were obtained. Two of them do not have enough coverage to be considered part of the genome. A third has higher coverage (covStat=75.14), but this is still very low compared to the fourth (covStat=5290.83). The fourth contig is 2,563,357 nt long and is suggested to be circular.
03/04/20: Running QUAST, MUMmer and Gepard to assess assembly quality, comparing the Canu contigs to the reference genome (assembled by Christel et al.). The plot indicates a large relocation in the genome, most likely because the genome is circular and the assembly does not start at the same position as the reference genome.
03/04/20: Running Circlator on the contigs.
04/04/20: Checking Gepard dot plots of reference vs reference and new assembly vs new assembly. Contig tig00000059 showed an unusual dot plot, so blastn was run with this contig against the nucleotide database. Since its size was similar to that of the phage Christel et al. had detected, I initially thought it was the phage, but the BLAST hits revealed it to be a PacBio synthetic sequence.
04/04/20: Run QUAST on tig4064 alone vs the reference genome.
05/04/20: Rerun QUAST with --gene-finding. It was previously run locally, but this time we try using UPPMAX.
06/04/20: Rerun Circlator because the previous run did not work. This time it exits properly.
06/04/20: Run QUAST on the Circlator output vs the reference genome and create the relevant Gepard dot plots. The circularisation seems to have been successful, but the translocation persists.
07/04/20: Run blastx on the beginning of both the reference genome and the Circlator output to check whether the first gene is the same. The first gene in the reference genome was not dnaA.
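
The relocation seen in the dot plots is what one would expect when two linear representations of the same circular molecule start at different positions. The following is a minimal Python sketch of that idea, not part of the original analysis: the function names and the toy sequence are hypothetical, and exact string matching only works for identical sequences, so it would not tolerate the mismatches or strand differences found between real assemblies.

```python
def is_rotation(seq_a: str, seq_b: str) -> bool:
    """Return True if seq_b is a circular rotation of seq_a (exact match only)."""
    # Every rotation of seq_a occurs as a substring of seq_a + seq_a.
    return len(seq_a) == len(seq_b) and seq_b in seq_a + seq_a


def rotate_to(seq: str, start: int) -> str:
    """Re-linearise a circular sequence so that it begins at index `start`."""
    if not seq:
        return seq
    start %= len(seq)
    return seq[start:] + seq[:start]


# Toy example with a made-up sequence (not real data).
reference = "ATGCCGTTAGGA"
assembly = rotate_to(reference, 5)       # same circle, different start point
print(assembly)                          # GTTAGGAATGCC
print(is_rotation(reference, assembly))  # True
```

In a dot plot, such a pair shows up as two diagonal segments offset from the main diagonal, which can look like a large relocation even though the underlying circle is the same.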

Annotation

08/04/20: Run Prokka and eggNOGmapper on the assembly.
27/04/20: Initial comparison between both annotations using R
12/05/20: Annotation refinement using R and BLAST

Synteny

08/04/20: Pick sequences for synteny analysis: Leptospirillum ferriphilum ML04 (out of curiosity), Leptospirillum ferrooxidans, Thermodesulfovibrio yellowstonii (2Mbp) and Nitrospira defluvii (4Mbp).
09/04/20: Synteny analysis using blastn, and visualisation with ACT and Circoletto

2. Differential expression analysis

Quality check and preprocessing

12/04/20: Quality check of reads with FastQC and evaluation with MultiQC
12/04/20: Read trimming with Trimmomatic using the parameters from the paper. However, only one set of paired reads is trimmed, as trimmed reads are already available. The available trimmed reads will be quality checked to see if they can be used.
12/04/20: Quality check of trimmed reads (the ones I trimmed).
16/04/20: Quality check of trimmed reads (the ones available). These are comparable to the reads I trimmed, so we will make use of the already available data.

Mapping

16/04/20: Mapping the reads to the assembly using BWA and piping the output into SAMtools to generate a .bam file for each of the runs (using the paired reads output from Trimmomatic).
22/04/20: Extracting the mapping statistics using SAMtools.
22/04/20: The statistics were identical between the technical replicates, so a Python script was written and executed to compare the reads themselves. The reads in the technical replicates were identical, so to save space and running time, further analyses will be conducted on only half of the files. Keeping the duplicates might also bias further analyses, as we would be analysing twice the data we actually have.
08/05/20: Indexing the BAM files with SAMtools for visualisation. Failed to visualise the files with IGV due to lack of memory.
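
The comparison script itself is not reproduced in this log. A minimal sketch of how such a check could look is given below, assuming standard four-line FASTQ records and optional gzip compression. The function names are hypothetical, and comparing only the sequence lines (ignoring read names and qualities) is an assumption about what "identical" means here.

```python
import gzip
from collections import Counter


def fastq_sequences(path):
    """Yield the sequence line of every record in a 4-line-per-record FASTQ file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # second line of each record is the sequence
                yield line.strip()


def same_reads(path_a, path_b) -> bool:
    """True if both files contain the same multiset of read sequences."""
    return Counter(fastq_sequences(path_a)) == Counter(fastq_sequences(path_b))
```

Calling `same_reads("rep1.fastq.gz", "rep2.fastq.gz")` (hypothetical file names) returns `True` when the two replicates hold the same sequences, regardless of read order. For very large files, holding a `Counter` of all reads in memory may be too costly, and a streaming line-by-line comparison would be a lighter alternative when the read order is also the same.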
21/05/20: Visualisation of BAM files with Artemis

Read counting

23/04/20: Reads that mapped to CDS were counted using HTSeq
28/04/20: Analysis of mapped reads using R: what percentage of the reads maps? What percentage maps to CDS?
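
The original analysis was done in R. As an illustration of the arithmetic behind these two questions, here is a hedged Python sketch that works from the text of an htseq-count output file. It relies on htseq-count's special counters (lines whose names start with `__`, such as `__no_feature` and `__not_aligned`). The function name is hypothetical, and treating every counted record as one read is a simplification, since multi-mapping reads can be reported more than once.

```python
def summarise_htseq(lines):
    """Summarise htseq-count output lines of the form 'name<TAB>count'."""
    in_features = 0   # reads assigned to a feature (e.g. a CDS)
    special = {}      # htseq-count special counters, names start with '__'
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, count = line.split("\t")
        if name.startswith("__"):
            special[name] = int(count)
        else:
            in_features += int(count)
    total = in_features + sum(special.values())
    mapped = total - special.get("__not_aligned", 0)
    return {
        "total": total,
        "pct_mapped": 100 * mapped / total if total else 0.0,
        "pct_in_features": 100 * in_features / total if total else 0.0,
    }


# Made-up counts for illustration only.
example = [
    "geneA\t60",
    "geneB\t20",
    "__no_feature\t10",
    "__ambiguous\t0",
    "__too_low_aQual\t0",
    "__not_aligned\t10",
    "__alignment_not_unique\t0",
]
print(summarise_htseq(example))
```

Whether "in features" means "in CDS" depends on the feature type that was passed to htseq-count, which is assumed here to be CDS as stated in the log.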

Differential expression analysis

24/04/20: Loading differential expression data into R
25/04/20: Initial analysis using DESeq2, PCA plot
08/05/20: Volcano plot from DESeq2 data
12/05/20: Heat map of the most significantly differentially expressed genes from DESeq2 data
13/05/20: Differential expression of different functional categories from annotation and DESeq2 data
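
For the volcano plot entry above, each gene is placed at its log2 fold change on the x-axis and the negative log10 of its adjusted p-value on the y-axis. The sketch below shows that transformation in Python rather than the R used in the analysis. The thresholds (`alpha=0.05`, `min_abs_lfc=1.0`) and the function name are assumptions chosen for illustration, not values taken from this log.

```python
import math


def volcano_point(log2_fc, padj, alpha=0.05, min_abs_lfc=1.0):
    """Return (x, y, label) for one gene, or None if padj is missing."""
    # DESeq2 can report NA for padj; such genes are skipped here.
    if padj is None or math.isnan(padj):
        return None
    y = -math.log10(padj) if padj > 0 else math.inf
    if padj < alpha and log2_fc >= min_abs_lfc:
        label = "up"
    elif padj < alpha and log2_fc <= -min_abs_lfc:
        label = "down"
    else:
        label = "ns"  # not significant under the assumed thresholds
    return (log2_fc, y, label)


# Made-up values for illustration only.
for gene, lfc, padj in [("g1", 2.0, 0.001), ("g2", -1.5, 0.01), ("g3", 3.0, 0.2)]:
    print(gene, volcano_point(lfc, padj))
```

The labels can then be used to colour the points, so that genes passing both thresholds stand out from the rest.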

Metabolic and functional analysis

23/05/20: Analysis using KEGG Mapper and the KO numbers from eggNOGmapper

Comparative genomics

24/05/20: Comparative genomics using BLAST

Used software

Software	Links	Run details	Analysis
Artemis	web - man	Locally v18.1.0	Mapped reads visualisation
BLAST	web - man	UPPMAX v2.9.0+	Contig verification, synteny analysis, annotation refinement, comparative genomics
BWA	man	UPPMAX v0.7.17	Mapping RNA reads
Canu	man	UPPMAX v1.8	Correction and trimming of PacBio reads, genome assembly
Circlator	web - man	UPPMAX v1.5.5	Circularisation of the assembly
Circoletto	info	Online v07.09.16	Synteny visualisation
DESeq2	web - man	Locally in R v1.26.0	Differential expression analysis
eggNOGmapper	man	Online v2.0.0	Functional annotation
FastQC	web - man	UPPMAX v0.11.8	Illumina reads Quality Check
Gepard	web	Locally v1.40	Assembly vs Reference dot plot
HTSeq	man	UPPMAX v0.9.1	Read counting
KEGG Mapper	web	Online Updated July 2019	Metabolic and functional analysis
MultiQC	web - man	UPPMAX v1.8	Summary of quality analyses for all reads
MUMmer	man	UPPMAX v3.23	Assembly vs Reference dot plot
Prokka	man	UPPMAX v1.12	Genome annotation
QUAST	man	Locally v5.0.2	Assembly evaluation
SAMtools	man	UPPMAX v1.10	Handling read mapping files
Trimmomatic	man	UPPMAX v0.36	Trimming Illumina reads

Other software

GitHub was used for version control on the code and to store this wiki with analysis information.
ssh was used to connect to UPPMAX
bash was used as the scripting language to execute commands
grep was used to check outputs (for example, to find how many contigs were present in Canu's output)
awk was used to handle FASTA files (for example, to separate multi-FASTA files into multiple single FASTA files) and summarise BLAST outputs.
sed was used to manipulate some files
R was used for handling data from BAM file statistics, HTSeq counts and annotations, as well as to conduct differential expression analysis using DESeq2
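
The grep and awk tasks mentioned above (counting contigs and splitting a multi-FASTA file) were done on the command line. As a rough equivalent, here is a Python sketch rather than the original commands. The function name is hypothetical, and it assumes record IDs are unique, since a repeated ID would overwrite the earlier record.

```python
def split_fasta(text):
    """Split multi-FASTA text into {record_id: single-record FASTA text}."""
    records = {}
    current = None
    for line in text.splitlines():
        if line.startswith(">"):
            # The record ID is the first word after '>'.
            fields = line[1:].split()
            current = fields[0] if fields else f"record_{len(records) + 1}"
            records[current] = [line]
        elif current is not None and line.strip():
            records[current].append(line)
    return {name: "\n".join(lines) + "\n" for name, lines in records.items()}


# Toy input (made-up sequences, not real contigs).
multi = ">tig1 len=6\nACGT\nAC\n>tig2\nGG\n"
singles = split_fasta(multi)
print(len(singles))     # number of records, like counting '>' lines with grep
print(singles["tig2"])
```

Each value in the returned dictionary can be written to its own file, which corresponds to separating a multi-FASTA file into single FASTA files.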