Analysis log - nuriagaralon/genome-analysis GitHub Wiki
Analyses
1. Genome assembly
Assembly
- 30/03/20: Running Canu on PacBio raw reads for the assembly. Four contigs were obtained. Two of them do not have enough coverage to be considered part of the genome. A third one has more coverage (`covStat=75.14`) but it is still very little compared to the last one (`covStat=5290.83`). This last contig has a sequence length of 2,563,357 nt, and it is suggested to be circular.
- 03/04/20: Running QUAST, MUMmer and Gepard to assess assembly quality, comparing the Canu contigs to the reference genome (assembled by Christel et al.). The plot indicates a large relocation in the genome, most likely because the genome is circular and the assembly does not start at the same position as the reference genome.
- 03/04/20: Running Circlator on the contigs.
- 04/04/20: Checking Gepard dot plots of reference vs reference and new assembly vs new assembly. Contig tig00000059 showed a weird dot plot, so blastn of this contig against the nucleotide database was run. Initially, since its size was similar to that of the phage Christel et al. had detected, I thought it was the phage, but the BLAST hits revealed it to be a PacBio synthetic sequence.
- 04/04/20: run QUAST on only tig4064 vs the reference genome.
- 05/04/20: rerun QUAST with `--gene-finding`. Previously it was run locally but this time we try using UPPMAX.
- 06/04/20: rerun Circlator because the previous run did not work. This time it exits properly.
- 06/04/20: run QUAST on the Circlator output vs the reference genome. Creating relevant Gepard dot plots. It seems that the circularisation was successful but the translocation persists.
- 07/04/20: run blastx on the beginning of both the reference genome and the output from Circlator: is the gene at the beginning the same? The first gene was not dnaA for the reference genome.
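The large relocation seen in the dot plots is what one would expect when the same circular sequence is linearised at two different positions. A minimal sketch in Python with a toy sequence (not the real contigs); the function names `rotate` and `is_rotation` are illustrative and not part of the analysis scripts, and reverse-complement orientation is ignored:

```python
def rotate(seq: str, start: int) -> str:
    """Re-linearise a circular sequence so that it begins at index `start`."""
    if not seq:
        return seq
    start %= len(seq)
    return seq[start:] + seq[:start]


def is_rotation(a: str, b: str) -> bool:
    """True if `b` is the same circle as `a`, only opened at another position."""
    return len(a) == len(b) and b in a + a


# Toy example: the "assembly" is the "reference" opened 5 nt further along.
reference = "ATGCGTTAGCCA"
assembly = rotate(reference, 5)

print(assembly)
print(is_rotation(reference, assembly))
```

In a dot plot, two such sequences show two offset diagonals instead of one, which looks like a relocation even though the underlying circle is identical.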
Annotation
- 08/04/20: Run Prokka and eggNOGmapper on the assembly.
- 27/04/20: Initial comparison between both annotations using R
- 12/05/20: Annotation refinement using R and BLAST
Synteny
- 08/04/20: Pick sequences for synteny analysis: Leptospirillum ferriphilum ML04 (out of curiosity), Leptospirillum ferrooxidans, Thermodesulfovibrio yellowstonii (2Mbp) and Nitrospira defluvii (4Mbp).
- 09/04/20: Synteny analysis using blastn, and visualisation on ACT and Circoletto
2. Differential expression analysis
Quality check and preprocessing
- 12/04/20: Quality check of reads with FastQC and evaluation with MultiQC
- 12/04/20: Read trimming with Trimmomatic and the parameters used for the paper. However, only one set of paired reads was trimmed, because trimmed reads are already available. The available trimmed reads will be quality checked to see if they can be used.
- 12/04/20: Quality check of trimmed reads (the ones I trimmed).
- 16/04/20: Quality check of trimmed reads (the ones available). The reads I trimmed and these are comparable, so we will make use of the already available data.
Mapping
- 16/04/20: Mapping the reads to the assembly using BWA and piping the output into SAMtools to generate a `.bam` file for each of the runs (using the paired reads output from Trimmomatic).
- 22/04/20: Extracting the mapping statistics using SAMtools.
- 22/04/20: The statistics were identical between the technical replicates, so a Python script was written and executed to compare the reads among themselves. The reads in the technical replicates were identical, so to save space and running time further analyses would be conducted on only half of the files. If the duplicates were kept, they might also bias further analyses, as we would be analysing double the data we actually have.
- 08/05/20: Indexing the BAM files with samtools for visualisation. Failed to visualise the files with IGV due to lack of memory.
- 21/05/20: Visualisation of BAM files with Artemis
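The comparison of reads between technical replicates could look roughly like the sketch below. This is not the script that was actually run; it is a minimal illustration that assumes well-formed 4-line FASTQ records and compares the multiset of (sequence, quality) pairs, so read order and read names do not matter. The names `fastq_records` and `same_reads` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def fastq_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence, quality) for each 4-line FASTQ record."""
    it = iter(lines)
    # zip over the same iterator four times groups the lines in fours.
    for _header, seq, _plus, qual in zip(it, it, it, it):
        yield seq.rstrip("\n"), qual.rstrip("\n")


def same_reads(lines_a: Iterable[str], lines_b: Iterable[str]) -> bool:
    """True if both inputs contain exactly the same reads, in any order."""
    return Counter(fastq_records(lines_a)) == Counter(fastq_records(lines_b))


# Toy replicates: same two reads, different order and different read names.
rep1 = ["@r1\n", "ACGT\n", "+\n", "IIII\n", "@r2\n", "TTGA\n", "+\n", "IIHH\n"]
rep2 = ["@x2\n", "TTGA\n", "+\n", "IIHH\n", "@x1\n", "ACGT\n", "+\n", "IIII\n"]

print(same_reads(rep1, rep2))
```

For real files, the same functions accept an open file handle (for example from `gzip.open(path, "rt")`). Holding every read in a `Counter` uses a lot of memory on large files, so this is only meant to show the idea.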
Read counting
- 23/04/20: Reads that mapped to CDS were counted using HTSeq
- 28/04/20: Analysis of mapped reads using R: what percentage of the reads maps? What percentage maps to CDS?
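The two questions above (what percentage maps, and what percentage maps to CDS) can be answered from an `htseq-count` output table alone. The analysis itself was done in R; the following is a Python sketch with made-up toy counts. It relies on the special counters that `htseq-count` appends at the end of its output (`__no_feature`, `__ambiguous`, `__too_low_aQual`, `__not_aligned`, `__alignment_not_unique`), and it assumes that every read (or read pair) lands in exactly one row, which depends on the options used. The function name `summarise_htseq` is hypothetical.

```python
from typing import Dict, Iterable


def summarise_htseq(lines: Iterable[str]) -> Dict[str, float]:
    """Summarise a tab-separated htseq-count table (feature<TAB>count)."""
    assigned = 0                    # reads counted for a feature (here: CDS)
    special: Dict[str, int] = {}    # the "__" counters at the end of the table
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, count = line.split("\t")
        if name.startswith("__"):
            special[name] = int(count)
        else:
            assigned += int(count)

    total = assigned + sum(special.values())
    if total == 0:
        raise ValueError("count table is empty")
    not_aligned = special.get("__not_aligned", 0)
    return {
        "total": total,
        "pct_mapped": 100 * (total - not_aligned) / total,
        "pct_assigned": 100 * assigned / total,
    }


# Toy table, not real data.
toy = [
    "geneA\t60", "geneB\t20",
    "__no_feature\t10", "__ambiguous\t0", "__too_low_aQual\t0",
    "__not_aligned\t10", "__alignment_not_unique\t0",
]
print(summarise_htseq(toy))
```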
Differential expression analysis
- 24/04/20: Loading differential expression data into R
- 25/04/20: Initial analysis using DESeq2, PCA plot
- 08/05/20: Volcano plot from DESeq2 data
- 12/05/20: Heat map of the most significantly differentially expressed genes from DESeq2 data
- 13/05/20: Differential expression of different functional categories from annotation and DESeq2 data
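A volcano plot places each gene at (log2 fold change, -log10 adjusted p-value) and usually colours it by whether it passes both a significance and an effect-size cut-off. The plots here were made in R from DESeq2 results; the Python sketch below only illustrates the classification step. The thresholds (`alpha=0.05`, `min_lfc=1.0`) and the function names are assumptions for illustration, not the values used in the analysis. DESeq2 can report `NA` for the adjusted p-value, which is represented here as `None` or NaN.

```python
import math
from typing import Optional


def neg_log10(padj: float) -> float:
    """y coordinate of the volcano plot; padj of 0 maps to infinity."""
    return math.inf if padj <= 0 else -math.log10(padj)


def classify(log2fc: float, padj: Optional[float],
             alpha: float = 0.05, min_lfc: float = 1.0) -> str:
    """Label a gene for colouring in a volcano plot."""
    if padj is None or math.isnan(padj):
        return "not tested"          # e.g. filtered out by DESeq2
    if padj < alpha and log2fc >= min_lfc:
        return "up"
    if padj < alpha and log2fc <= -min_lfc:
        return "down"
    return "not significant"


# Toy values, not real results.
for gene, lfc, padj in [("g1", 2.3, 0.001), ("g2", -1.5, 0.01),
                        ("g3", 0.4, 0.0001), ("g4", 3.0, None)]:
    y = None if padj is None else neg_log10(padj)
    print(gene, lfc, y, classify(lfc, padj))
```

Here "up" and "down" are relative to whichever condition is the numerator of the DESeq2 contrast.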
Metabolic and functional analysis
- 23/05/20: Analysis using KEGG Mapper and the ko numbers from eggNOGmapper
Comparative genomics
- 24/05/20: Comparative genomics using BLAST
Used software
Software | Links | Run details | Analysis |
---|---|---|---|
Artemis | web - man | Locally v18.1.0 | Mapped reads visualisation |
BLAST | web - man | UPPMAX v2.9.0+ | Contig verification, synteny analysis, annotation refinement, comparative genomics |
BWA | man | UPPMAX v0.7.17 | Mapping RNA reads |
Canu | man | UPPMAX v1.8 | Correction and trimming of PacBio reads, genome assembly |
Circlator | web - man | UPPMAX v1.5.5 | Circularisation of the assembly |
Circoletto | info | Online v07.09.16 | Synteny visualisation |
DESeq2 | web - man | Locally in R v1.26.0 | Differential expression analysis |
eggNOGmapper | man | Online v2.0.0 | Functional annotation |
FastQC | web - man | UPPMAX v0.11.8 | Illumina reads Quality Check |
Gepard | web | Locally v1.40 | Assembly vs Reference dot plot |
HTSeq | man | UPPMAX v0.9.1 | Read counting |
KEGG Mapper | web | Online Updated July 2019 | Metabolic and functional analysis |
MultiQC | web - man | UPPMAX v1.8 | Summary of quality analyses for all reads |
MUMmer | man | UPPMAX v3.23 | Assembly vs Reference dot plot |
Prokka | man | UPPMAX v1.12 | Genome annotation |
QUAST | man | Locally v5.0.2 | Assembly evaluation |
SAMtools | man | UPPMAX v1.10 | Handling read mapping files |
Trimmomatic | man | UPPMAX v0.36 | Trimming Illumina reads |
Other software
- GitHub was used for version control on the code and to store this wiki with analysis information.
- ssh was used to connect to UPPMAX
- bash was used as a language to execute commands in different ways
- grep was used to check outputs (for example, to find how many contigs were present in Canu's output)
- awk was used to handle FASTA files (for example, to separate multi-FASTA files into multiple single FASTA files) and summarise BLAST outputs.
- sed was used to manipulate some files
- R was used for handling data from BAM files statistics, HTSeq counts and annotations, as well as to conduct differential expression analysis using DESeq2
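The FASTA handling mentioned above was done with grep and awk. For readers who prefer Python, the same two tasks (counting records and separating a multi-FASTA into single records) can be sketched as follows. The function names and the toy sequences are illustrative only and are not taken from the analysis.

```python
from typing import Dict, Iterable, List


def count_records(lines: Iterable[str]) -> int:
    """Number of FASTA records, i.e. of header lines starting with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def split_multifasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record ID (first word of the header) to its single-FASTA text."""
    records: Dict[str, List[str]] = {}
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split()
            # Fall back to a generated name if the header is empty.
            current = fields[0] if fields else f"record{len(records) + 1}"
            records[current] = [line]
        elif current is not None and line:
            records[current].append(line)
    return {name: "\n".join(parts) + "\n" for name, parts in records.items()}


# Toy multi-FASTA, not real contigs.
toy = [">contig_1 len=8\n", "ACGTACGT\n", ">contig_2 len=4\n", "TTGA\n"]

print(count_records(toy))
for name, text in split_multifasta(toy).items():
    print(name, repr(text))
```

Each value of the returned dictionary can then be written to its own file, for example `<name>.fasta`.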