Home - bennestor/hakea_genome GitHub Wiki
The native southwest Australian plant species Hakea prostrata (Proteaceae) is a model species for studying novel high nutrient-use-efficiency traits that have evolved in plants. These traits can guide the development of crop varieties that tolerate low-nutrient conditions, greatly reducing the need for expensive and environmentally damaging fertilisers. However, little is known about the genes underlying these traits in H. prostrata, and no genome sequence resources have yet been published for southwest Australian Proteaceae. In this PhD project, carried out in the Applied Bioinformatics Lab at the University of Western Australia, I assembled a reference genome for H. prostrata and analysed the diversity of its nutrient transporter gene families in comparison with sequences from a phylogenetically diverse range of plant species. In particular, I was interested in the gene families of the major phosphorus (P) and nitrate uptake transporters in plants:
- Phosphorus Transporter 1 (PHT1) for inorganic P uptake and transport
- Nitrate Transporter 2 (NRT2) for high-affinity nitrate uptake and transport
- Nitrate Transporter 1/Peptide Transporter Family (NPF) for low-affinity nitrate uptake and transport
The aims of the project were to:
- Assemble and annotate a reference genome for H. prostrata
- Identify orthologs of transporter gene families in H. prostrata
- Compare protein sequences of transporter gene families in H. prostrata to protein sequences from other plant species
- Quantify the expression of important transporter genes and determine differential expression between roots and leaves of H. prostrata
/scratch/pawsey0149/bnestor/2021_06_17_Genome for genome assembly and annotation
/scratch/pawsey0149/bnestor/2020_11_20_Transporters for transporter gene family identification and analysis
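The commands on this page write into a fixed set of subdirectories under the genome directory, and FastQC does not create its `-o` output directory itself. A minimal sketch of setting up that layout is below; `make_layout` is a hypothetical helper name, and the directory names are taken from the commands on this page rather than from a listing of the real scratch space.

```shell
# Sketch only: create the subdirectories that the QC commands write to.
# make_layout is a hypothetical helper, not part of the original workflow.
make_layout() {
    root="$1"   # e.g. "${MYSCRATCH}/2021_06_17_Genome"
    mkdir -p \
        "$root/0_raw_reads/illumina/fastqc_out" \
        "$root/0_raw_reads/nanopore/fastqc_out" \
        "$root/1_QC/1_trim_illumina_out" \
        "$root/1_QC/2_jellyfish_out" \
        "$root/1_QC/3_fastqc_illumina_out"
}

# Demo in a throwaway location so nothing on scratch is touched.
demo_root="$(mktemp -d)/2021_06_17_Genome"
make_layout "$demo_root"
find "$demo_root" -type d | sort
```

On the cluster the call would presumably be `make_layout "${MYSCRATCH}/2021_06_17_Genome"`.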
Data | Source | Total reads |
---|---|---|
'Hakea Nanopore long-reads' | NGS Analysis Results/bnestor/raw_data/nanopore/nanopore.fastq.gz | 6.83 million |
'Hakea Illumina short-reads' | NGS Analysis Results/bnestor/raw_data/illumina/illumina_R1.fastq.gz, NGS Analysis Results/bnestor/raw_data/illumina/illumina_R2.fastq.gz | 1.01 billion |
'Hakea RNAseq 11 hydroponics cluster roots, roots and mature leaf libraries' | NGS Analysis Results/bnestor/raw_data/rnaseq | 493 million |
'Hakea RNAseq 15 wild-growing leaf libraries' | NGS Analysis Results/bnestor/raw_data/rnaseq | 434 million |
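
The read totals in the table can be re-checked from the files themselves. A minimal sketch, assuming standard four-line FASTQ records, is below; `count_fastq_reads` is a hypothetical helper and the toy file stands in for the real data. For paired-end data the counts of the R1 and R2 files would need to be added if the total is meant per read rather than per pair.

```shell
# Sketch only: count reads in a gzipped FASTQ file as (number of lines) / 4.
# count_fastq_reads is a hypothetical helper; it assumes 4 lines per record.
count_fastq_reads() {
    gzip -cd "$1" | awk 'END { printf "%.0f\n", NR / 4 }'
}

# Toy input with two reads, so the sketch runs without the real data.
toy_gz="$(mktemp -d)/toy.fastq.gz"
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip -c > "$toy_gz"
count_fastq_reads "$toy_gz"   # prints 2
```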
1. FastQC on Illumina reads
1. Install FastQC v0.11.9 with conda
2. Run FastQC on Illumina reads
export WD="${MYSCRATCH}/2021_06_17_Genome/0_raw_reads/illumina" && fastqc -t 24 -o $WD/fastqc_out $WD/illumina_R1.fastq $WD/illumina_R2.fastq
Base qualities and sequence duplication levels are acceptable; the plots look unusual because plant genomes are rich in repeats. According to FastQC, the last 8 bp need trimming (tutorial on FastQC graphs: https://rtsf.natsci.msu.edu/genomics/tech-notes/fastqc-tutorial-and-faq/. Note: positions after 9 bp are grouped into 5 bp bins on the x axis).
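
Besides the HTML report, FastQC writes a tab-separated `summary.txt` (status, module name, file name) that can be scanned quickly for modules worth a closer look. A minimal sketch is below; `list_flagged_modules` is a hypothetical helper, and the toy statuses are invented for the demo rather than taken from this project's reports.

```shell
# Sketch only: print every FastQC module whose status is not PASS.
# list_flagged_modules is a hypothetical helper.
list_flagged_modules() {
    awk -F '\t' '$1 != "PASS" { print $1 ": " $2 }' "$1"
}

# Toy summary.txt in FastQC's three-column layout (invented statuses).
# With real output the file sits inside the <name>_fastqc.zip archive (assumed path).
summary="$(mktemp -d)/summary.txt"
printf 'PASS\tBasic Statistics\tillumina_R1.fastq\n' > "$summary"
printf 'WARN\tPer base sequence quality\tillumina_R1.fastq\n' >> "$summary"
printf 'FAIL\tSequence Duplication Levels\tillumina_R1.fastq\n' >> "$summary"
list_flagged_modules "$summary"
```

A WARN or FAIL is only a prompt to inspect the plot; as noted above, repeat-rich plant genomes can trip the duplication module without anything being wrong.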
2. FastQC on Nanopore reads
1. Run FastQC on Nanopore reads
export WD="${MYSCRATCH}/2021_06_17_Genome/0_raw_reads/nanopore" && fastqc -t 24 -o $WD/fastqc_out $WD/nanopore.fastq
Base qualities fall in FastQC's red zone, but that is expected for Nanopore data, and polishing should correct most of the resulting errors.
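
To put the quality scores in perspective, a Phred score Q corresponds to a per-base error probability of p = 10^(-Q/10). The sketch below works through that conversion; `phred_to_error` is a hypothetical helper, and the three Q values are round illustrative numbers, not measurements from these reads.

```shell
# Sketch only: convert a Phred quality score to a per-base error probability.
# phred_to_error is a hypothetical helper implementing p = 10^(-Q/10).
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", exp(-q / 10 * log(10)) }'
}

for q in 10 20 30; do
    printf 'Q%s -> %s\n' "$q" "$(phred_to_error "$q")"
done
# Q10 -> 0.1000
# Q20 -> 0.0100
# Q30 -> 0.0010
```

FastQC shades scores below roughly Q20 in red, which corresponds to more than about 1 error per 100 bases.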
3. Trim Illumina reads
1. Install fastp v0.20.1 with conda
export WD="${MYSCRATCH}/2021_06_17_Genome" && fastp --trim_tail1=7 -i $WD/0_raw_reads/illumina/illumina_R1.fastq -I $WD/0_raw_reads/illumina/illumina_R2.fastq -o $WD/1_QC/1_trim_illumina_out/illumina_R1_trim.fastq -O $WD/1_QC/1_trim_illumina_out/illumina_R2_trim.fastq
Trimming left 963 million reads in total.
2. FastQC of trimmed Illumina reads
export WD="${MYSCRATCH}/2021_06_17_Genome/1_QC" && fastqc -t 24 -o $WD/3_fastqc_illumina_out $WD/1_trim_illumina_out/illumina_R1_trim.fastq $WD/1_trim_illumina_out/illumina_R2_trim.fastq
Looks much better. The quality of read 2 dips slightly into the red over the last 5 bp, but that should be acceptable.
[Figure: FastQC "Per base sequence quality" plot; y axis: quality scores across all bases (Sanger / Illumina 1.9 encoding), x axis: position in read (bp).]
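
To make the tail trimming in step 3 concrete, the sketch below removes a fixed number of bases (and the matching quality characters) from the 3' end of every record in a toy FASTQ file. `trim_tail` is a hypothetical helper written in awk for illustration; it is not how fastp is implemented.

```shell
# Sketch only: drop the last n bases and quality characters of each FASTQ record.
# trim_tail is a hypothetical helper; it assumes 4 lines per record.
trim_tail() {
    awk -v n="$2" '
        NR % 4 == 2 || NR % 4 == 0 { print substr($0, 1, length($0) - n); next }
        { print }
    ' "$1"
}

# Toy record with a 12 bp read; trimming 7 bp leaves the first 5 bp.
toy_fq="$(mktemp -d)/toy.fastq"
printf '@r1\nACGTACGTACGT\n+\nIIIIIIIIIII#\n' > "$toy_fq"
trim_tail "$toy_fq" 7
# @r1
# ACGTA
# +
# IIIII
```

Fixed-length trimming alone does not discard reads. fastp also applies adapter trimming and quality and length filtering by default, which is the likely reason the read count dropped after trimming.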
4. Genome size estimation
1. Install jellyfish v2.2.10 with conda
2. Run jellyfish with 21-mers and 17-mers
export WD="${MYSCRATCH}/2021_06_17_Genome/1_QC" && jellyfish count -t 24 -s 5G -C -m 21 -o $WD/2_jellyfish_out/21mer_counts.jf $WD/1_trim_illumina_out/illumina_R*_trim.fastq
#With 17 kmer
export WD="${MYSCRATCH}/2021_06_17_Genome/1_QC" && jellyfish count -t 24 -s 5G -C -m 17 -o $WD/2_jellyfish_out/17mer_counts.jf $WD/1_trim_illumina_out/illumina_R*_trim.fastq
#Make histograms
jellyfish histo -t 24 -o 21mer_counts.histo 21mer_counts.jf
jellyfish histo -t 24 -o 17mer_counts.histo 17mer_counts.jf
3. Install findGSE v0.1.0 (https://github.com/schneebergerlab/findGSE/blob/master/INSTALL)
library("findGSE")
findGSE(histo="21mer_counts.histo", sizek=21, outdir="findGSE_21mer.out")
findGSE(histo="17mer_counts.histo", sizek=17, outdir="findGSE_17mer.out")
21mer estimated size is 808,961,425 bp
17mer estimated size is 683,001,275 bp
I will use the 21-mer estimate because k = 21 is recommended by tutorials and documentation, it will normalise less, and the estimate is closer to the genome size of a previous H. prostrata short-read genome assembly.
4. GenomeScope v1.0
21mer
Size: 568,186,619 - 568,386,090 bp
Heterozygosity: 1.23834% - 1.24346%
Results: http://genomescope.org/analysis.php?code=rsfivzAEUdZaRTY4eJn1
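
Both findGSE and GenomeScope fit models to the k-mer histogram, but the idea they share can be shown with a much cruder calculation: genome size is roughly the total number of k-mer instances divided by the k-mer depth at the main peak. The sketch below is that naive estimator only, not either tool's method. `naive_genome_size` and its `min_depth` cutoff (used to skip low-depth, error-derived k-mers) are hypothetical, and the histogram is toy data in jellyfish's two-column layout (depth, number of distinct k-mers).

```shell
# Sketch only: naive genome size = total k-mer instances / depth of the tallest peak.
# naive_genome_size is a hypothetical helper, not findGSE's or GenomeScope's model.
# $1 = histogram file (depth, count); $2 = assumed minimum depth to include.
naive_genome_size() {
    awk -v min="$2" '
        $1 >= min {
            total += $1 * $2                       # k-mer instances at this depth
            if ($2 > peak_count) { peak_count = $2; peak_depth = $1 }
        }
        END { if (peak_depth > 0) printf "%.0f\n", total / peak_depth }
    ' "$1"
}

# Toy histogram: an error tail at depths 1-3 and one peak centred on depth 10.
histo="$(mktemp -d)/toy.histo"
printf '1 1000\n2 100\n3 10\n8 20\n9 50\n10 100\n11 50\n12 20\n' > "$histo"
naive_genome_size "$histo" 5   # prints 240 (2400 k-mer instances / depth 10)
```

With heterozygosity above 1%, the tallest peak in a real histogram can be the heterozygous one at about half the homozygous depth, which would roughly double this naive estimate. That is one reason to rely on model-based tools rather than a single peak.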