Home - bennestor/hakea_genome GitHub Wiki

Introduction

Background

The native southwest Australian plant species Hakea prostrata (Proteaceae) is a model species for studying novel high nutrient-use-efficiency traits that have evolved in plants. These traits can guide the development of crop varieties that tolerate low-nutrient conditions, greatly reducing the need for expensive and environmentally damaging fertilisers. However, little is known about the genes underlying these traits in H. prostrata, and no genome sequence resources have yet been published for southwest Australian Proteaceae. In this PhD project, carried out in the Applied Bioinformatics Lab at the University of Western Australia, I assembled a reference genome for H. prostrata and analysed the diversity of its nutrient transporter gene families in comparison to sequences from a wide phylogenetic diversity of plant species. In particular, I was interested in the gene families of the major phosphorus (P) and nitrate uptake transporters in plants:

  • Phosphorus Transporter 1 (PHT1) for inorganic P uptake and transport
  • Nitrate Transporter 2 (NRT2) for high-affinity nitrate uptake and transport
  • Nitrate Transporter 1/Peptide Transporter Family (NPF) for low-affinity nitrate uptake and transport

Aims

  • Assemble and annotate a reference genome for H. prostrata
  • Identify orthologs of transporter gene families in H. prostrata
  • Compare protein sequences of transporter gene families in H. prostrata to protein sequences from other plant species
  • Quantify the expression of important transporter genes and determine differential expression between roots and leaves of H. prostrata

Materials

Project Directory

/scratch/pawsey0149/bnestor/2021_06_17_Genome for genome assembly and annotation

/scratch/pawsey0149/bnestor/2020_11_20_Transporters for transporter gene family identification and analysis

Input data

Data | Source | Total reads
Hakea Nanopore long-reads | NGS Analysis Results/bnestor/raw_data/nanopore/nanopore.fastq.gz | 6.83 million
Hakea Illumina short-reads | NGS Analysis Results/bnestor/raw_data/illumina/illumina_R1.fastq.gz, NGS Analysis Results/bnestor/raw_data/illumina/illumina_R2.fastq.gz | 1.01 billion
Hakea RNAseq, 11 hydroponics libraries (cluster roots, roots and mature leaf) | NGS Analysis Results/bnestor/raw_data/rnaseq | 493 million
Hakea RNAseq, 15 wild-growing leaf libraries | NGS Analysis Results/bnestor/raw_data/rnaseq | 434 million
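The read totals in the table can be cross-checked directly from the FASTQ files. The following is a minimal sketch, not necessarily how the totals above were obtained; the function name `count_fastq_reads` is hypothetical, and it assumes standard four-line FASTQ records (no wrapped sequence lines).

```shell
# Count reads in a FASTQ stream read from stdin.
# Assumption: every record is exactly 4 lines (header, sequence, '+', qualities).
count_fastq_reads() {
    awk 'END { printf "%.0f\n", NR / 4 }'
}

# Example usage on a gzipped file (file name taken from the table above):
# gzip -dc nanopore.fastq.gz | count_fastq_reads
```

For paired-end data, R1 and R2 need to be counted separately and summed if a combined total is wanted.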

Quality checking data

1. FastQC on Illumina reads

1. Install FastQC v0.11.9 with conda

2. Run FastQC on Illumina reads

   export WD="${MYSCRATCH}/2021_06_17_Genome/0_raw_reads/illumina" && fastqc -t 24 -o $WD/fastqc_out $WD/illumina_R1.fastq $WD/illumina_R2.fastq 

Qualities and sequence duplication levels are OK; they look unusual because of the repeats in plant genomes. According to FastQC, the last 8 bp need to be trimmed (tutorial on FastQC graphs: https://rtsf.natsci.msu.edu/genomics/tech-notes/fastqc-tutorial-and-faq/; note that bins are grouped by 5 bp after position 9 on the x axis).
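Besides the HTML report, FastQC writes a zip archive per input file that contains a tab-separated `summary.txt` (status, module, file name). As a rough sketch, the modules that did not pass can be listed on the command line; the function name `flag_fastqc_modules` is hypothetical, and the zip file name in the usage comment assumes FastQC's default output naming.

```shell
# Print every FastQC module whose status is not PASS.
# Reads the content of summary.txt (tab-separated: status, module, file) on stdin.
flag_fastqc_modules() {
    awk -F '\t' '$1 != "PASS" { print $1 ": " $2 }'
}

# Example usage, assuming FastQC wrote illumina_R1_fastqc.zip into $WD/fastqc_out:
# unzip -p "$WD/fastqc_out/illumina_R1_fastqc.zip" '*/summary.txt' | flag_fastqc_modules
```

A WARN or FAIL here is only a prompt to look at the corresponding graph; as noted above, repeat-rich plant genomes can trigger flags that are not real problems.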

2. FastQC on Nanopore reads

1. Run FastQC on Nanopore reads

   export WD="${MYSCRATCH}/2021_06_17_Genome/0_raw_reads/nanopore" && fastqc -t 24 -o $WD/fastqc_out $WD/nanopore.fastq 

Qualities are in the red, but that is acceptable for Nanopore data because polishing will correct the errors.

3. Trim Illumina reads

1. Install fastp v0.20.1 with conda and trim the reads

   export WD="${MYSCRATCH}/2021_06_17_Genome" && fastp --trim_tail1=7 -i $WD/0_raw_reads/illumina/illumina_R1.fastq -I $WD/0_raw_reads/illumina/illumina_R2.fastq -o $WD/1_QC/1_trim_illumina_out/illumina_R1_trim.fastq -O $WD/1_QC/1_trim_illumina_out/illumina_R2_trim.fastq 

Trimming resulted in 963 million total reads.

2. FastQC of trimmed Illumina reads

   export WD="${MYSCRATCH}/2021_06_17_Genome/1_QC" && fastqc -t 24 -o $WD/3_fastqc_illumina_out $WD/1_trim_illumina_out/illumina_R1_trim.fastq $WD/1_trim_illumina_out/illumina_R2_trim.fastq 

Looks much better. The quality of read 2 dips slightly into the red in the last 5 bp, but that should be OK.

[Figure: FastQC per base sequence quality plot. Quality scores across all bases (Sanger / Illumina 1.9 encoding) against position in read (bp).]
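To put the trimming result in context, the share of reads that survived fastp can be worked out from the two totals reported on this page (1.01 billion raw, 963 million trimmed). This is a small sketch; the function name `percent_retained` is hypothetical, and because both totals are rounded the result is approximate.

```shell
# Percentage of reads retained after trimming.
# $1 = reads before trimming, $2 = reads after trimming.
percent_retained() {
    awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", 100 * after / before }'
}

# Rounded totals from this page: 1.01 billion raw reads, 963 million after fastp.
percent_retained 1010000000 963000000
```

With these rounded inputs, roughly 95% of reads were kept.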

4. Genome size estimation

1. Install jellyfish v2.2.10 with conda

2. Run Jellyfish with 21-mers and 17-mers

   export WD="${MYSCRATCH}/2021_06_17_Genome/1_QC" && jellyfish count -t 24 -s 5G -C -m 21 -o $WD/2_jellyfish_out/21mer_counts.jf $WD/1_trim_illumina_out/illumina_R*_trim.fastq 
   
   #With 17 kmer 
   
   export WD="${MYSCRATCH}/2021_06_17_Genome/1_QC" && jellyfish count -t 24 -s 5G -C -m 17 -o $WD/2_jellyfish_out/17mer_counts.jf $WD/1_trim_illumina_out/illumina_R*_trim.fastq 
   
   #Make histograms (run from $WD/2_jellyfish_out, where the .jf files were written) 
   
   jellyfish histo -t 24 -o 21mer_counts.histo 21mer_counts.jf 
   
   jellyfish histo -t 24 -o 17mer_counts.histo 17mer_counts.jf 

3. Install findGSE v0.1.0 (https://github.com/schneebergerlab/findGSE/blob/master/INSTALL)

   library("findGSE") 
   
   findGSE(histo="21mer_counts.histo", sizek=21, outdir="findGSE_21mer.out") 
   findGSE(histo="17mer_counts.histo", sizek=17, outdir="findGSE_17mer.out") 

21mer estimated size is 808,961,425 bp

17mer estimated size is 683,001,275 bp

Will use the 21mer estimate: k=21 is recommended by tutorials and documentation, it will normalise less, and the estimate is closer to the genome size of a previous H. prostrata short-read genome assembly.

4. GenomeScope v1.0

21mer

Size: 568,186,619 - 568,386,090 bp

Heterozygosity: 1.23834% - 1.24346%

Results: http://genomescope.org/analysis.php?code=rsfivzAEUdZaRTY4eJn1
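As a rough sanity check on the findGSE and GenomeScope estimates above, a naive genome size can be computed straight from a Jellyfish histogram as total k-mers divided by the depth of the main peak. This is a simplified sketch and not the model either tool fits; the function name `estimate_genome_size` and the default low-depth cutoff of 5 are assumptions chosen for illustration.

```shell
# Naive genome size estimate from a Jellyfish histogram (columns: depth, count).
# $1 = histogram file, $2 = minimum depth to include (hypothetical default: 5),
# used to exclude low-depth k-mers that mostly come from sequencing errors.
estimate_genome_size() {
    awk -v min="${2:-5}" '
        $1 >= min {
            total += $1 * $2                    # k-mer instances in this depth bin
            if ($2 > peak_count) {              # track the tallest bin as the main peak
                peak_count = $2
                peak_depth = $1
            }
        }
        END { if (peak_depth > 0) printf "%.0f\n", total / peak_depth }
    ' "$1"
}

# Example usage with the histogram produced earlier:
# estimate_genome_size 21mer_counts.histo 5
```

In a heterozygous genome the tallest peak may be the heterozygous one, which would inflate this estimate, so the number is best treated as an order-of-magnitude check only.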
