Method - Siqi-Li-0112/Genome-Analysis GitHub Wiki

Genome Assembly

Draft assembly

The first step is performing a draft assembly. The sequencing data I used in this project is SRR6037732_scaffold_10.fq.gz, which is a PacBio sequencing data for Durio zibethinus: Musang King, scaffold10. I used Canu to assemble the reads. Canu took in 233852 reads and after preprocessing and assembling, output 625 contigs. For detail report from canu please check the link.

Correct assembly

After the draft assembly, the draft genome will be corrected base on Illumina reads. The Illumina paired reads data for scaffold 10 were first tested for quality by FastQC. The FastQC results showed that the reads quality was good and thus no further preprocessing was needed. Then the Illumina reads were mapped to the draft genome by BWA. BWA first built an index based on the draft genome and then mapped the Illumina reads to the genome and generated a sam file. The sam file was then transform to a bam file by samtools. After that I used pilon to align the short reads with the draft genome to improve it, and got the final assembly of a 28.71 MB fasta file.

Assembly quality assessment

The quality of assembly was evaluated by Quast and MUMmer. The Quast results provided a total length of contigs which is 29714639bp, and N50=574089bp. For the MUMmer I first used nucmer to align the corrected genome and the reference genome. The reference genome is the scaffold 10 sequence downloaded from NCBI (NSDW01000010.1). Then I ran MUMmerplot to visualize the result.

Strucutral and functional annotation

For strucutral annotation I used Braker, which annotate the genome with transcriptome sequence data. Before running braker, transcriptome sequence data were checked by quality and trimmed by Trimmomatic. This is the quality report of untrimmedfrom FastQC and this is what it looks like after trimming. The next step is missed in my project plan, simply because I realized I need this step when I actually tried to run Braker. This step is using STAR to map the transcriptome reads to the assembled genome. After STAR generated the bam file, they were passed to Braker. Braker took the pilon-corrected genome and all the bam files from last step as input, and generated gtf files for the following steps. The gtf file contains informations about the location (starting and ending point) of sequence and its structure(exon, intron, etc). For functional annotation, I choosed eggnog which take protein sequence as input and align them against dataset and predict their function. To run eggnog I first use a script found in AUGUSTUS script directory. This script extract the protein coding sequence from gtf file and translate them into protein sequence. Then I upload this protein sequence file to the online eggNOG tools and got the annotaion results

Differential expression analysis

For this part I used HTSeq to count reads of each genes base on RNA-seq data and use DEseq2 to analysis the counts. HTSeq took bam file that aligned RNA-seq data to genome (made by STAR in last step) and gtf file (made by Braker from last step) as input. Before passed the gtf file generated by Braker to HTSeq I made a copy to extract all the rows that have "transcript_id" to prevent error. And I also used samtools to make index for each bam files. Then all the bam files and the copy of gtf file were passed to HTSeq to get their count results. Each count file was generated by corresponding bam file, thus shows the expression level of each sample. I got 8 samples in total, including 5 samples from clade Musang King (1 from leaf, 1 from root, 1 from stem and 1 from aril) and 3 from clade Monthong (all of them from aril). Then DESeq2 was used to analysis the different of expression level between arils from different clades and different parts from Musang King.