Project plan - Siqi-Li-0112/Genome-Analysis GitHub Wiki

Research Goal

This project aims to assemble and annotate the genome of durian. Durian is an economically valuable fruit in Southeast Asia, and studying its genome will help researchers better understand the species, including its evolution and the biosynthetic pathways behind its distinctive smell.

Methods and Software

Reads Preprocessing

This project will use Illumina reads to correct the assembly at a later step, so the Illumina reads need to be quality-checked and trimmed first. FastQC will be used to check read quality, and Trimmomatic will then be used to trim the reads. After trimming, the reads will be checked again with FastQC.
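The check, trim, check sequence described here can be sketched as a small Python helper that only builds the command lines. This is a minimal sketch, not the project's actual script: all file names are hypothetical, the `trimmomatic` launcher name assumes a wrapper script is on the PATH (otherwise the tool is run via `java -jar`), and the trimming steps are the example settings from the Trimmomatic documentation rather than values chosen for this data.

```python
def fastqc_cmd(reads, out_dir):
    """Build a FastQC command that writes its reports to out_dir."""
    return ["fastqc", "--outdir", out_dir, *reads]


def trimmomatic_pe_cmd(reads_1, reads_2, out_prefix, adapters, threads=2):
    """Build a paired-end Trimmomatic command.

    Assumption: a `trimmomatic` wrapper is on the PATH. The step settings
    below are the documented example values, not tuned for durian reads.
    """
    return [
        "trimmomatic", "PE", "-threads", str(threads),
        reads_1, reads_2,
        f"{out_prefix}_1P.fastq.gz", f"{out_prefix}_1U.fastq.gz",
        f"{out_prefix}_2P.fastq.gz", f"{out_prefix}_2U.fastq.gz",
        f"ILLUMINACLIP:{adapters}:2:30:10",
        "LEADING:3", "TRAILING:3", "SLIDINGWINDOW:4:15", "MINLEN:36",
    ]


# Hypothetical input files, used only for illustration.
raw = ["illumina_1.fastq.gz", "illumina_2.fastq.gz"]
trimmed = ["trimmed_1P.fastq.gz", "trimmed_2P.fastq.gz"]

plan = [
    fastqc_cmd(raw, "fastqc_raw"),
    trimmomatic_pe_cmd(raw[0], raw[1], "trimmed", "adapters.fa"),
    fastqc_cmd(trimmed, "fastqc_trimmed"),
]
for cmd in plan:
    print(" ".join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.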

Genome Assembly

This project will use PacBio long reads to assemble the genome, and Canu, which is designed for noisy long reads, is well suited to this kind of data. According to the student manual, this step takes about 17 h on 4 cores, but the manual suggests using only 2 cores, so, assuming the runtime scales roughly linearly with the number of cores, this step may take up to 34 h.
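The 34 h figure follows from treating the job as a fixed number of core-hours. A minimal sketch of that arithmetic, assuming perfectly linear scaling (real assemblies rarely scale this cleanly, so the result is a rough planning number only):

```python
def estimated_hours(reference_hours, reference_cores, available_cores):
    """Scale a reference runtime to a different core count.

    Assumes the total work (core-hours) stays constant, i.e. linear scaling.
    """
    if available_cores <= 0:
        raise ValueError("available_cores must be positive")
    core_hours = reference_hours * reference_cores
    return core_hours / available_cores


# Manual's reference: 17 h on 4 cores; this plan uses 2 cores.
print(estimated_hours(17, 4, 2))  # 34.0
```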

Correct Assembly

The assembled genome will be corrected with the Illumina data. In this step, BWA will be used to map the Illumina reads to the draft assembly, and Pilon will then use the alignments to improve the draft.
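The mapping and polishing steps can be laid out as an ordered list of commands. The sketch below only builds the command lines. It assumes samtools is used to sort and index the alignments (Pilon needs a sorted, indexed BAM file, but this plan does not name the tool), assumes a `pilon` wrapper is on the PATH, and uses hypothetical file names throughout.

```python
def polishing_steps(draft, reads_1, reads_2, prefix, threads=2):
    """Return the polishing steps as (description, pipeline) pairs.

    Each pipeline is a list of argv lists; commands within one pipeline
    are meant to be connected with a pipe.
    """
    bam = f"{prefix}.sorted.bam"
    return [
        ("index draft", [["bwa", "index", draft]]),
        ("map and sort", [
            ["bwa", "mem", "-t", str(threads), draft, reads_1, reads_2],
            # Assumption: samtools does the sorting; "-" reads from stdin.
            ["samtools", "sort", "-@", str(threads), "-o", bam, "-"],
        ]),
        ("index alignments", [["samtools", "index", bam]]),
        ("polish", [
            ["pilon", "--genome", draft, "--frags", bam, "--output", prefix],
        ]),
    ]


# Hypothetical file names, used only for illustration.
steps = polishing_steps(
    "draft.fasta", "trimmed_1P.fastq.gz", "trimmed_2P.fastq.gz", "polished"
)
for description, pipeline in steps:
    print(description + ": " + " | ".join(" ".join(cmd) for cmd in pipeline))
```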

Assembly Evaluation

This project will repeat the original study. The original study had no reference genome, so this project should also evaluate assembly quality without one. QUAST is a good choice here, although the manual gives no reference runtime for it. MUMmerplot is an alternative method.
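One of the reference-free statistics that QUAST reports is N50: the contig length at which contigs of that length or longer cover at least half of the assembly. As an illustration of what the metric measures, here is a minimal sketch that computes it from FASTA lines. The contig names and sequences are made up, and this is not a substitute for running QUAST.

```python
def contig_lengths(fasta_lines):
    """Return the length of each sequence in an iterable of FASTA lines."""
    lengths = []
    current = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the N50 of a non-empty list of contig lengths."""
    if not lengths:
        raise ValueError("no contigs given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Made-up toy assembly with three contigs of 12, 6 and 3 bases.
example = [">contig_1", "ACGTACGT", "ACGT", ">contig_2", "ACGTAC", ">contig_3", "ACG"]
print(n50(contig_lengths(example)))  # 12
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```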

Gene Annotation

The original study used RNA-seq data to annotate genes. To repeat this step, BRAKER is a good choice. The manual gives no reference runtime for this program; since other annotation programs take around 2 h, 4 h will be reserved for this step to be safe.

Differential Expression Analysis

In this step, STAR will first be used to map the RNA-seq reads to the genome. HTSeq will then be used to count the reads that map to each gene, and once the count table is available, DESeq2 will be used to perform the differential expression analysis.
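Between counting and DESeq2 there is a small bookkeeping step: the per-sample `htseq-count` outputs have to be combined into one count table. Each output is a two-column, tab-separated file of gene ID and count, followed by summary rows whose names start with `__` (for example `__no_feature`). A minimal sketch of the merge follows, with hypothetical sample names and gene IDs. DESeq2 itself is an R package, so the sketch stops at the table.

```python
import io


def read_htseq_counts(handle):
    """Parse one htseq-count output, skipping the '__' summary rows."""
    counts = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        gene, count = line.split("\t")
        if gene.startswith("__"):
            continue
        counts[gene] = int(count)
    return counts


def build_count_table(per_sample):
    """Combine {sample: {gene: count}} into a header and gene-wise rows."""
    samples = list(per_sample)
    genes = sorted({gene for counts in per_sample.values() for gene in counts})
    header = ["gene_id", *samples]
    rows = [
        [gene, *(per_sample[sample].get(gene, 0) for sample in samples)]
        for gene in genes
    ]
    return header, rows


# Hypothetical samples and genes, standing in for real htseq-count files.
leaf = io.StringIO("geneA\t10\ngeneB\t0\n__no_feature\t5\n__ambiguous\t1\n")
aril = io.StringIO("geneA\t3\ngeneB\t7\n__no_feature\t2\n")

header, rows = build_count_table({
    "leaf": read_htseq_counts(leaf),
    "aril": read_htseq_counts(aril),
})
print("\t".join(header))
for row in rows:
    print("\t".join(str(value) for value in row))
```

Genes missing from a sample are filled with 0 here. That is a choice made for this sketch; in practice every sample counted against the same annotation should list the same genes.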

Time Schedule and Work Flow

  1. 4.15~4.17 Genome assembly with Canu, with preprocessing of the Illumina reads in parallel
  2. 4.20 Correct assembly
  3. 4.20~4.23 Assembly quality assessment
  4. 4.23~4.28 Genome annotation
  5. 4.29~5.4 RNA-seq read mapping and read counting
  6. 5.5~5.11 Differential expression analysis