0. Project plan - Sara-SL/GenomeAnalysis GitHub Wiki

Aim of the project

The aim of this study is to re-analyze the data used in the article "Transcriptomic and epigenomic characterization of the developing bat wing" in a similar way to the authors, and to re-evaluate their biological conclusions. The article aims to find the genetic determinants that shape bat wings by interpreting the molecular events that underlie bat wing development. The article covers many analyses that will not be feasible in this project because of limited resources and time. This study will therefore cover only some of the basic analyses and will not be as extensive as the article.

Questions to answer in this study:

Is the quality of the Mnat.v1 genome comparable to that of the high-coverage bat genomes?
Is Mnat.v1 a reliable substrate for subsequent genomic analyses?
Which gene expression differences could be involved in the morphological divergence in bat limb development?
How does the transcriptome differ between bat forelimb and hindlimb?
How does the transcriptome change between the stages CS15, CS16 and CS17?

Questions for extra analysis:

What is the heterozygosity level?
What percentage of the genome is made up of repetitive regions?
Can we find potential lncRNAs associated with bat limb development?
How many potential lncRNAs can we find?
How many known lncRNAs show differential expression between forelimb and hindlimb?
What characteristics do they have?
Can we identify regulatory elements that could be involved in controlling gene expression in developing bat limbs?

Type of data analyses

To answer these questions, I will perform the basic analyses (see below) and, if there is time, as many of the extra analyses as possible.

Basic analyses

Reads preprocessing: trimming + quality check (before and after)
Genome assembly of Illumina reads.
Assembly quality assessment.
Transcriptome assembly.
Structural and functional annotation.
Differential expression analyses.

Extra analyses

Mapping of ChIP-seq reads and visualization.
Calculate the heterozygosity of the genome.
Analyses of long non-coding RNAs.
Assembly with different parameters/software.
Deeper analyses of differential expression analyses: e.g. different comparisons.

Workflow

1. Reads preprocessing: trimming + quality check (before and after)

I have been given a subset of the whole genome sequencing data (wgs_data), the RNA sequencing data (rna_seq_data) and the ChIP sequencing data used in the article. To see whether the data has any problems that I need to be aware of before further analysis, I will start by quality checking the wgs_data and the rna_seq_data using the software FastQC (estimated running time ~ 3 min per set). All data except two rna_seq libraries (paired) have been trimmed beforehand. I therefore expect most of the data to be of good quality, but I might need to trim the raw data.

For data that is not of good quality, trimming might be necessary. In that case I will trim the data using the Trimmomatic software (estimated running time ~ 1-10 min per file (2 cores)). Then I will quality check again with FastQC to see if further trimming is needed.
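The quality-check and trimming steps could be scripted roughly as follows. This is a minimal sketch, not the commands actually used in the project: the file names, the output prefix and the Trimmomatic step values (adapter file, quality thresholds) are hypothetical placeholders, and it assumes that `fastqc` and a `trimmomatic` wrapper script are available on the PATH. The functions only build the command lines as lists.

```python
from pathlib import Path


def fastqc_cmd(fastq_files, out_dir, threads=2):
    """Build a FastQC command line for one or more FASTQ files."""
    return ["fastqc", "--outdir", str(out_dir), "--threads", str(threads),
            *[str(f) for f in fastq_files]]


def trimmomatic_pe_cmd(r1, r2, out_prefix, threads=2):
    """Build a paired-end Trimmomatic command line.

    The trimming steps are illustrative values (assumption),
    not settings taken from the project plan.
    """
    prefix = Path(out_prefix)
    # Output order expected by Trimmomatic PE:
    # forward paired, forward unpaired, reverse paired, reverse unpaired.
    outputs = [f"{prefix}_{tag}.fastq.gz" for tag in ("1P", "1U", "2P", "2U")]
    return ["trimmomatic", "PE", "-threads", str(threads), str(r1), str(r2),
            *outputs,
            "ILLUMINACLIP:adapters.fa:2:30:10",  # hypothetical adapter file
            "LEADING:3", "TRAILING:3", "SLIDINGWINDOW:4:15", "MINLEN:36"]
```

Passing one of these lists to `subprocess.run(cmd, check=True)` would execute it. Running the FastQC command once before and once after trimming gives the before/after comparison described above.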

2. Genome assembly of Illumina reads. + 3 Assembly quality assessment.

When I am satisfied with the quality of my data I will proceed by assembling it, that is, piecing the reads together into continuous sequences representing the genome or the RNA transcripts. I will start by assembling the wgs_data using SOAPdenovo (estimated running time ~ 1.5 h (2 cores)). To see how good the assembly is, I will evaluate it using MUMmerplot (estimated running time < 5 min (1 core)). A good assembly will have few scaffolds and give a linear plot. If the result is not good I will consider doing a new genome assembly using the software SPAdes (estimated running time ~ 1-7 days depending on the scaffold (6 cores)). I will use MUMmerplot to evaluate the result from this assembly as well.
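Alongside the MUMmerplot evaluation, the "few scaffolds" criterion can be checked with simple contiguity numbers. The sketch below counts sequences in a FASTA file and computes the N50. N50 is not mentioned in the plan and is added here only as an assumed, commonly used complement; the function names are hypothetical.

```python
def fasta_lengths(path):
    """Yield the length of each sequence in a FASTA file."""
    length = None
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                if length is not None:
                    yield length
                length = 0
            elif length is not None:
                length += len(line.strip())
    if length is not None:
        yield length


def n50(lengths):
    """Return the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths given")
```

For example, `n50(fasta_lengths("assembly.fasta"))` (hypothetical file name) gives one number that can be compared between the SOAPdenovo assembly and a possible SPAdes assembly.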

4. Transcriptome assembly. + 3 Assembly quality assessment.

Next I will assemble the rna_seq data using the software Trinity (estimated running time ~ 1.5 h (2 cores)) and evaluate the assembly using MUMmerplot. RNA assembly is more complicated than DNA assembly, but it is necessary for investigating differential expression between forelimb and hindlimb and across the three stages.

5. Structural and functional annotation.

Once I have continuous sequences from the assemblies I will annotate the genome to make sense of it. I will use MAKER2 (estimated running time: first round ~ 3.5 h (4 cores), following rounds ~ 30-60 min (4 cores)) to predict genes (structural annotation) and the online tool eggNOG-mapper (estimated running time ~ 1 h (HMM algorithm)) for functional annotation.

6. Differential expression analyses.

To be able to do differential expression analysis I have to align the RNA-seq reads to the genome assembly. I will do this using the software TopHat (estimated running time ~ 5-30 min per file (2 cores)). For the differential expression analysis I will use the software HTSeq to count the reads per gene (estimated running time < 5 min (1 core)). From this analysis I hope to find genes that are differentially expressed between forelimb and hindlimb, since this could indicate that a gene is involved in the development of the wings. I also hope to find genes that are differentially expressed between the three stages, to understand the different steps in bat wing development.
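To make concrete what a forelimb versus hindlimb comparison computes, the sketch below normalizes raw counts to counts per million (CPM) and takes a log2 fold change per gene. It is a descriptive illustration only, under the assumption of one count table per tissue; a real differential expression test needs replicates and a statistical model. The gene names and counts are invented.

```python
import math


def cpm(counts):
    """Scale the raw read counts of one library to counts per million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("library has no reads")
    return {gene: 1e6 * n / total for gene, n in counts.items()}


def log2_fold_change(forelimb, hindlimb, pseudocount=1.0):
    """Return log2(forelimb / hindlimb) per gene on CPM values.

    The pseudocount avoids division by zero for unexpressed genes.
    """
    fore, hind = cpm(forelimb), cpm(hindlimb)
    genes = sorted(set(fore) | set(hind))
    return {
        gene: math.log2((fore.get(gene, 0.0) + pseudocount)
                        / (hind.get(gene, 0.0) + pseudocount))
        for gene in genes
    }


# Hypothetical counts, for illustration only (not project data).
forelimb_counts = {"geneA": 300, "geneB": 700}
hindlimb_counts = {"geneA": 100, "geneB": 900}
lfc = log2_fold_change(forelimb_counts, hindlimb_counts)
```

A positive value means higher relative expression in forelimb, a negative value higher relative expression in hindlimb. The same function can be applied to any pair of stages.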

Estimated time & bottlenecks

Since I have never used any of these tools before, the estimated running times do not reflect how much time each step will take. Most of the time will be spent on understanding the software and how to handle the results. With that in mind, genome assembly and annotation are the two bottlenecks I need to consider, since they have long running times. To make sure I have time to do all basic analyses before the end of the course, I have set some deadlines to follow.

Time framework

Deadlines for some of the analyses:

17-04-2020 - Genome assembly

28-04-2020 - Transcriptome Assembly

05-05-2020 - Annotation

08-05-2020 - RNA mapping

I plan to have finished running all the software by 11-05-2020 to have time to analyze the data.

Type of data

I will start the project with a subsample of the following read files:

6 Whole genome sequence libraries (BioProject PRJNA283550)

Bat_400
Bat_800
Bat_175
Bat_2to4kb
Bat_5to6kb
Bat_8to10kb

18 RNA-seq libraries (SRA accession code SRP051253)

3 individuals
Forelimb, hindlimb
CS15, CS16, CS17

For extra analysis:

18 ChIP-seq libraries (SRA accession code SRP051267)

Forelimb, hindlimb
CS15, CS16, CS17
Input, H3K27ac, H3K27me3

The rest of the data will be output from each analysis. I will regularly check how much storage I have left (max 32 GB on UPPMAX), and if I am running out of space I will go through my files and see what can be done, for example removing files that are no longer needed.
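A minimal sketch of such a storage check, assuming all project files live under one directory and that the limit above means 32 gigabytes. It sums file sizes itself and does not query the cluster's own quota system.

```python
import os

# Assumption: the storage limit in the plan is 32 gigabytes (decimal).
QUOTA_BYTES = 32 * 10**9


def directory_size(root):
    """Return the total size in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symbolic links, e.g. links to shared raw data.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def quota_used_fraction(root, quota_bytes=QUOTA_BYTES):
    """Return the fraction of the quota used by files under root."""
    return directory_size(root) / quota_bytes
```

Calling `quota_used_fraction(".")` from the project directory after each large step gives an early warning before the limit is reached.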

Organization of data

For all metadata, I will have one sample per row and one variable per column, and I will save the data as .csv files.
I will keep my code and data separated in different repositories.
I will use unique, informative names when naming my files. Since I have never used these tools before, I don’t know exactly what each software will output. This makes it difficult to know from the start how best to name my files, but I will try to be as consistent as possible. I will also review my naming as I go to see if I can improve the naming structure.
I will put numbers at the beginning of file and folder names to keep track of the order in which I have done the different steps.
I will compress large data files such as FASTQ and SAM files.
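The metadata layout described above (one sample per row, one variable per column) can be sketched for the RNA-seq libraries as follows. It assumes the 18 libraries are the full crossing of 3 individuals, 2 tissues and 3 stages; the individual labels and the `sample_id` naming scheme are hypothetical.

```python
import csv
import io
from itertools import product

INDIVIDUALS = ("ind1", "ind2", "ind3")  # hypothetical labels
TISSUES = ("forelimb", "hindlimb")
STAGES = ("CS15", "CS16", "CS17")
FIELDS = ("sample_id", "stage", "tissue", "individual")


def rna_seq_metadata():
    """Return one metadata row per RNA-seq library (3 x 2 x 3 = 18)."""
    rows = []
    combos = product(STAGES, TISSUES, INDIVIDUALS)
    for number, (stage, tissue, individual) in enumerate(combos, start=1):
        rows.append({
            # Numbered prefix keeps the samples in a fixed order.
            "sample_id": f"{number:02d}_{stage}_{tissue}_{individual}",
            "stage": stage,
            "tissue": tissue,
            "individual": individual,
        })
    return rows


def to_csv_text(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Writing the result of `to_csv_text(rna_seq_metadata())` to a file gives a table that later steps can read to look up the tissue and stage of each library.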