List of important bioinformatics problems - smangul1/online.bioinformatics GitHub Wiki

Metagenomics-based microbiome profiling

A recent benchmark of metagenomics tools (https://doi.org/10.1038/nmeth.4458) showed that few bioinformatics methods achieve a good balance between precision and sensitivity. Metagenomics-based profiling therefore remains an important frontier of bioinformatics where improved tools are needed.
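To make the two metrics concrete, here is a minimal sketch of how precision and sensitivity could be computed for a taxonomic profile when the true composition of the sample is known. The function name and the mock community are hypothetical and are not taken from the cited benchmark.

```python
def precision_and_sensitivity(predicted, truth):
    """Compare a predicted set of taxa against the known (ground-truth) set."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    # Precision: fraction of reported taxa that are really present.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Sensitivity (recall): fraction of truly present taxa that were reported.
    sensitivity = true_positives / len(truth) if truth else 0.0
    return precision, sensitivity


# Hypothetical mock community and tool output, for illustration only.
truth = {"E. coli", "B. subtilis", "S. aureus", "P. aeruginosa"}
predicted = {"E. coli", "B. subtilis", "L. monocytogenes"}

precision, sensitivity = precision_and_sensitivity(predicted, truth)
print(f"precision={precision:.2f} sensitivity={sensitivity:.2f}")
```

A tool that reports many taxa tends to gain sensitivity at the cost of precision, and a conservative tool does the opposite, which is why a good balance is hard to reach.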

Methods for RNA-Seq analysis

RNA Sequencing

Before a sample can be sequenced, it must be prepared into a sequencing library. The library consists of short fragments of DNA that represent the input to be sequenced. To prepare a library, RNA extracted from the sample is converted into cDNA molecules, because sequencing machines can only accept DNA molecules. The number of molecules available at the beginning of the experiment is often too small to work with, so the cDNA is amplified by PCR. After amplification, the cDNA molecules are broken into short fragments, because the machine's accuracy decreases with increasing DNA length. The machine then determines the sequence of bases (As, Cs, Gs, and Ts) encoded in the RNA sample, one short fragment at a time; each sequence it reports is called a read. The sequencer, an NGS machine such as an Illumina instrument, outputs the reads in a FASTQ file, which stores the bases of each read together with a quality score for each base.
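As an illustration of what the FASTQ file contains, the sketch below parses the common four-line record layout: a header starting with `@`, the bases, a `+` separator, and one quality character per base. It assumes Phred+33 quality encoding and that each sequence occupies a single line; the function name `read_fastq` and the example record are made up for this page.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality_scores) for each 4-line FASTQ record."""
    while True:
        header = handle.readline().rstrip()
        if not header:  # end of file
            break
        sequence = handle.readline().rstrip()
        handle.readline()  # the '+' separator line, ignored here
        quality = handle.readline().rstrip()
        # Assumed Phred+33 encoding: score = ASCII code of the character - 33.
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores


# A single made-up record, read from memory instead of a file.
example = io.StringIO("@read1\nGATTACA\n+\nIIIIII#\n")
records = list(read_fastq(example))
for name, sequence, scores in records:
    print(name, sequence, scores)
```

For real data, an established parser is a safer choice than hand-written code, since FASTQ files in the wild can deviate from this simple layout.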

Sequencing: RNA-Seq library preparation is the process of creating the short fragments to be sequenced. The steps consist of converting the RNA into cDNA, amplifying the cDNA by PCR so that there is enough material to sequence, and cutting the cDNA into short fragments. The prepared fragments are then sequenced in a next-generation sequencer, which outputs a file containing the read sequences.

Read alignment

After obtaining the reads from the NGS machine, the next step is often read alignment, also called read mapping. In read alignment, a reference genome is given and the goal is to find the position in the genome whose sequence matches the read sequence. Since genomes contain repeats, some reads align equally well to multiple parts of the genome. For such reads, software such as BWA and Bowtie assigns the read to one of the matching regions, also called loci, chosen at random.
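The idea of multi-mapping reads can be sketched with a toy exact-match aligner. This is not how BWA or Bowtie work internally (they use an index based on the Burrows-Wheeler transform and tolerate mismatches); the functions and the tiny genome below are hypothetical and only show why a read from a repeat has no single correct position.

```python
import random


def find_all_matches(genome, read):
    """Return every 0-based position where the read matches the genome exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


def assign_read(genome, read, rng):
    """Pick one locus at random when the read matches several equally well."""
    positions = find_all_matches(genome, read)
    return rng.choice(positions) if positions else None


# Toy genome in which "ACGTTGC" occurs twice (a repeat).
genome = "ACGTTGCAAACGTTGCTT"

print(find_all_matches(genome, "GCAAAC"))   # unique read: one locus
print(find_all_matches(genome, "ACGTTGC"))  # repeat read: two loci
print(assign_read(genome, "ACGTTGC", random.Random(0)))
```

Because the choice among equally good loci is random, downstream analyses need to treat the reported position of a multi-mapping read with care.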

Alternative Splicing

RNA-Seq read alignment is complicated by the fact that genes create multiple RNA transcripts. The central dogma of molecular biology describes how information in DNA is transcribed into RNA and how that information is then used to generate protein through translation. We will describe transcription only as it is relevant to RNA-Seq.

Transcription is the process of using the DNA as a template to create a single strand of RNA. RNA polymerase copies the gene into a form known as pre-mRNA, which covers the full length of the gene, introns included. In the next step, portions of the pre-mRNA called introns are cut out and the remaining segments, called exons, are joined together. The final mRNA, with only the exons spliced together, carries the genetic code that is translated into a protein.
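The two steps described above, transcription followed by splicing, can be sketched on a toy gene. The sequence and exon coordinates are invented for illustration, and the gene is written as its coding strand, so the pre-mRNA is the same sequence with U in place of T.

```python
def transcribe(coding_strand):
    """Pre-mRNA has the coding strand's sequence with U in place of T."""
    return coding_strand.upper().replace("T", "U")


def splice(pre_mrna, exons):
    """Join the exon intervals (0-based, end-exclusive), dropping the introns."""
    return "".join(pre_mrna[start:end] for start, end in exons)


# Toy gene: exon 1, intron 1, exon 2, intron 2, exon 3 (all made up).
gene = "ATGGCC" + "GTAAGTTTTCAG" + "GAATTC" + "GTGAGTCCCTAG" + "TGGTAA"
exons = [(0, 6), (18, 24), (36, 42)]

pre_mrna = transcribe(gene)
mrna = splice(pre_mrna, exons)
print(len(pre_mrna), pre_mrna)
print(len(mrna), mrna)
```

The mature mRNA is much shorter than the pre-mRNA because both introns have been removed.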

One gene may produce more than one mRNA variant through alternative splicing, in which the exons are joined in different combinations, for example by skipping an exon. This allows one gene to produce more than one type of protein. The different mRNA molecules produced are called transcripts or isoforms; the two terms are used interchangeably. This makes it difficult to discover the transcriptome, the full set of isoforms present in an RNA-Seq sample.
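To get a feel for how quickly the number of possible isoforms grows, the sketch below enumerates isoforms under one deliberately simplified model: the first and last exon are always kept, and each internal exon is either included or skipped. Real alternative splicing involves other event types as well, and the exon names here are placeholders.

```python
from itertools import combinations


def exon_skipping_isoforms(exons):
    """List isoforms that keep the first and last exon plus any subset of the
    internal exons, preserving exon order. Requires at least two exons."""
    first, *internal, last = exons
    isoforms = []
    for kept_count in range(len(internal) + 1):
        for kept in combinations(internal, kept_count):
            isoforms.append([first, *kept, last])
    return isoforms


exons = ["E1", "E2", "E3", "E4"]
isoforms = exon_skipping_isoforms(exons)
for isoform in isoforms:
    print("-".join(isoform))
```

Under this model a gene with n exons has 2^(n-2) candidate isoforms, so even a modest gene gives a large search space.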

The Central Dogma of Molecular Biology states that double-stranded DNA is transcribed into single-stranded RNA and that the RNA is translated into protein. A. A gene is a portion of DNA that is transcribed into mRNA molecules. A gene consists of introns, non-coding portions that will be spliced out, and exons, the coding portions that encode protein. B. A pre-mRNA is transcribed from the DNA. This is almost a copy of the gene. C. The pre-mRNA is processed further to produce a mature mRNA. The most significant processing step is the splicing out of introns, which leaves the exons joined together. Different arrangements of exons may be formed in a process called alternative splicing. The arrangements of exons are often depicted by dotted lines connecting the exons.
Although alternative splicing is evolutionarily advantageous, because different proteins can be generated from the same genetic code, it complicates read mapping. In RNA-Seq, the read library contains all the alternatively spliced mRNAs produced by a cell or sample; these are called transcripts or isoforms of the gene. Our goal is to discover the transcriptome, that is, the various alternative splicing arrangements, and often to quantify them by counting the different mRNA arrangements. We may or may not have the original DNA sequence as a reference.
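One concrete way splicing complicates mapping: a read that spans an exon-exon junction exists in the mRNA but not as one contiguous stretch of the genome. The sketch below shows this on a toy gene and then recovers the alignment by splitting the read in two. The function `spliced_alignment` is a hypothetical, brute-force illustration with exact matching only, not the method of any particular aligner.

```python
def spliced_alignment(genome, read):
    """Try every split point of the read. Return (left_pos, right_pos, split)
    for the first split whose two parts match the genome exactly, in order,
    with at least one skipped base (the putative intron) between them."""
    for split in range(1, len(read)):
        left = genome.find(read[:split])
        if left == -1:
            continue
        right = genome.find(read[split:], left + split + 1)
        if right != -1:
            return left, right, split
    return None


# Toy gene: exon 1 (0-6), intron (6-18), exon 2 (18-24); sequences are made up.
genome = "ATGGCC" + "GTAAGTTTTCAG" + "GAATTC"
mrna = "ATGGCC" + "GAATTC"  # spliced transcript, written as cDNA

read = mrna[3:9]  # spans the exon 1 / exon 2 junction
print(read in mrna)    # the read comes from the transcript
print(read in genome)  # but has no contiguous match in the genome
print(spliced_alignment(genome, read))
```

Here the first three bases of the read land at the end of exon 1 and the last three at the start of exon 2, with the intron skipped in between. Short read parts can match in many places, so practical splice-aware alignment needs far more evidence than this sketch uses.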