2__Bioinformatics - xinshuaiqi/My_books GitHub Wiki

Xinshuai Qi's Summary and Notes on Bioinformatics

-- by Xinshuai Qi


(last update on 12-4-2017)

Transcriptome Assembly

  • Raw read cleaning

  • De novo Assembly

    • Trinity
      • 279 citations since 2011
    • Velvet
      • 6837 citations since 2008
    • SOAPdenovoTrans
      • 359 citations since 2014
  • Reference-based Assembly

    • Samtools
    • VCFtools
    • BCFtools
    • Picard
      • manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.

Genome assembly

  • ABySS

    • paper
    • 2458 citations since 2009
  • ALLPATHS-LG

    • designed for at least two short-read libraries at high coverage; does not support polyploid genomes
    • graph-based results
    • from the Computational Research and Development group at the Broad Institute
    • 773 citations since 2008

PacBio assembly

  • FALCON
  • HGAP
    • developed by PacBio
  • PBJelly

improvement

MUMmer4

MUMmer3 manual

mummerplots with ggplot2

evaluation of the quality

Genome

paper: A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies

  • string-based assemblers
  • overlap-layout-consensus (OLC) assemblers
  • De Bruijn graph-based assemblers: good for large short-read datasets
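
To make the De Bruijn graph idea above concrete, here is a minimal Python sketch. It is only an illustration, not how any particular assembler is implemented: the function name `de_bruijn_graph` and the toy reads are assumptions, and real assemblers add error correction, reverse complements, and graph simplification.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a toy De Bruijn graph (hypothetical helper, not from any assembler).

    Nodes are (k-1)-mers; each k-mer found in a read adds one edge
    from its prefix (first k-1 bases) to its suffix (last k-1 bases).
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Toy input: two overlapping reads, k = 3.
toy_graph = de_bruijn_graph(["ACGT", "CGTA"], 3)
```

Because every read is broken into fixed-size k-mers, the graph size depends on the number of distinct k-mers rather than the number of reads, which is one reason this approach suits large short-read datasets.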

Transcriptome

  • transrate
  • detonate

polyploid SNP calling

PolyCat

evaluate the quality of assembly

Genome Evolution and Genomics

Phylogenetics

Population Genetics and phylogeography

Population genetics and genomics in R

  • STRUCTURE
  • PCA and smartPCA
  • Provean
  • BAD-Mutations
  • HAPMIX
  • ∂a∂i
  • DIY-ABC
  • fastSimCoal
  • SLiM2
  • TCS
  • Ecological Niche Modeling
    • WorldClim
  • SMC++ github
    • a program for estimating the size history of populations from whole genome sequence data.
  • ABBA-BABA test
    • also called the D-statistic
    • tests for ancient admixture
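
The D-statistic from the ABBA-BABA test can be sketched as follows. This is a simplified illustration: it assumes one haploid allele per taxon at each biallelic site, given in the order (P1, P2, P3, outgroup), and the function name `d_statistic` is hypothetical. Significance testing is not shown.

```python
def d_statistic(sites):
    """Compute D = (ABBA - BABA) / (ABBA + BABA).

    `sites` is an iterable of (p1, p2, p3, outgroup) alleles.
    The outgroup allele is treated as ancestral ("A").
    """
    abba = 0
    baba = 0
    for p1, p2, p3, out in sites:
        if p1 == out and p2 == p3 and p2 != out:
            abba += 1  # pattern A B B A: P2 shares the derived allele with P3
        elif p2 == out and p1 == p3 and p1 != out:
            baba += 1  # pattern B A B A: P1 shares the derived allele with P3
    total = abba + baba
    if total == 0:
        raise ValueError("no ABBA or BABA sites found")
    return (abba - baba) / total


# Toy data (made up): three ABBA sites, one BABA site, one uninformative site.
toy_sites = [("A", "G", "G", "A")] * 3 + [("G", "A", "G", "A"), ("A", "A", "A", "A")]
toy_d = d_statistic(toy_sites)
```

Under incomplete lineage sorting alone, ABBA and BABA counts are expected to be equal, so D should be near 0; an excess of one pattern suggests gene flow between P3 and either P1 or P2.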

RNASeq

wiki tools for RNASeq https://en.wikipedia.org/wiki/List_of_RNA-Seq_bioinformatics_tools

RNASeq Course

FASTQ trimming

De novo assembly

  • Velvet
  • Trinity
  • SOAPdenovoTrans

evaluation:

  • DETONATE score
  • TransRate
  • Ultra-conserved elements (UCEs)

RNA-Seq aligner => generate SAM file RNASeq Course

Differential Expression


HISAT vs STAR vs TopHat (https://plus.google.com/+MarkZiemann1/posts/FcoyDzJ7khU): they perform about the same

Samtools: SAM to BAM
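
A minimal sketch of the SAM to BAM step, written as a Python helper that only builds the samtools command lines. The helper name `sam_to_sorted_bam_commands` and the output file naming are assumptions; the sorting and indexing steps are common follow-ups, not something these notes prescribe.

```python
def sam_to_sorted_bam_commands(sam_path, prefix):
    """Return samtools commands (as argument lists) for SAM -> sorted, indexed BAM.

    Output names (<prefix>.bam, <prefix>.sorted.bam) are an illustrative choice.
    """
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    return [
        ["samtools", "view", "-b", "-o", bam, sam_path],  # convert SAM to BAM
        ["samtools", "sort", "-o", sorted_bam, bam],      # coordinate-sort the BAM
        ["samtools", "index", sorted_bam],                # build the BAM index
    ]


commands = sam_to_sorted_bam_commands("sample.sam", "sample")
# To actually run them (requires samtools on PATH):
#   import subprocess
#   for cmd in commands:
#       subprocess.run(cmd, check=True)
```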

evaluation of RNA-Seq alignment

  • mapped reads %
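
The mapped-read percentage can be estimated directly from the FLAG column of a SAM file. The sketch below is one possible way to do it, with assumptions: it counts each primary alignment record once, skips secondary (0x100) and supplementary (0x800) records, and the function name `mapped_read_percentage` is hypothetical.

```python
def mapped_read_percentage(sam_lines):
    """Percentage of primary SAM records whose 'unmapped' flag bit (0x4) is not set."""
    total = 0
    mapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        if flag & 0x100 or flag & 0x800:
            continue  # skip secondary and supplementary alignments
        total += 1
        if not flag & 0x4:
            mapped += 1
    return 100.0 * mapped / total if total else 0.0


# Toy SAM records (made up for illustration).
toy_sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r1\t256\tchr2\t300\t0\t4M\t*\t0\t0\tACGT\tIIII",
    "r4\t0\tchr1\t400\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
toy_pct = mapped_read_percentage(toy_sam)
```

For real data the file would be read line by line, for example `with open("sample.sam") as fh: mapped_read_percentage(fh)`, where the file name is a placeholder.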

Functional analysis (enrichment, co-expression)

eQTL

Enrichment and Coexpression

eQTL

Alternative splicing

  • classification

    • Skipped exon
      • A-B-C
      • A-_-C
    • Alternative 5' splice site
      • A-C
      • B-C
    • Alternative 3' splice site
      • A-B
      • A-C
    • Mutually exclusive exons
      • A-B-D
      • A-C-D
    • Retained intron
      • A-B
      • A-(intron)-B
  • Steps:

    • exon reads GDE
    • isoforms GDE
    • junctions
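
The five event classes listed above can be told apart by comparing the exon coordinates of two isoforms. The following is a rough sketch under stated assumptions: both isoforms are on the plus strand, each is a sorted list of (start, end) exon tuples, and the isoforms differ by exactly one event. The function name `classify_event` is hypothetical and this is not how any specific splicing tool works.

```python
def classify_event(iso1, iso2):
    """Classify the single splicing difference between two plus-strand isoforms."""
    a, b = sorted((list(iso1), list(iso2)), key=len, reverse=True)
    if len(a) == len(b) + 1:
        # Skipped exon: dropping one internal exon of `a` yields `b` (A-B-C vs A-_-C).
        for i in range(1, len(a) - 1):
            if a[:i] + a[i + 1:] == b:
                return "skipped exon"
        # Retained intron: two adjacent exons of `a` are fused into one exon in `b`.
        for i in range(len(a) - 1):
            merged = (a[i][0], a[i + 1][1])
            if a[:i] + [merged] + a[i + 2:] == b:
                return "retained intron"
    if len(a) == len(b):
        diffs = [i for i in range(len(a)) if a[i] != b[i]]
        if len(diffs) == 1:
            i = diffs[0]
            (s1, e1), (s2, e2) = a[i], b[i]
            if s1 == s2:
                return "alternative 5' splice site"  # same start, different donor end
            if e1 == e2:
                return "alternative 3' splice site"  # same end, different acceptor start
            if 0 < i < len(a) - 1 and (e1 < s2 or e2 < s1):
                return "mutually exclusive exons"    # non-overlapping internal exons
    return "unclassified"


# Toy exons (coordinates made up for illustration).
A, B, C, D = (100, 200), (300, 400), (500, 600), (700, 800)
skipped = classify_event([A, B, C], [A, C])
```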

Tools

Application of RNA-Seq in Diagnostics

Translating RNA sequencing into clinical diagnostics: opportunities and challenges

Examples using RNA-Seq for Diagnosis:

New Directions of RNA-Seq analysis

  • RNA-Seq in different tissue

  • Different time

  • single-cell RNA-Seq

  • integrate RNA-Seq with GWAS

Quantitative Genetics, GWAS, and Statistics

  • PLINK (tutorial: http://zzz.bwh.harvard.edu/plink/tutorial.shtml#t6)

The candidate gene approach to conducting genetic association studies focuses on associations between genetic variation within pre-specified genes of interest and phenotypes or disease states. This is in contrast to genome-wide association studies (GWAS), which scan the entire genome for common genetic variation.


  • ascertainment bias: make sure to use clearly defined phenotypes for cases and controls
  • population stratification:
    • subtle ancestral differences between cases and controls can produce a spurious gene~ethnicity association
      • Using Principal Components Analysis (PCA) as a Surrogate for Genetic Ancestry
      • adjusting for principal components of genetic ancestry.
    • gender, environment
  • Bonferroni correction: 5 × 10^-7
    • Standard Bonferroni correction
    • Test each SNP at the α* = α/m1 level
    • Where m1 = number of markers tested
    • Assuming m1 = 500,000, a Bonferroni-corrected threshold of α* = 0.05/500,000 = 1 × 10^-7
    • Conservative when the tests are correlated
  • HWE: For a rare disease (or no/modest genetic effects), genotype frequencies in controls should (nearly) follow HWE
  • imputation: Using LD and HapMap/1000 Genomes to Impute Untyped SNPs
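
Two of the checks above, the Bonferroni threshold and the HWE test in controls, are easy to sketch in Python. This is an illustrative example only: the function names are hypothetical, the genotype counts are made up, and the HWE check uses a plain one-degree-of-freedom chi-square test (other tests, such as exact tests, are also used in practice).

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level: alpha* = alpha / m1."""
    return alpha / n_tests


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic comparing observed genotype counts with HWE expectations."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def chi_square_p_value_1df(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


# Threshold from the notes: alpha = 0.05, m1 = 500,000 markers.
threshold = bonferroni_threshold(0.05, 500_000)

# Made-up control genotype counts (AA, AB, BB) with a deficit of heterozygotes.
stat = hwe_chi_square(30, 40, 30)
p_value = chi_square_p_value_1df(stat)
```

In a GWAS quality-control step, SNPs whose controls deviate strongly from HWE are often flagged for possible genotyping error; the cutoff used for that flagging is a study-specific choice.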

Phenotyping

PlantCV