2__Bioinformatics - xinshuaiqi/My_books GitHub Wiki
Xinshuai Qi's Summary and Notes on Bioinformatics
-- by Xinshuai Qi
[TOC]
(last update on 12-4-2017)
Transcriptome Assembly
-
Raw reads clean
-
De novo Assembly
- Trinity
- 279 citations since 2011
- Velvet
- 6837 citations since 2008
-
SOAPdenovoTrans
- 359 citations since 2014
-
Reference-based Assembly
- Samtools
- VCFtools
- BCFtools
-
Picard
- tools for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.
Genome assembly
-
- paper
- 2458 citations since 2009
-
- designed for at least two short-read libraries with high coverage; does not support polyploids
- graph-based results
- from the Computational Research and Development group at the Broad Institute
- 773 citations since 2008
PacBio assembly
- FALCON
- HGAP
- developed by PacBio
- PBJelly
improvement
- PILON https://github.com/broadinstitute/pilon/wiki
- Quiver https://github.com/PacificBiosciences/GenomicConsensus
- MUMmer4
evaluation of the quality
Genome
- QUAST
- paper
- published 2013; 879 citations
- BUSCO: Benchmarking Universal Single-Copy Orthologs
- REAPR: focuses on the error rate
- GAGE (Genome Assembly Gold-standard Evaluations)
- data quality is more important than the assembler
- results differ greatly between assemblers
- string-based assemblers
- overlap-layout-consensus (OLC) assemblers
- De Bruijn graph-based assemblers: good for large short-reads dataset
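To make the De Bruijn graph idea in the list above concrete, here is a minimal Python sketch. It is only an illustration: the function name `de_bruijn_graph`, the toy reads, and the choice `k=3` are assumptions, and real assemblers add error correction, graph simplification, and far more compact data structures.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a toy De Bruijn graph.

    Nodes are (k-1)-mers; every k-mer found in a read adds one edge
    from its prefix (k-1)-mer to its suffix (k-1)-mer.
    """
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of length k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two overlapping toy reads (hypothetical data).
reads = ["ACGTC", "CGTCA"]
graph = de_bruijn_graph(reads, k=3)

for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

Edges that appear more than once (here `CG -> GT` and `GT -> TC`) come from k-mers shared by several reads, which is why this representation scales well to large short-read datasets.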
Transcriptome
- transrate
- detonate
polyploid SNP calling
- PolyCat

## Evaluate the quality of assembly
Genome Evolution and Genomics
- OrthoFinder
- Circos
- SnpEff
- Gene Ontology
- LeafJ
- LTR retriever
- Omictools
Phylogenetics
- RAxML
- PAML
- BEAST
- TreeMix
- Detecting trait-dependent evolutionary rate shifts in sequence sites
- [ABBA/BABA test](http://www.popgen.dk/angsd/index.php/Abbababa)
Population Genetics and phylogeography
Population genetics and genomics in R
- STRUCTURE
- PCA and smartPCA
- Provean
- BAD-Mutations
- HAPMIX
- ∂a∂i
- DIY-ABC
- fastSimCoal
- SLiM2
- TCS
- Ecological Niche Modeling
- WorldClim
- SMC++ github
- a program for estimating the size history of populations from whole genome sequence data.
- ABBA-BABA test
- also called the D-statistic
- tests for ancient admixture
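As a rough illustration of the D-statistic mentioned above, the sketch below counts ABBA and BABA site patterns and computes D = (nABBA - nBABA) / (nABBA + nBABA). The function name `d_statistic` and the input format (one four-character string per site, giving the alleles of P1, P2, P3, and the outgroup) are assumptions made for this example. Significance is normally assessed separately, for example with a block jackknife, which is not shown here.

```python
def d_statistic(sites):
    """Compute Patterson's D from per-site alleles.

    Each site is a 4-character string: alleles of P1, P2, P3 and the
    outgroup, in that order. Only biallelic sites are informative.
    """
    abba = 0
    baba = 0
    for p1, p2, p3, outgroup in sites:
        if len({p1, p2, p3, outgroup}) != 2:
            continue  # skip invariant or multi-allelic sites
        if p1 == outgroup and p2 == p3:
            abba += 1  # P2 shares the derived allele with P3
        elif p2 == outgroup and p1 == p3:
            baba += 1  # P1 shares the derived allele with P3
    if abba + baba == 0:
        raise ValueError("no ABBA or BABA sites found")
    return (abba - baba) / (abba + baba)


# Toy data: "A" = ancestral (outgroup) allele, "B" = derived allele.
sites = ["ABBA", "ABBA", "ABBA", "BABA", "AAAA", "BBAA"]
d_value = d_statistic(sites)
print(d_value)
```

A D close to zero is what incomplete lineage sorting alone would produce; a clear excess of one pattern points toward gene flow between P3 and either P1 or P2.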
RNASeq
wiki tools for RNASeq https://en.wikipedia.org/wiki/List_of_RNA-Seq_bioinformatics_tools
FASTQ trimming
- Sickle
- SnoWhite
- Trimmomatic
De novo assembly
- Velvet
- Trinity
- SOAPdenovoTrans
evaluation:
- DETONATE score
- TransRate
- Ultra-conserved elements (UCEs)
RNASeq Course
RNA-Seq aligner => generate SAM file
Differential Expression
-
TopHat
-
Cufflinks
-
- to use Cufflinks downstream, the required flags must be set at run time.
-
- the successor to TopHat
- uses hierarchical indexing with a large set of small indexes, rather than relying on one global index for the genome alone
-
built on Bowtie2
- use FM-index
-
kallisto and sleuth by pachterlab
- HISAT, StringTie and Ballgown
- A replacement of the old TOPHAT and Cufflinks solution.
HISAT vs STAR vs TopHat (https://plus.google.com/+MarkZiemann1/posts/FcoyDzJ7khU): the results are broadly similar
Samtools: SAM to BAM
evaluation of RNA-Seq alignment
- mapped reads %
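One way to obtain the mapped-read percentage listed above is to read the FLAG column of the SAM file. The sketch below is a minimal example under stated assumptions: the function name `mapped_read_percentage` is made up, the input is plain uncompressed SAM text, and only primary records are counted (secondary and supplementary alignments are skipped), which is one of several possible conventions.

```python
def mapped_read_percentage(sam_lines):
    """Return the percentage of primary SAM records that are mapped.

    FLAG bits used: 0x4 = segment unmapped, 0x100 = secondary
    alignment, 0x800 = supplementary alignment.
    """
    total = 0
    mapped = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if flag & 0x900:
            continue  # count each read once: skip secondary/supplementary
        total += 1
        if not flag & 0x4:
            mapped += 1
    return 100.0 * mapped / total if total else 0.0


# Toy SAM content (hypothetical reads).
sam_lines = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r1\t256\tchr2\t50\t0\t4M\t*\t0\t0\tACGT\tIIII",
]
percentage = mapped_read_percentage(sam_lines)
print(f"{percentage:.1f}% mapped")
```

For a real file, the same function can be fed an open file handle, for example `mapped_read_percentage(open("aln.sam"))`, where `aln.sam` is a placeholder path.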
Functional analysis (enrichment, co-expression)
- Functional visualization --Guangchuang Yu UHK
- clusterProfiler
eQTL
Enrichment and Coexpression
- DAVID
- Gene Set Enrichment Analysis (GSEA)
- WGCNA (https://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/)
eQTL
Alternative splicing
-
classification
- Skipped exon
- A-B-C
- A-_-C
- Alternative 5' splice site
- A-C
- B-C
- Alternative 3' splice site
- A-B
- A-C
- Mutually exclusive exons
- A-B-D
- A-C-D
- Retained intron
- A-B
- A-(intron)-B
-
Steps:
- exon reads GDE
- isoforms GDE
- junctions
Tools
- RSEM
- aligns reads to transcripts using Bowtie
- outputs isoform-level expression estimates
- DEXSeq
- Cufflinks
- MATS
- SpliceR
Application of RNA-Seq in Diagnostics
Translating RNA sequencing into clinical diagnostics: opportunities and challenges
Examples using RNA-Seq for Diagnosis:
New Directions of RNA-Seq analysis
-
RNA-Seq in different tissue
-
Different time
-
single-cell RNA-Seq
-
integrate RNA-Seq with GWAS
# Quantitative Genetics, GWAS, and Statistics
- PLINK ([tutorial](http://zzz.bwh.harvard.edu/plink/tutorial.shtml#t6))
-
Candidate Gene Association Study
The candidate gene approach to conducting genetic association studies focuses on associations between genetic variation within pre-specified genes of interest and phenotypes or disease states. This is in contrast to genome-wide association studies (GWAS), which scan the entire genome for common genetic variation.
- ascertainment bias: make sure to use clearly defined phenotypes for cases and controls
- population stratification:
- subtle ancestral differences between cases and controls can lead to a gene~ethnicity association
- using Principal Components Analysis (PCA) as a surrogate for genetic ancestry
- adjusting for principal components of genetic ancestry
- gender, environment
- Bonferroni correction (5 × 10^-7)
- Standard Bonferroni correction
- Test each SNP at the α* = α / m1 level
- where m1 = number of markers tested
- Assuming m1 = 500,000, the Bonferroni-corrected threshold is α* = 0.05 / 500,000 = 1 × 10^-7
- Conservative when the tests are correlated
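The Bonferroni arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `bonferroni_threshold` and the SNP names and p-values are invented for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level alpha* = alpha / m1."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# The example from the notes: alpha = 0.05, m1 = 500,000 markers.
threshold = bonferroni_threshold(0.05, 500_000)
print(threshold)

# Hypothetical p-values, used only to show how the threshold is applied.
p_values = {"snp_a": 3e-9, "snp_b": 2e-6, "snp_c": 0.04}
significant = [snp for snp, p in p_values.items() if p < threshold]
print(significant)
```

Because the correction treats all m1 tests as independent, correlated SNPs (those in linkage disequilibrium) make the threshold stricter than necessary, which is the conservativeness noted above.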
- HWE: For a rare disease (or no/modest genetic effects), genotype frequencies in controls should (nearly) follow HWE
- imputation: using LD and HapMap/1000 Genomes to impute untyped SNPs
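The HWE check on controls can be sketched as a one-degree-of-freedom chi-square test on genotype counts. The function name `hwe_chi_square` and the example counts below are assumptions for illustration; for small samples or rare alleles an exact test is often preferred over this approximation.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test for Hardy-Weinberg equilibrium at one biallelic SNP.

    Returns (chi2, p_value), using 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Controls in perfect HWE proportions (hypothetical counts).
chi2_ok, p_ok = hwe_chi_square(25, 50, 25)
# Controls with no heterozygotes at all, a strong departure from HWE.
chi2_bad, p_bad = hwe_chi_square(50, 0, 50)
print(chi2_ok, p_ok)
print(chi2_bad, p_bad)
```

A very small p-value in controls is commonly treated as a sign of genotyping error or population structure, and such SNPs are usually filtered before association testing.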