2 Info about the data and ideas about a ChIP seq analysis pipeline (20171203) - jlanillos/ChIPseq GitHub Wiki
This entry covers data and project organization, read trimming and quality control (QC), and a survey of mapping and ChIP-seq analysis pipelines:
Data
- Illumina single-end sequencing, ~40M reads per experiment (one FASTQ file each). Throughout the project I will work with 3 files: 2 replicates of ER-alpha (MCF7 cell line) and 1 control sample.
- Renamed the sample files for readability: control, sample1, sample2 (the original names are listed in SampleID.txt).
Reference Genome Download Source: GRCh38.p11
Data trimming and quality check
- Trim Galore! I ran Trim Galore to trim the data. For high-throughput sequencing data, quality control is always recommended: low-quality base calls and adapter contamination can impair mapping (and, for bisulfite data, cause incorrect methylation calls, which is not a concern for ChIP-seq).
Installation (release 0.4.5):
wget "https://github.com/FelixKrueger/TrimGalore/archive/0.4.5.zip"
On UPPMAX it is available as a module: module add TrimGalore
- FastQC: I checked the quality of the data with FastQC (available at UPPMAX) and reviewed the report.
Installation: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/INSTALL.txt
Usage: ./fastqc file_location/file_name.fq. Output: an HTML report and a .zip archive with the quality results (~1 min per file).
On UPPMAX: module add FastQC
- MultiQC (http://multiqc.info/): I then created a single report covering all the samples analyzed by the team with MultiQC.
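The trimming and QC steps above can be sketched as two small shell functions. This is a minimal sketch, not a record of the exact commands I ran: the function names `trim_and_qc` and `summarize_qc`, the `.fq` extension and the `qc` output directory are assumptions for illustration. It relies on Trim Galore's `--fastqc` option, which runs FastQC on the trimmed reads, and on `multiqc` scanning a directory for any reports it recognizes.

```shell
# Sketch only: function names, file extension and output directory are assumptions.

# Trim adapters and low-quality ends from one single-end FASTQ file,
# then let Trim Galore run FastQC on the trimmed reads (--fastqc).
trim_and_qc() {      # usage: trim_and_qc <reads.fq> <outdir>
    mkdir -p "$2"
    trim_galore --fastqc --output_dir "$2" "$1"
}

# Collect the reports found under <dir> into a single MultiQC report.
summarize_qc() {     # usage: summarize_qc <dir>
    multiqc --outdir "$1" "$1"
}
```

With the renamed files, a run could look like `for s in control sample1 sample2; do trim_and_qc "$s.fq" qc; done; summarize_qc qc`.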
Explore ChIP-seq data pipelines
A) A clear description of the basic steps to analyze ChIP-seq data (one possible approach):
- One, align your fastq reads to the genome (BWA and Bowtie2 were considered)
Note: "from Bowtie2 summary statistics...If alignment percentage is low (I'd say < 90% or 80%), something might be off with your data. This can be that your reads need adapter trimming --> check the quality of your Fastq file with FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/"
- Two, filter your SAM file to remove non-uniquely aligned reads, keeping those with two or fewer mismatches: samtools
- Three, convert your SAM file to a BAM file and sort it:
"...it's a lot easier to work with these smaller binary BAM files than with the huge text SAM files."
SAM --> convert to BAM --> sort the BAM (by chromosome)
- Four, remove potential PCR duplicates from your BAM file: samtools rmdup, PicardTools
"...multiple reads map to the same location in the genome, keep only one. This removes all potential PCR duplicates..."
We got the final BAM file :)
- Five, index your BAM file, if necessary: samtools index
"...to load your BAM file in IGV browser or UCSC browser..."
- Six, convert your BAM file into a BED file, if desired: bedtools bamtobed
"...BED files are simple text, tab-delimited files defining genomic coordinates. BED files are bigger files but can be easier to analyze and manipulate..."
- Seven, call peaks with a peak finder: MACS2, QuEST, JAMM, HOMER...
Source: https://github.com/mahmoudibrahim/JAMM/wiki/ChIP-Seq-Alignment-and-Processing-Pipeline
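The seven steps above can be sketched as shell functions for one single-end sample. This is a rough sketch under assumptions, not the pipeline from the source: the function names, the 4 threads, and the choice of Bowtie2 for alignment and MACS2 (with `-g hs` for human) for peak calling are all illustrative. The filter in `filter_unique` relies on Bowtie2's optional SAM tags `XM:i` (number of mismatches) and `XS:i` (present only when a second alignment was found), so it would not carry over to BWA output as is.

```shell
# Sketch only: function names, thread count and file names are assumptions.

# One: align single-end reads; Bowtie2 prints its summary on stderr.
align() {            # usage: align <index_prefix> <reads.fq> <out.sam> <summary.log>
    bowtie2 -p 4 -x "$1" -U "$2" -S "$3" 2> "$4"
}

# Print the overall alignment rate (without the % sign) from a Bowtie2 summary.
alignment_rate() {   # usage: alignment_rate <summary.log>
    awk '/overall alignment rate/ { sub(/%/, "", $1); print $1 }' "$1"
}

# Succeed only if the alignment rate is at least <min_percent> (e.g. 80).
rate_ok() {          # usage: rate_ok <summary.log> <min_percent>
    awk -v r="$(alignment_rate "$1")" -v m="$2" 'BEGIN { exit !(r + 0 >= m + 0) }'
}

# Two: keep header lines plus reads with <= 2 mismatches (XM:i) and
# no second alignment (no XS:i tag), i.e. uniquely aligned reads.
filter_unique() {    # usage: filter_unique <in.sam> <out.sam>
    awk '/^@/ { print; next } /XM:i:[012]([^0-9]|$)/ && !/XS:i:/' "$1" > "$2"
}

# Three: convert SAM to BAM and sort by coordinate.
sam_to_sorted_bam() { # usage: sam_to_sorted_bam <in.sam> <out.bam>
    samtools view -b "$1" | samtools sort -o "$2" -
}

# Four and Five: drop potential PCR duplicates (-s = single-end), then index.
dedup_and_index() {  # usage: dedup_and_index <in.bam> <out.bam>
    samtools rmdup -s "$1" "$2" && samtools index "$2"
}

# Six: optional BAM to BED conversion.
bam_to_bed() {       # usage: bam_to_bed <in.bam> <out.bed>
    bedtools bamtobed -i "$1" > "$2"
}

# Seven: call peaks with MACS2, treatment against control, human genome size.
call_peaks() {       # usage: call_peaks <treatment.bam> <control.bam> <name> <outdir>
    macs2 callpeak -t "$1" -c "$2" -f BAM -g hs -n "$3" --outdir "$4"
}
```

For example, with hypothetical file names, `align GRCh38 sample1_trimmed.fq sample1.sam sample1.log && rate_ok sample1.log 80` would stop early on a poorly aligned sample, using the lower threshold mentioned in the note under step One. Recent samtools releases mark `rmdup` as obsolete in favor of `samtools markdup`; `rmdup` is used here only because it is the tool named in step Four.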
B) Another ChIP-seq data analysis pipeline (an alternative approach):
"...the standard ChIP-Seq data analysis process consists of a quality check (QC), mapping, peak calling, statistical analysis, annotation, and visualization..."
Reference: Seung-Jin Park, et al. A ChIP-Seq Data Analysis Pipeline Based on Bioconductor Packages. Genomics Inform. 2017 Mar. doi: 10.5808/GI.2017.15.1.11. PMCID: PMC5389943
Very important considerations when working with ChIP-seq (very useful tips): http://biocluster.ucr.edu/~rkaundal/workshops/R_feb2016/ChIPseq/ChIPseq.html