2 Info about the data and ideas about a ChIP seq analysis pipeline (20171203) - jlanillos/ChIPseq GitHub Wiki

This entry covers data and project organization, read trimming and quality control (QC), and a survey of mapping tools and ChIP-seq analysis pipelines:

Data

Illumina single-end sequencing, ~40M reads per experiment (one FASTQ file each). Throughout the project I will work with 3 files: 2 replicates of ER-alpha (MCF7 cell line) + 1 control sample.
I renamed the sample files for readability: control, sample1, sample2 (the actual names are listed in SampleID.txt).

Reference Genome Download Source: GRCh38.p11

Data trimming and quality check

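Trim Galore!: I ran Trim Galore! to trim the data. For high-throughput sequencing data, quality control before mapping is always recommended: poor-quality base calls and adapter contamination both hurt mapping, and in bisulfite experiments they can also lead to incorrect methylation calls (not a concern for this ChIP-seq project).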

Link to installation:

wget "https://github.com/FelixKrueger/TrimGalore/archive/0.4.5.zip"

Also available on UPPMAX: module add TrimGalore

FastQC: I checked the quality of the data with FastQC (available on UPPMAX) and reviewed the report.

Installation --> https://www.bioinformatics.babraham.ac.uk/projects/fastqc/INSTALL.txt
Usage: ./fastqc file_location/file_name.fq
Output: an HTML report and a .zip archive with the quality report (~1 min per file)
Installation in UPPMAX: module add FastQC

MultiQC (http://multiqc.info/): I then created a combined report of all the samples analyzed by the team with MultiQC.
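The trimming and QC steps above can be chained per sample. Below is a minimal sketch that only builds the command lines as argument lists and prints them (nothing is executed). The file names follow the renaming described in the Data section but are assumptions, as are the output directory names and the `_trimmed.fq` suffix that Trim Galore! is expected to give single-end output.

```python
import shlex
from pathlib import Path

# Hypothetical file names, following the control/sample1/sample2 renaming.
SAMPLES = ["control.fastq", "sample1.fastq", "sample2.fastq"]


def qc_commands(fastq_files, trimmed_dir="trimmed", qc_dir="fastqc", report_dir="multiqc"):
    """Return the trimming and QC commands as argument lists (no execution)."""
    commands = []
    for fq in fastq_files:
        # Trim Galore! in single-end mode; -o sets the output directory.
        commands.append(["trim_galore", "-o", trimmed_dir, fq])
        # Assumption: single-end output is named <basename>_trimmed.fq.
        trimmed = str(Path(trimmed_dir) / (Path(fq).stem + "_trimmed.fq"))
        # FastQC writes its HTML and .zip report into the -o directory,
        # which must already exist.
        commands.append(["fastqc", "-o", qc_dir, trimmed])
    # MultiQC scans the FastQC output directory and builds one combined report.
    commands.append(["multiqc", qc_dir, "-o", report_dir])
    return commands


for cmd in qc_commands(SAMPLES):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the tools are on the PATH (for example after the `module add` lines above).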

Explore ChIP-seq data pipelines

A) A clear description of the basic steps for analyzing ChIP-seq data (one possible approach):

Step One: align your FASTQ reads to the genome (BWA and Bowtie2 were considered)

Note: "from Bowtie2 summary statistics...If alignment percentage is low (I'd say < 90% or 80%), something might be off with your data. This can be that your reads need adapter trimming --> check the quality of your Fastq file with FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/"
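The alignment-rate check in the note can be automated by reading the summary that Bowtie2 prints to stderr. This is a small sketch: the summary text below is a made-up example in Bowtie2's single-end layout, and the 80% threshold is simply the lower of the two values quoted above, not an official cutoff.

```python
import re

# Made-up example of a Bowtie2 single-end summary (numbers are illustrative).
EXAMPLE_SUMMARY = """\
40000000 reads; of these:
  40000000 (100.00%) were unpaired; of these:
    2400000 (6.00%) aligned 0 times
    30000000 (75.00%) aligned exactly 1 time
    7600000 (19.00%) aligned >1 times
94.00% overall alignment rate
"""


def overall_alignment_rate(summary_text):
    """Extract the overall alignment rate (in percent) from a Bowtie2 summary."""
    match = re.search(r"([\d.]+)% overall alignment rate", summary_text)
    if match is None:
        raise ValueError("no 'overall alignment rate' line found")
    return float(match.group(1))


def needs_attention(summary_text, threshold=80.0):
    """True if the alignment rate is below the chosen threshold."""
    return overall_alignment_rate(summary_text) < threshold


rate = overall_alignment_rate(EXAMPLE_SUMMARY)
print(f"overall alignment rate: {rate}%  low: {needs_attention(EXAMPLE_SUMMARY)}")
```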

Step Two: filter your SAM file to remove non-uniquely aligned reads, keeping only reads with two or fewer mismatches: samtools

Step Three: convert your SAM file to a BAM file and sort it:

"...it's a lot easier to work with these smaller binary BAM files than with the huge text SAM files."

SAM --> convert to BAM --> sort the BAM (by chromosome)
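For the filtering in step Two, here is a rough sketch of the logic in plain Python, assuming the SAM file came from Bowtie2. It relies on two Bowtie2 optional tags: `XM:i` (number of mismatches in the alignment) and `XS:i` (present only when a second alignment was found for the read). Other aligners use different tags, so this is not a general recipe. The helper names and the example records are made up for illustration.

```python
def mismatches(fields):
    """Return the Bowtie2 XM:i value, or None if the read has no such tag."""
    for tag in fields[11:]:
        if tag.startswith("XM:i:"):
            return int(tag[5:])
    return None


def has_second_alignment(fields):
    """Bowtie2 adds XS:i only when it found more than one alignment."""
    return any(tag.startswith("XS:i:") for tag in fields[11:])


def filter_sam(lines, max_mismatches=2):
    """Yield header lines plus uniquely aligned reads with few mismatches."""
    for line in lines:
        if line.startswith("@"):
            yield line  # keep the header untouched
            continue
        fields = line.rstrip("\n").split("\t")
        n = mismatches(fields)
        # Unaligned reads carry no XM:i tag and are dropped as well.
        if n is not None and n <= max_mismatches and not has_second_alignment(fields):
            yield line


def record(name, tags):
    """Build a minimal made-up SAM record with the given optional tags."""
    mandatory = [name, "0", "chr1", "100", "42", "36M", "*", "0", "0", "A" * 36, "I" * 36]
    return "\t".join(mandatory + tags)


example = [
    "@SQ\tSN:chr1\tLN:248956422",
    record("unique_ok", ["AS:i:-6", "XM:i:1"]),
    record("multi_hit", ["AS:i:0", "XS:i:0", "XM:i:0"]),
    record("too_many_mismatches", ["AS:i:-18", "XM:i:3"]),
]
for kept in filter_sam(example):
    print(kept.split("\t")[0])
```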

Step Four: remove potential PCR duplicates from your BAM file: samtools rmdup or Picard Tools

"...multiple reads map to the same location in the genome, keep only one. This removes all potential PCR duplicates..."

We now have the final BAM file :)
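The idea behind the duplicate removal in step Four can be illustrated without any external tool. This is a simplified sketch, not how samtools rmdup or Picard are implemented: it treats reads as duplicates when they share chromosome, start position and strand, and keeps the first one seen, whereas the real tools apply more careful rules for choosing which read to keep. The read tuples are made up.

```python
def remove_duplicates(reads):
    """Keep one read per (chrom, start, strand); input order is preserved.

    reads: iterable of (chrom, start, strand, name) tuples.
    """
    seen = set()
    kept = []
    for chrom, start, strand, name in reads:
        key = (chrom, start, strand)
        if key not in seen:
            seen.add(key)
            kept.append((chrom, start, strand, name))
    return kept


# Made-up single-end reads: r1 and r2 map to the same place on the same strand.
reads = [
    ("chr1", 100, "+", "r1"),
    ("chr1", 100, "+", "r2"),
    ("chr1", 100, "-", "r3"),
    ("chr1", 250, "+", "r4"),
]
print([name for _, _, _, name in remove_duplicates(reads)])
```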

Step Five: index your BAM file, if necessary: samtools index

"...to load your BAM file in IGV browser or UCSC browser..."

Step Six: convert your BAM file into a BED file, if desired: bedtools bamtobed

"...BED files are simple text, tab-delimited files defining genomic coordinates. BED files are bigger files but can be easier to analyze and manipulate..."
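One detail worth keeping in mind for step Six is the coordinate change: SAM/BAM positions are 1-based, while BED intervals are 0-based and half-open. The sketch below converts a single SAM record into a six-column BED line (chrom, start, end, name, score, strand), using the mapping quality as the score. It is an illustration of the arithmetic, not a replacement for bedtools bamtobed, and the example record is made up.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# CIGAR operations that consume reference bases.
REF_CONSUMING = set("MDN=X")


def sam_to_bed(sam_line):
    """Convert one SAM record to a BED6 line, or None for unmapped reads."""
    fields = sam_line.rstrip("\n").split("\t")
    name, flag, chrom = fields[0], int(fields[1]), fields[2]
    if flag & 4:  # 0x4 = read unmapped
        return None
    pos, mapq, cigar = int(fields[3]), fields[4], fields[5]
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    start = pos - 1          # 1-based SAM position to 0-based BED start
    end = start + ref_len    # BED end is exclusive
    strand = "-" if flag & 16 else "+"  # 0x10 = reverse strand
    return "\t".join([chrom, str(start), str(end), name, mapq, strand])


# Made-up reverse-strand read with a 2-base deletion.
example = "\t".join(["r1", "16", "chr1", "100", "42", "30M2D6M", "*", "0", "0", "A" * 36, "I" * 36])
print(sam_to_bed(example))
```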

Step Seven: call peaks with a peak finder: MACS2, QuEST, JAMM, HOMER...

Source: https://github.com/mahmoudibrahim/JAMM/wiki/ChIP-Seq-Alignment-and-Processing-Pipeline
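For Step Seven, a MACS2 call for this project could look roughly like the sketch below, which only builds and prints the command lines. The BAM file names and the output directory are assumptions based on the control/sample1/sample2 naming, and calling peaks per replicate (rather than pooling) is just one possible choice.

```python
import shlex

# Hypothetical final BAM files from the steps above.
REPLICATES = ["sample1.sorted.bam", "sample2.sorted.bam"]
CONTROL = "control.sorted.bam"


def macs2_callpeak(treatment_bam, control_bam, name, outdir="peaks"):
    """Build a `macs2 callpeak` command for one replicate against the control."""
    return [
        "macs2", "callpeak",
        "-t", treatment_bam,   # ChIP (treatment) file
        "-c", control_bam,     # control file
        "-f", "BAM",           # input format
        "-g", "hs",            # effective genome size shortcut for human
        "-n", name,            # prefix for the output files
        "--outdir", outdir,
    ]


commands = [macs2_callpeak(bam, CONTROL, bam.split(".")[0]) for bam in REPLICATES]
for cmd in commands:
    print(" ".join(shlex.quote(arg) for arg in cmd))
```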

B) Another ChIP-seq data analysis pipeline (an alternative approach):

A ChIP-Seq Data Analysis Pipeline Based on Bioconductor Packages: "...the standard ChIP-Seq data analysis process consists of a quality check (QC), mapping, peak calling, statistical analysis, annotation, and visualization..."

Reference: Seung-Jin Park, et al. A ChIP-Seq Data Analysis Pipeline Based on Bioconductor Packages. Genomics Inform. 2017 Mar. doi: 10.5808/GI.2017.15.1.11. PMCID: PMC5389943

Very important considerations when working with ChIP-seq data (very useful tips): http://biocluster.ucr.edu/~rkaundal/workshops/R_feb2016/ChIPseq/ChIPseq.html