Task 5: Sigmoid colon and stomach distal regulatory activity - PatiGenius/EpigenomicsTask_MscOmics GitHub Wiki

Distal regulatory activity

In this final task we use the results from the previous section Task 4, in which we have obtained the peaks for both sigmoid colon and stomach that fall oustide gene body coordinates. This information will serve us to create a distal regulatory elements catalogue.

We create a folder regulatory_elements inside the epigenomics_uvic. Here we save all the subsequent results.

mkdir regulatory_elements: to create the directory.

We also create the rest of directories:

mkdir annotation

mkdir data

mkdir data/bed.files

mkdir data/bigBed.files

mkdir analyses

mkdir analyses/peaks.analyses

From the starting catalogue of open regions in each tissue, we select the ones that overlaps peaks of H3K27ac and H3K4me1 in each tissue. We get a list of candidate distal regulatory elements.

1. First, we need to obtain the peaks for H3K4me1 and H3K27ac. We select them from the metadata file available in the ChIP-seq folder.

For H3K4me1

grep -F H3K4me1 ../ChIP-seq/metadata.tsv | grep -F "bigBed_narrowPeak" | grep -F "pseudoreplicated_peaks" | grep -F "GRCh38" | awk 'BEGIN{FS=OFS="\t"}{print $1,$10,$22}' | sort -k2,2 -k1,1r | sort -k2,2 -u > analyses/bigBed.peaks.H3K4me1.txt

peaksH3K4me1

For H3K27ac

grep -F H3K27ac ../ChIP-seq/metadata.tsv | grep -F "bigBed_narrowPeak" | grep -F "pseudoreplicated_peaks" | grep -F "GRCh38" | awk 'BEGIN{FS=OFS="\t"}{print $1,$10,$22}' | sort -k2,2 -k1,1r | sort -k2,2 -u > analyses/bigBed.peaks.H3K27ac.txt

peaksH3K27ac

2. We need to obtain the bigBed files

For H3K4me1

cut -f1 analyses/bigBed.peaks.H3K27ac.txt | while read filename; do wget -P data/bigBed.files "https://www.encodeproject.org/files/$filename/@@download/$filename.bigBed" done

For H3K27ac

cut -f1 analyses/bigBed.peaks.H3K4me1.txt | while read filename; do wget -P data/bigBed.files "https://www.encodeproject.org/files/$filename/@@download/$filename.bigBed" done

The results are the following:

bigBedfiles_peaks

3. We may verify the md5 values

for file_type in bigBed; do ../bin/selectRows.sh <(cut -f1 analyses/"$file_type".peaks.*.txt) ../ChIP-seq/metadata.tsv | cut -f1,45 > data/"$file_type".files/md5sum.txt cat data/"$file_type".files/md5sum.txt | while read filename original_md5sum; do md5sum data/"$file_type".files/"$filename"."$file_type" | awk -v filename="$filename" -v original_md5sum="$original_md5sum" 'BEGIN{FS=" "; OFS="\t"}{print filename, original_md5sum, $1}' done > tmp mv tmp data/"$file_type".files/md5sum.txt awk '$2!=$3' data/"$file_type".files/md5sum.txt done

checkValues

We can see that they do correspond with the ones described above from the bigBed files.

4. Now, we need to convert the bigBed files to bed files in order to be able to manipulate data and use the intersect function from bedtools.

For H3K4me1

cut -f1 analyses/bigBed.peaks.H3K4me1.txt | while read filename; do bigBedToBed data/bigBed.files/"$filename".bigBed data/bed.files/"$filename".bed done

For H3K27ac

cut -f1 analyses/bigBed.peaks.H3K27ac.txt | while read filename; do bigBedToBed data/bigBed.files/"$filename".bigBed data/bed.files/"$filename".bed done

At this point we have available 4 bed files, obtained for 2 histone marks (H3K4me1,H3K27ac) and 2 tissues (stomach, sigmoid colon).

Now we can perform the intersect between the obtained peaks falling outside genes coordinates and the peaks of each one of the marks both in stomach and in simgoid colon.

5. Performing the intersect: from the starting catalogue of open regions (peaks obtained in task 4 ATAC-seq falling outside genes coordinates) we select those that overlap peaks of both histone marks in each tissue.

We want to retrieve the ATAC-seq peaks (database A,-a) that overlap both the peaks in H3K4me1 and H3K27ac (databases B, -b) in each tissue.

For sigmoid colon

bedtools intersect -a ../ATAC-seq/analyses/peaks.analyses/genes.sigmoid_colon.outside.gene.coord.txt -b data/bed.files/ENCFF724ZOF.bed data/bed.files/ENCFF872UHN.bed -u | sort -u > analyses/peaks.analyses/genes.peaks.overlap.sigmoid_colon.txt

intSigmoid

For stomach

bedtools intersect -a ../ATAC-seq/analyses/peaks.analyses/genes.stomach.outside.gene.coord.txt -b data/bed.files/ENCFF844XRN.bed data/bed.files/ENCFF977LBD.bed -u | sort -u > analyses/peaks.analyses/genes.peaks.overlap.stomach.txt

intStomach

We focus on regulatory elements located on chromosome 1 and generate a regulatory.elements.starts.tsv file that contains the name of the regulatory region (original ATAC-seq peak) and the start (5') coordinate of the region

From the results obtained in the previous exercise, we may select just those from chromosome 1. We print in the first column the information about the original ATAC-seq peak and in the second column, the start coordinate.

For stomach: 8602 regulatory elements located on chromosome 1

grep -F “chr1” analyses/peaks.analyses/genes.peaks.overlap.stomach.txt | awk 'BEGIN{FS=OFS="\t"} {print $4,$2}’ > analyses/ peaks.analyses/regulatory.elements.starts.stomach.tsv

regStomach

For sigmoid colon: 10734 regulatory elements located on chromosome 1

grep -F “chr1” analyses/peaks.analyses/genes.peaks.overlap.sigmoid_colon.txt | awk 'BEGIN{FS=OFS="\t"} {print $4,$2}’ > analyses/ peaks.analyses/regulatory.elements.starts.sigmoid_colon.tsv

regSigmoid

We focus on protein-coding genes located on chromosome 1. From the BED file of gene body coordinates that we have generated and used before (annotation/gencode.v24.protein.coding.gene.body.bed), we prepare a tab-separated file called gene.starts.tsv which stores the name of the gene in the first column and the start coordinate in the second one.

We use the next code:

awk 'BEGIN{FS=OFS="\t"} $1==”chr1” {if ($6=="+"){start=$2} else {start=$3}; print $4, start}' ../ChIP-seq/annotation/gencode.v24.protein.coding.gene.body.bed > analyses/gene.starts.tsv

chr1-PC

We download the python script to bin folder inside epigenomics_uvic directory and have a look to the help page

cp ../../Downloads/gent.distance.py ../bin: we download the python script and then move it to from Downloads to bin folder.

python ../bin/get.distance.py -h: we look the help page

The script uses two arguments: an input (file gene.starts.tsv) and a start coordinate (5' coordinate of a regulatory element). We may complete the script with the instructions to be able to obtain, for a given start coordinate, the closest gene, the start of this gene and the distance to the regulatory element.

script.py

We check if we have correctly defined the variables by running the following command:

python ../bin/get.distance.py --input analyses/gene.starts.tsv --start 980000: we are expected to obtain as output ENSG00000187642.9-982093-2093.

CheckScript

Now, for each regulatory element contained in the file regulatory.elements.starts.$tissue.tsv, we retrieve the closest gene and the distance to this gene by using the python script.

We run the next code:

for tissue in stomach sigmoid_colon; do cat analyses/peaks.analyses/regulatory.elements.starts.$tissue.tsv | while read element start; do python ../bin/get.distance.py --input analyses/gene.starts.tsv --start $start done > analyses/peaks.analyses/regulatoryElements.genes.distances.$tissue.tsv done

And obtain two files: regulatoryElements.genes.distances.stomach.tsv and regulatoryElements.genes.distances.sigmoid_colon.tsv.

The strucutre of the output is the next one (we use as example the file corresponding to sigmoid colon):

Structure

At last, we use R to compute the mean and the median of the distances stored in regulatoryElements.genes.distances.$tissue.tsv files

R: we run R

getwd(): to be sure that we are in the correct working directory

stomach.dist<- read.table("analyses/peaks.analyses/regulatoryElements.genes.distances.stomach.tsv", sep="\t") sigmoid.dist<- read.table("analyses/peaks.analyses/regulatoryElements.genes.distances.sigmoid_colon.tsv", sep="\t")

I change the column names to the corresponding information that each variable contains:

colnames(stomach.dist)<- c("Closest.Gene","Start","distance")

colnames(sigmoid.dist)<- c("Closest.Gene","Start","distance")

We perform the mean and the median of the distance for each case:

mean.stomach<- mean(stomach.dist[,"distance"])

mean.sigmoid<- mean(sigmoid.dist[,"distance"])

median.stomach<- median(stomach.dist[,"distance"])

median.sigmoid<- median(sigmoid.dist[,"distance"])

We finally show the results by displaying them on a list. The code appears in the next image, together with the results:

resultsR