October 2021 - Bozhie/transcription-modeling GitHub Wiki

10.21.2021

Jupyter notebook Commit: Relationship between Expression and CTCF binding to TSS

Questions: I am working on re-creating the relationship between differential expression and CTCF binding at the TSS (Elphege Fig. 6b), but I'm a little stuck. I built bedframes and did some intersections to find the closest TSS for both the RNA-seq and ChIP-seq data, so that I can compare differential expression to the binding location of CTCF. I'm having trouble figuring out how to connect these two datasets.

Notes for two approaches in notebook

Next Steps: Geoff suggested the following:

For the Nora 2017 Fig. 6b, I think the easiest approach may involve pybbi: https://github.com/nvictus/pybbi. Sorry I forgot to mention that package!

This can extract the signal from a bigwig around a set of locations (here, the positions of TSS you parsed into RNA_seq_bf), either with BBIFile.stackup or bbi.stackup syntax.

So this could involve:

0. Figuring out which bigWig they used for this figure.
1. Expanding the TSS positions to be +/- 1 kb.
2. Feeding this to bbi stackup to extract a #TSS x #bins matrix of CTCF signal from the bigWig.
3. Sorting the resulting matrix by RNA log-fold change (the other column) to get something that looks like 6b.
4. Visualizing with matshow() or similar.
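A minimal sketch of the window-expansion and sorting steps, in plain Python so it runs without pybbi. The records, the `FLANK` constant, and the stand-in signal matrix are all made up for illustration; in the notebook the matrix would come from the stackup call instead.

```python
# Hypothetical toy data: (chrom, tss_position, log2_fold_change) per gene.
tss_records = [
    ("chr1", 5_000, -1.2),
    ("chr1", 12_000, 0.4),
    ("chr2", 800, -0.1),
]
FLANK = 1_000  # +/- 1 kb around each TSS


def expand_tss(records, flank=FLANK):
    """Return (chrom, start, end, lfc) windows centred on each TSS.

    Starts are clipped at 0, so a window near the left chromosome edge
    ends up shorter than 2 * flank.
    """
    return [(chrom, max(0, pos - flank), pos + flank, lfc)
            for chrom, pos, lfc in records]


def sort_rows_by_lfc(matrix, lfcs):
    """Reorder the rows of a #TSS x #bins matrix by ascending log-fold change."""
    order = sorted(range(len(lfcs)), key=lambda i: lfcs[i])
    return [matrix[i] for i in order], [lfcs[i] for i in order]


windows = expand_tss(tss_records)

# Stand-in for the stackup output: one row of made-up signal per window.
signal = [[1.0, 5.0, 1.0], [0.0, 0.5, 0.0], [2.0, 2.0, 2.0]]
sorted_signal, sorted_lfc = sort_rows_by_lfc(signal, [w[3] for w in windows])
```

The sorted matrix is what would be handed to matshow() or a similar heatmap call.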

The left part of (c) just looks like an average over all rows in that above matrix that are downregulated. For the right part of (c) we'd need to look up which set of motifs they were using.
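The row average for the left part of (c) could look roughly like this. It is only a sketch: treating "downregulated" as log-fold change below zero is my assumption (the paper may apply a significance cutoff), and the numbers are invented.

```python
def mean_profile(matrix, lfcs, keep=lambda lfc: lfc < 0):
    """Column-wise mean over the rows whose log-fold change passes `keep`."""
    rows = [row for row, lfc in zip(matrix, lfcs) if keep(lfc)]
    if not rows:
        return []
    # zip(*rows) walks the matrix column by column (one column per bin).
    return [sum(col) / len(rows) for col in zip(*rows)]


# Made-up #TSS x #bins signal and matching log-fold changes.
signal = [[1.0, 5.0, 1.0], [0.0, 0.5, 0.0], [3.0, 1.0, 3.0]]
lfcs = [-1.2, 0.4, -0.1]

down_profile = mean_profile(signal, lfcs)
```

Passing a different `keep` function (for example `lambda lfc: lfc > 0`) would give the matching profile for upregulated genes.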

Paper/Resources

CTCF as a multifunctional protein in genome regulation and gene expression

Describes the different functions of CTCF; might be a useful reference for seeing how these transcriptional regulation functions show up in the data
Knowing where CTCF binds versus the impact it has on transcription when binding there (or binding that has unknown impact on gene regulation, and vice versa) will help when trying to track down new trends

Collection: The 3D genome

A bunch of relevant papers here
Looking at this one for inspiration: Long-range enhancer–promoter contacts in gene expression control

10.25.2021

Todo & Goals for this week

Download pybbi https://github.com/nvictus/pybbi
Finally recreate Elphege Fig. 6b-e
Write other questions we could ask with this data
What can we look at with Liu, Rao, Karissa's data?
Read / find sources for inspo with quantifying changes in expression
Go through this tutorial https://colab.research.google.com/drive/17E4h5aAOioh5DiTo7MZg4hpL6Z_0FyWr#scrollTo=WNT_Au-dAP8a

Recreating Fig 6b & Using pybbi

Purpose: Following Geoff's suggestions to create aggregate CTCF data for sorting

Notes:

It looks like

10.26.2021

Recreating Fig 6b (still)

Notes/Questions:

How do I choose the number of bins for stackup? BBIFile.stackup(chroms, starts, ends, [bins [, missing [, oob, [, summary]]]]) -> 2D numpy array
- My understanding: stackup searches the BBIFile and collects the values from the rows of the BBIFile that fall within the regions specified by chroms, starts, ends. So, if I do not specify the number of bins, does bins = length of [chroms, starts, ends]?
- Would I want these bins to be different for any reason?
- From the README: "Summary querying is supported by specifying the number of bins for coarsening. The summary statistic can be one of: 'mean', 'min', 'max', 'cov', 'std', or 'sum'. (default = 'mean')"
- Maybe I should choose 'sum'?
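To get a feel for what "number of bins for coarsening" means, here is a toy sketch of binning a per-base signal. It is not pybbi's implementation; `summarize` is a hypothetical helper. Going by Geoff's description of the output as a #TSS x #bins matrix, bins would set the number of columns per window, while the number of rows comes from how many windows are passed in.

```python
def summarize(values, n_bins, summary="mean"):
    """Coarsen a per-base signal into n_bins equal-width bins (toy version)."""
    if len(values) % n_bins != 0:
        raise ValueError("toy version needs len(values) divisible by n_bins")
    width = len(values) // n_bins
    chunks = [values[i:i + width] for i in range(0, len(values), width)]
    if summary == "mean":
        return [sum(chunk) / width for chunk in chunks]
    if summary == "sum":
        return [sum(chunk) for chunk in chunks]
    raise ValueError(f"unsupported summary: {summary}")


# Made-up per-base signal across one 8-bp window.
per_base = [0, 0, 2, 2, 4, 4, 0, 0]

binned_mean = summarize(per_base, 4, "mean")
binned_sum = summarize(per_base, 4, "sum")
```

One thing the toy version makes visible: 'sum' grows with the bin width, while 'mean' stays on the scale of the original signal, which may matter when comparing plots made with different bin counts.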

Issues

Trying to run BBIFile.stackup appears to crash the Jupyter kernel
- I was using the length of closest_RNAseq_TSS_window as the number of bins when it crashed instantly.
- Also tried with bins=10, got ValueError: Start exceeds the chromosome length, 159599783.
- Note: I had a Kernel Crashed error when trying to read another file, but I think Geoff said it's because the file is binary:

import pandas as pd

f = '/scratch/pokorny/Dixon_2012/GSE35156_GSM862723_hESC_HindIII_HiC.nodup.summary.txt'
boundary_regions = pd.read_table(f)
boundary_regions[0:10]
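For the "Start exceeds the chromosome length" error, one thing to try is checking every window against the chromosome sizes before querying. A rough sketch follows; `clip_windows` and the sizes are hypothetical, and real sizes would have to come from the genome build that matches the bigWig. Coordinates from a different build would be one way to end up with starts past the end of a chromosome.

```python
# Hypothetical chromosome sizes, for illustration only.
chromsizes = {"chr1": 10_000, "chr2": 5_000}


def clip_windows(windows, chromsizes):
    """Split windows into (kept, dropped).

    Kept windows are clipped to [0, chromosome size]; windows on unknown
    chromosomes or lying entirely outside the chromosome are dropped.
    """
    kept, dropped = [], []
    for chrom, start, end in windows:
        size = chromsizes.get(chrom)
        if size is None or start >= size or end <= 0:
            dropped.append((chrom, start, end))
            continue
        kept.append((chrom, max(0, start), min(end, size)))
    return kept, dropped


windows = [
    ("chr1", 9_500, 10_500),   # runs past the chromosome end, gets clipped
    ("chr1", 12_000, 13_000),  # starts past the chromosome end, gets dropped
    ("chrUn", 0, 100),         # chromosome not in chromsizes, gets dropped
    ("chr2", -200, 300),       # negative start, gets clipped to 0
]
kept, dropped = clip_windows(windows, chromsizes)
```

Looking at what lands in `dropped` should show whether only a few windows are off or whether the coordinates are systematically wrong.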

Next Steps

Added section: Using only bioframe: Visualize Change of DEGs versus the distance between their TSS and nearest CTCF binding site
Abandoning bbi for now until I can get more insight on it. Going to try to generate some other relationships, since I've been stuck on this one data set for a while

Tutorial:

https://colab.research.google.com/drive/17E4h5aAOioh5DiTo7MZg4hpL6Z_0FyWr#scrollTo=qVh9frJDgVQ-

10.27.2021

Visualizing relationship between CTCF-binding and Differential expression using pandas/bioframe only

Purpose: Want to generate a scatter plot to visualize the relationship, similar to Luppino Fig. 5(e): log2(fold change) of DEGs versus the distance between their TSS and the center of the nearest domain boundary.

Jupyter Header: Using only bioframe: Visualize Change of DEGs versus the distance between their TSS and nearest CTCF binding site

Process/Notes:

Created two dataframes with the closest TSS: one from RNA_seq_FPKM, and a second that uses the subset of TSS intervals from that mapping to find the closest CTCF-bound site in the WT/untagged ChIP-exo reads.
- Using untagged because we want to see where CTCF usually binds, not where it's missing
After mapping the closest CTCF-bound site to each TSS of the RNA-seq reads (also determined by bioframe.closest()), saw that multiple CTCF reads were being mapped to each TSS. Used bioframe clustering to group by each unique TSS and took the average of these distances, called avgDist.
- Note: could also maybe take the min() instead of avg()
Merged twoDay_RNAseq_change_TSSannt and avgDist with an inner join on the TSS coordinates, then made a scatter plot :D
Image: files/images/CTCF-bound_TSS_vs_dFPKM_2_days_auxin_dCTCF.png
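The process above, reduced to a plain-Python sketch so the logic is easy to check by hand. The notebook itself uses bioframe and pandas; every name and number here is hypothetical, and `nearest_distance` assumes a non-empty, sorted list of sites on a single chromosome.

```python
from bisect import bisect_left
from collections import defaultdict
from statistics import mean


def nearest_distance(pos, sorted_sites):
    """Distance from pos to the closest value in an ascending list."""
    i = bisect_left(sorted_sites, pos)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = sorted_sites[max(0, i - 1):i + 1]
    return min(abs(pos - site) for site in candidates)


def aggregate(hits, how=mean):
    """Collapse several (tss_key, distance) hits per TSS into one value."""
    grouped = defaultdict(list)
    for key, dist in hits:
        grouped[key].append(dist)
    return {key: how(dists) for key, dists in grouped.items()}


def inner_join(left, right):
    """Keep only keys present in both dicts, pairing up their values."""
    return {key: (left[key], right[key]) for key in left if key in right}


# Made-up hits (two CTCF reads near one TSS) and expression changes.
hits = [("chr1:5000", 120), ("chr1:5000", 480), ("chr2:800", 50)]
delta_fpkm = {"chr1:5000": -3.5, "chr2:800": 0.2, "chr3:42": 1.0}

avg_dist = aggregate(hits)           # what the notebook calls avgDist
min_dist = aggregate(hits, how=min)  # the min() alternative from the note
points = inner_join(avg_dist, delta_fpkm)  # (distance, change) per TSS
```

With these numbers the average and the minimum differ a lot for the TSS with two hits (300 vs 120), so the choice between them could visibly move points on the scatter plot.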

Editing wiki/adding images

Cloned wiki repo to be able to add images and save output files to wiki page directly, then reference
- issue: image does not appear?
Also tried adding a files/ directory to store output files, images, data produced from this project
- saving CTCF-bound_TSS_vs_dFPKM_2_days_auxin_dCTCF.png does not seem to make it display in GitHub? Maybe I'm not exporting it right; look at the Jupyter notebook if you want it

Next Steps:

Can do similar analysis varying the time-points (after x days of auxin-induced) to see how this changes over time.
Could overlay the ChIP-exo data from the dCTCF auxin-treated set on top of this data, to see whether changes in gene expression correlate with the loss of CTCF signal
For quantification, may want to use the ChIPexo_density / bigWig data instead?

Todo

Read https://www.nature.com/articles/s41588-018-0295-5
Maybe read https://www.nature.com/articles/s41576-019-0128-0
compare differential expression using RNA-seq data responding to other

10.29.2021

Fixing TSS annotations to use mm9

Purpose: Elphege Nora et al. 2017 used the mm9 (mouse) genome build to assess TSS locations and for all of their alignments. I had downloaded human genome annotations (hg38) from refTSS, so the TSS annotations need to be rebuilt against mm9.

Process/Notes:

Re-read paper that describes how dataset refTSS was generated https://www.sciencedirect.com/science/article/pii/S0022283619302530#t0005
- Data set appears to have mm10 only; they converted all mm9 annotations into mm10
  - raw data files here: http://reftss.clst.riken.jp/datafiles/
- What might be nice about this dataset is that it combines annotations from multiple sources (FANTOM5 promoter atlas, ENCODE RAMPAGE, EPDnew, DBTSS)
- Might instead either follow their method for adding annotations (using just one of the data sources) or look into other methods
Getting mm9 files
- From Elphege "mm9 RefSeq was used as reference gene set and adapters were trimmed", will try to use RefSeq
- Downloaded RefSeq mm9 genome (all these files and these ones have GFF/GTF intervals)
- Also downloaded UCSC GFF and GTF files (I think these intervals might be easier to work with than the actual nucleotide seqs)
Found a cool collection of tutorials/tools, most are python-compatible (a good resource): https://hemtools.readthedocs.io/en/latest/index.html
Building TSS out of mm9 genome file downloads
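A possible starting point for building the TSS from a GTF download, as a hedged sketch. It assumes the file has `transcript` feature rows, which not every GTF does (some only list exons and CDS, in which case the transcript span would need to be assembled from exons first). `tss_from_gtf_line` is a hypothetical helper and the example lines are invented.

```python
def tss_from_gtf_line(line):
    """Return (chrom, start, end, strand) for the TSS of a GTF transcript row.

    The result is a 1-bp interval in 0-based, half-open (BED-style)
    coordinates. Returns None for comments, other features, or rows
    without a +/- strand.
    """
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9 or fields[2] != "transcript":
        return None
    chrom, strand = fields[0], fields[6]
    start, end = int(fields[3]), int(fields[4])  # GTF: 1-based, inclusive
    if strand == "+":
        tss = start  # transcription begins at the leftmost base
    elif strand == "-":
        tss = end    # on the minus strand it begins at the rightmost base
    else:
        return None
    return (chrom, tss - 1, tss, strand)


# Invented example rows in GTF column order.
plus_line = 'chr1\trefGene\ttranscript\t1001\t2000\t.\t+\t.\tgene_id "A";'
minus_line = 'chr1\trefGene\ttranscript\t3001\t4000\t.\t-\t.\tgene_id "B";'
exon_line = 'chr1\trefGene\texon\t1001\t1200\t.\t+\t.\tgene_id "A";'

plus_tss = tss_from_gtf_line(plus_line)
minus_tss = tss_from_gtf_line(minus_line)
```

The strand check is the part that is easy to get wrong: using the start coordinate for every row would put minus-strand TSS at the wrong end of the transcript.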

todo:

finish making TSS for mm9 transcripts, redo TSS visualization
motif analysis
RNA-seq pipeline (see how easy to use, start using as a standardized pipeline)