October 2021 - Bozhie/transcription-modeling GitHub Wiki

10.21.2021

Jupyter notebook Commit: Relationship between Expression and CTCF binding to TSS

Questions: I am working on re-creating the relationship between differential expression and CTCF binding at the TSS (Elphege Fig. 6b), but I'm a little stuck. I built bedframes and did some intersections to find the closest TSS for both the RNA-seq and ChIP-seq data, so that I can compare differential expression to the binding location of CTCF. I'm having trouble figuring out how to connect these two datasets.

  • Notes for two approaches in notebook

Next Steps: Geoff suggested the following:

For the Nora 2017 Fig. 6b, I think the easiest approach may involve pybbi: https://github.com/nvictus/pybbi. Sorry I forgot to mention that package!

This can extract the signal from a bigwig around a set of locations (here, the positions of TSS you parsed into RNA_seq_bf), either with BBIFile.stackup or bbi.stackup syntax.

So this could involve:

  0. figuring out which bigWig they used for this figure
  1. expanding the TSS positions to be +/- 1kb
  2. feeding this to bbi stackup to extract a #TSS x #bins matrix of CTCF signal from the bigWig.
  3. sort the resulting matrix by RNA log-fold change (the other column) to get something that looks like 6b.
  4. visualize with matshow() or similar
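
Steps 1 to 3 could be sketched roughly as below. This is a minimal sketch, not the notebook's code: the function names and the 100-bin default are assumptions, and the only pybbi call used is `bbi.stackup(path, chroms, starts, ends, bins=...)` as listed in its README.

```python
import numpy as np


def tss_windows(tss_positions, flank=1000):
    """Step 1: expand each TSS position to a window of +/- `flank` bp."""
    tss = np.asarray(tss_positions, dtype=np.int64)
    return tss - flank, tss + flank


def ctcf_stackup(bigwig_path, chroms, starts, ends, n_bins=100):
    """Step 2: extract a (#TSS x #bins) matrix of signal from the bigWig."""
    import bbi  # pybbi; imported here so the other helpers work without it

    return bbi.stackup(bigwig_path, chroms, starts, ends, bins=n_bins)


def sort_rows_by_lfc(matrix, log_fold_change):
    """Step 3: reorder rows so the most negative log-fold change comes first."""
    lfc = np.asarray(log_fold_change, dtype=float)
    order = np.argsort(lfc)
    return np.asarray(matrix)[order], lfc[order]
```

For step 4, the sorted matrix could then be passed to something like `plt.matshow(sorted_matrix, aspect="auto")`, since there are many more TSS rows than bins.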

The left part of (c) just looks like an average over all rows in the matrix above that are downregulated. For the right part of (c) we'd need to look up which set of motifs they were using.
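
The averaging for the left part of (c) might look like the sketch below. Treating "downregulated" as a log-fold change below 0 is an assumption (the paper may use a significance cutoff instead), and `mean_profile_downregulated` is a hypothetical helper name.

```python
import numpy as np


def mean_profile_downregulated(matrix, log_fold_change, threshold=0.0):
    """Average the stackup rows whose log-fold change is below `threshold`."""
    matrix = np.asarray(matrix, dtype=float)
    lfc = np.asarray(log_fold_change, dtype=float)
    down = lfc < threshold  # boolean mask over TSS rows
    # nanmean ignores bins that had no signal (NaN) in some rows
    return np.nanmean(matrix[down], axis=0)
```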

Paper/Resources

CTCF as a multifunctional protein in genome regulation and gene expression

  • Describes different functions of CTCF, might be useful to reference to see how these different transcriptional regulation functions display in the data
  • Knowing where CTCF binds vs the impact it has on transcription when binding there (or, binding that has unknown impact on gene regulation and vice-versa) will help when trying to track down new trends

Collection: The 3D genome

10.25.2021

Todo & Goals for this week

Recreating Fig 6b & Using pybbi

Purpose: Following Geoff's suggestions to build an aggregate (stackup) matrix of CTCF signal that can be sorted by expression change

Notes:

  • It looks like

10.26.2021

Recreating Fig 6b (still)

Notes/Questions:

  • How do I choose number of bins for stackup? BBIFile.stackup(chroms, starts, ends, [bins [, missing [, oob, [, summary]]]]) -> 2D numpy array
    • My understanding: stackup searches the BBIFile, and collects the values from the rows of the BBIFile that fit between the regions specified by chroms, starts, ends. So, if I do not specify number of bins, bins=length of [chroms,starts,ends] ?
    • Would I want these bins to be different for any reason?
    • from README: "Summary querying is supported by specifying the number of bins for coarsening. The summary statistic can be one of: 'mean', 'min', 'max', 'cov', 'std', or 'sum'. (default = 'mean')"
    • Maybe should choose 'sum'?
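
As far as I can tell from the pybbi README, `bins` is the number of columns each interval is summarized into, so the stackup output has one row per interval and `bins` columns, rather than `bins` being tied to the number of intervals. The NumPy sketch below only imitates that coarsening on an in-memory array to show the effect of 'mean' versus 'sum'. It is not pybbi's implementation, and `coarsen` is a hypothetical helper.

```python
import numpy as np


def coarsen(per_base, n_bins, summary="mean"):
    """Summarize each row of per-base signal into `n_bins` equal-width bins."""
    per_base = np.atleast_2d(np.asarray(per_base, dtype=float))
    n_rows, width = per_base.shape
    if width % n_bins != 0:
        raise ValueError("window width must be a multiple of n_bins in this sketch")
    # Shape (rows, bins, bases per bin), then reduce over the last axis
    blocks = per_base.reshape(n_rows, n_bins, width // n_bins)
    if summary == "mean":
        return blocks.mean(axis=2)
    if summary == "sum":
        return blocks.sum(axis=2)
    raise ValueError("this sketch only supports 'mean' and 'sum'")
```

With 2 kb windows, 10 bins would give ten 200 bp columns per TSS. 'mean' keeps values on the same scale whatever the bin width, while 'sum' grows with the bin width.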

Issues

  • Trying to run BBIFile.stackup appears to crash the jupyter kernel
    • I was using the length of closest_RNAseq_TSS_window as the number of bins when it crashed instantly.
    • Also tried with bins=10, got ValueError: Start exceeds the chromosome length, 159599783.
    • Note: I had Kernel Crashed error when trying to read another file, but I think Geoff said it's because it's binary:

f = '/scratch/pokorny/Dixon_2012/GSE35156_GSM862723_hESC_HindIII_HiC.nodup.summary.txt'
boundary_regions = pd.read_table(f)
boundary_regions[0:10]
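
One way to narrow down the `ValueError` from stackup is to check every window against the chromosome sizes before calling it. A minimal sketch, assuming the sizes are available as a dict (the pybbi README lists `bbi.chromsizes(path)` for this); `split_in_bounds` is a hypothetical helper. Windows that fall past a chromosome end can also be a sign that the coordinates and the bigWig come from different genome builds.

```python
def split_in_bounds(chroms, starts, ends, chromsizes):
    """Return (kept, dropped) lists of (chrom, start, end) windows.

    A window is kept only if its chromosome is known and
    0 <= start < end <= chromosome length.
    """
    kept, dropped = [], []
    for chrom, start, end in zip(chroms, starts, ends):
        size = chromsizes.get(chrom)
        ok = size is not None and 0 <= start < end <= size
        (kept if ok else dropped).append((chrom, int(start), int(end)))
    return kept, dropped
```

Inspecting `dropped` shows which chromosomes and positions trigger the error.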

Next Steps

  • Added section: Using only bioframe: Visualize Change of DEGs versus the distance between their TSS and nearest CTCF binding site
  • Abandoning bbi for now until I can get more insight on it, going to try to generate some other relationships since I've been stuck on this one data set for a while

Tutorial:

https://colab.research.google.com/drive/17E4h5aAOioh5DiTo7MZg4hpL6Z_0FyWr#scrollTo=qVh9frJDgVQ-

10.27.21

Visualizing relationship between CTCF-binding and Differential expression using pandas/bioframe only

Purpose: Want to generate a scatter plot to visualize relationship, similar to Luppino Fig. 5(e) log2(fold change) of DEGs versus the distance between their TSS and the center of the nearest domain boundary.

Jupyter Header: Using only bioframe: Visualize Change of DEGs versus the distance between their TSS and nearest CTCF binding site

Process/Notes:

  • Created two dataframes with closest TSS: one from RNA_seq_FPKM, and one that uses the subset of TSS intervals from that mapping to find the closest CTCF-bound site in the WT/untagged ChIP-exo reads.
    • Using untagged because we want to see where CTCF usually binds, not where it's missing
  • After mapping the closest CTCF-bound sites to the TSSs of the RNA-seq reads (also determined by bioframe.closest()), saw that multiple CTCF reads were being mapped to each TSS. Used bioframe clustering to group by each unique TSS and took the average distance per TSS. Called avgDist.
    • Note: could also maybe take the min() instead of avg()
  • Merged twoDay_RNAseq_change_TSSannt and avgDist with an inner join on the TSS coordinates, then scatter plot :D files/images/CTCF-bound_TSS_vs_dFPKM_2_days_auxin_dCTCF.png
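
The grouping and join steps above could look roughly like this in pandas. It is a sketch under assumptions: the input is taken to have one row per (TSS, CTCF site) pair, with `chrom`, `start`, `end` for the TSS and a `distance` column, and the column and helper names are hypothetical rather than the notebook's. Passing `how="min"` gives the nearest-site variant mentioned in the note.

```python
import pandas as pd

TSS_KEYS = ["chrom", "start", "end"]


def distance_per_tss(pairs, how="mean"):
    """Collapse multiple CTCF hits per TSS into one distance per TSS."""
    return (
        pairs.groupby(TSS_KEYS)["distance"]
        .agg(how)
        .rename("ctcf_dist")
        .reset_index()
    )


def join_expression(rna_change, dist):
    """Inner join on TSS coordinates: keeps only TSSs present in both tables."""
    return rna_change.merge(dist, on=TSS_KEYS, how="inner")
```

The scatter plot would then be something like `merged.plot.scatter(x="ctcf_dist", y="log2fc")`, where `log2fc` stands in for whatever the expression-change column is called.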

Editing wiki/adding images

  • Cloned wiki repo to be able to add images and save output files to wiki page directly, then reference
    • issue: image does not appear?
  • Also tried adding a files/ directory to store output files, images, data produced from this project
    • the saved CTCF-bound_TSS_vs_dFPKM_2_days_auxin_dCTCF.png does not seem to display on GitHub? Maybe I'm not exporting it right; look at the Jupyter notebook if you want it

Next Steps:

  • Can do a similar analysis varying the time points (after x days of auxin treatment) to see how this changes over time.
  • Could overlay the ChIP-exo data from the dCTCF auxin-treated set on top of this data, to see whether the change in gene expression is correlated with the loss of CTCF signal
  • For quantification, may want to use the ChIPexo_density / the bigwig data instead ?

Todo

10.29.21

Fixing TSS annotations to use mm9

Purpose: Elphege Nora et al. 2017 used the mm9 (mouse) genome build for TSS locations and for all of their alignments. I had downloaded human genome annotations (hg38) from refTSS, so the TSS annotations need to be redone with mm9.
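
When rebuilding the TSS table from an mm9 transcript annotation, the TSS depends on strand: the transcript start for '+' genes and the transcript end for '-' genes. A minimal sketch, assuming a UCSC-style table with 0-based, half-open `txStart`/`txEnd` and a `strand` column; the actual annotation file and its column names are assumptions here, and `tss_from_transcripts` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def tss_from_transcripts(tx):
    """Build 1-bp TSS intervals (chrom, start, end, strand) from transcripts."""
    plus = (tx["strand"] == "+").to_numpy()
    # '+' strand: TSS is the first base; '-' strand: TSS is the last base
    tss = np.where(plus, tx["txStart"].to_numpy(), tx["txEnd"].to_numpy() - 1)
    out = pd.DataFrame({
        "chrom": tx["chrom"].to_numpy(),
        "start": tss,
        "end": tss + 1,
        "strand": tx["strand"].to_numpy(),
    })
    # Several transcripts can share one TSS; keep each position once
    return out.drop_duplicates().reset_index(drop=True)
```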

Process/Notes:

todo:

  • finish making tss for mm9 transcripts, redo TSS visualization
  • motif analysis
  • RNA-seq pipeline (see how easy to use, start using as a standardized pipeline)