Single cell TAD calling - ma-compbio/Higashi GitHub Wiki

The Higashi-analysis pipeline runs on the imputed contact maps produced by Higashi-main. Please first execute Higashi-main as described here.

How Higashi calls single cell TAD-like domain boundaries

Step 1: Calculating single cell insulation scores

Higashi calls TAD-like boundaries using the insulation score based method (Crane et al., Nature, 2015) We denote the contact map as M and the window size as w. The insulation score s at position x is calculated as:

Step 2: Calling boundaries based on the insulation scores

A high insulation score indicates that the flanking regions across this genomic loci highly interact with each other, suggesting that this loci is in the middle of a TAD-like domain. On the contrary, a low insulation score indicates that the neighboring loci are well separated and this position is likely a domain boundary. Therefore, the domain boundaries are identified as the local minima of the insulation scores.

Note: There are different definitions for ratio insulation score (inter/intra or intra/inter). Thus, in some literatures, the local maxima of the insulation profile correspond to TAD boundaries.

Step 3: Calibrating boundaries across the population

To quantify the cell-to-cell variability of TAD-like domain boundaries in the cell population that can consists thousands of cells efficiently, we developed a new scalable computational strategy for "calibrating" domain boundaries called in the above step. Specifically, this calibration method would identify a set of domain boundaries shared across the cell population. The size of this shared boundaries set is predefined as K. Each single-cell domain boundary will then be assigned to one of these shared boundaries. In other words, the domain boundaries for each cell will be represented as a subset of this shared boundary set. The assignment and the shared boundary set are optimized such that the calibrated ones are similar to the original ones.

The panel a in the following figure here shows an example of calibration of single cell TAD-like domain boundaries.

This problem can be formally defined as:

where B_i and B*_i represent the single-cell domain boundaries before and after calibration for cell i and dis() represents the distance measurement of two boundaries. Each shared boundary thus acts as an anchor point such that the variable single-cell domain boundaries associated with it can be analyzed systematically across the cell population.

This problem setting shares similarities with the well-known K-means clustering. In Higashi, we developed a new metric called "cumulative insulation score" to measure the similarity between two boundaries (as shown in panel b of the above illustration figure). This metric is calculated as the area between the insulation score curve and the horizontal line at the minimum insulation score within the given region (See Higashi manuscript for the reason to adopt this metric).

Together, we alternate between optimizing the shared boundary set B*={b*_k} and optimizing the assignment c_{i,j} until convergence, i.e.,

Usage

Please execute the following code:

cd higashi/
python scTAD.py [-c CONFIG] [--window_ins INS] [--window_tad TAD] [-o OUTPUT] [--neighbor]

'
optional arguments:
-o, --output          Output file name (stored in the `temp_dur`). (default: scTAD.hdf5)
-n, --neighbor        Call TADs on the imputed maps with neighboring cell information utilized.


required arguments:
-c CONFIG             The path to the configuration JSON file that you created in the step.
--window_ins INS      The window size to calculate the insulation scores. Must be larger than the 
                      resolution of Hi-C. (default: 1000000)
--window_tad TAD      The window size to call the boundaries based on the insulation scores. The 
                      local minima of the insulation scores within the windows would be a potential 
                      domain boundary. Must be larger than the resolution of Hi-C. (default: 500000)
'

The code will calculate single cell insulation scores, call domain boundaries based on the insulation score, and calibrating the single cell TAD-like domain boundaries successively. The description of the output files can be found here