Fast Higashi Usage

Data processing

Higashi processing (input data to sparse contact maps)

Fast-Higashi shares the same input format as Higashi, and thus uses the same code to parse it.

If you plan to run Higashi later

Run the following commands to process the input data (this only needs to be run once).

from higashi.Higashi_wrapper import Higashi

# config_path is the path to your configuration JSON file
higashi_model = Higashi(config_path)
higashi_model.process_data()

This function performs the following tasks:

  • generate a dictionary that maps genomic bin loci to node ids
  • extract data from data.txt and convert it into the hyperedge (triplet) format (see the sketch at the end of this subsection)
  • create contact maps from the sparse scHi-C data for visualization and the baseline model, and generate node attributes

The above function is equivalent to running

higashi_model.generate_chrom_start_end()
higashi_model.extract_table()
higashi_model.create_matrix()

Before each step executes, a progress message is printed, which helps with debugging.
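
For intuition, here is a minimal sketch (not Higashi's actual implementation) of what the hyperedge conversion does, assuming the data.txt columns cell_id, chrom1, pos1, chrom2, pos2 from the Higashi input format, with hypothetical bin sizes and offsets:

# Minimal sketch of contact-to-triplet conversion (not the actual Higashi code)
resolution = 1000000                      # example bin size in bp
chrom_offset = {"chr1": 0, "chr2": 250}   # hypothetical global bin-id offsets

def contact_to_triplet(cell_id, chrom1, pos1, chrom2, pos2):
    # Map each genomic locus to a global bin (node) id, then pair it with
    # the cell id to form a (cell, bin_i, bin_j) hyperedge triplet.
    bin1 = chrom_offset[chrom1] + pos1 // resolution
    bin2 = chrom_offset[chrom2] + pos2 // resolution
    return (cell_id, bin1, bin2)

print(contact_to_triplet(0, "chr1", 1200000, "chr1", 3500000))  # (0, 1, 3)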

If you plan to run Fast-Higashi only (individual contact pairs per cell)

See the next subsection (the model.fast_process_data() part).

Fast-Higashi only processing (sparse contact maps to sparse tensors)

from fasthigashi.FastHigashi_Wrapper import FastHigashi

# Initialize the model
model = FastHigashi(config_path=config_path,
                    path2input_cache="/work/magroup/ruochiz/fast_higashi_git/pfc_500k",
                    path2result_dir="/work/magroup/ruochiz/fast_higashi_git/pfc_500k",
                    off_diag=100,
                    filter=False,
                    do_conv=False,
                    do_rwr=False,
                    do_col=False,
                    no_col=False)

# Processing from contacts to sparse mtx
model.fast_process_data()

# Pack from sparse mtx to tensors
model.prep_dataset(batch_norm=True)

**required arguments:**
config_path           The path to the configuration JSON file that you created.
path2input_cache      The path to the directory where the cached tensor files will be stored.
path2result_dir       The path to the directory where the results will be stored.
off_diag              The maximum number of diagonals of the contact map to consider. When set
                      to 100, the 0th-100th diagonals are used.
filter                Whether to use only cells that pass quality control to learn the
                      meta-interactions, and then infer embeddings for the rest of the cells.
do_conv               Whether to apply linear convolution.
do_rwr                Whether to apply partial random walk with restart.
do_col                Whether to force sqrt_vc normalization; the program applies it
                      automatically when needed.
no_col                Whether to force the program not to use sqrt_vc normalization
                      (overriding the automatic behavior).
batch_norm            Whether to use batch-corrected normalization (passed to prep_dataset).

How should you select these important parameters?

  1. filter: recommended to be True, unless you find that few or no cells pass QC (you'll see a printed line summarizing good cells, bad cells, and total cells)
  2. do_conv:
    1. Does your data have coverage (contact pairs per cell) similar to Ramani et al. or Kim et al.? If so, True
    2. Are you running Fast-Higashi at a resolution higher (bin size smaller) than 500kb? If so, True
    3. Otherwise, False
  3. do_rwr: recommended to be True, but if you have high-quality data (contact pairs per cell similar to or higher than the sn-m3C-seq human PFC data), try False first, as it runs faster and sometimes gives better results at resolutions of 500kb or coarser
  4. do_col: recommended to be False; this option exists for debugging purposes
  5. no_col: recommended to be False; this option exists for debugging purposes
  6. batch_norm: if batch_id is provided in the configuration JSON file, set it to True
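
Putting these recommendations together, here is a plausible starting configuration for a typical dataset at 500kb resolution (a sketch only; the cache and result paths are placeholders you should replace):

model = FastHigashi(config_path=config_path,
                    path2input_cache="/path/to/cache",    # placeholder
                    path2result_dir="/path/to/results",   # placeholder
                    off_diag=100,     # use the 0th-100th diagonals
                    filter=True,      # learn meta-interactions from QC-passing cells
                    do_conv=False,    # False at 500kb with decent coverage
                    do_rwr=True,      # generally recommended
                    do_col=False,     # debugging option, leave off
                    no_col=False)     # debugging option, leave off
model.fast_process_data()
model.prep_dataset(batch_norm=False)  # set True only if batch_id is in the config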

Run the model

model.run_model(dim1=0.6,
                rank=256,
                n_iter_parafac=1,
                extra="")

**required arguments:**
dim1           The scaling factor for the chromosome-specific embedding size. For a chromosome
               with n bins, studied at a resolution of x bp, its chromosome-specific embedding
               size will be int(dim1 * n * x / 1000000)
rank           Size of the meta-embeddings (embeddings shared across all chromosomes)
n_iter_parafac Number of PARAFAC iterations within each Fast-Higashi iteration
extra          Extra annotation string used when saving the results
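
As a worked example of the dim1 formula: a hypothetical chromosome with n = 400 bins, studied at a resolution of x = 500,000 bp with dim1 = 0.6, gets a chromosome-specific embedding of size int(0.6 * 400 * 500000 / 1000000) = 120.

# Worked example of the dim1 formula (the chromosome size is hypothetical)
dim1 = 0.6
n = 400        # number of bins on the chromosome
x = 500000     # resolution in bp
print(int(dim1 * n * x / 1000000))  # 120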

Get the results

embedding = model.fetch_cell_embedding(final_dim=256,
                                       restore_order=False)

Fetch the embeddings. The final_dim argument should be equal to or smaller than the rank parameter used when running Fast-Higashi. The restore_order argument controls whether the returned cell embeddings follow the same order as the input scHi-C data. By default, Fast-Higashi sorts cells such that cells with good quality are placed before those that do not pass the quality check, and this order is usually not consistent with the input cell order. Set restore_order=True to generate cell embeddings in the same order as the input (as shown below).
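
For example, to fetch embeddings aligned with the input cell order:

embedding = model.fetch_cell_embedding(final_dim=256,
                                       restore_order=True)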

(Optional!!!) Additional correction of the batch effects

model.correct_batch_linear(embedding=embedding,
                           var_to_regress='exp')

Performs a linear correction of batch effects or read-coverage bias. var_to_regress should be the key name in the label_info.pickle file you provided that corresponds to the per-cell variable / condition to be regressed out.
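
For reference, a minimal sketch of how such a label_info.pickle might be assembled; the key names and values here are hypothetical (except 'exp', used in the call above), and each key is assumed to map to one label per cell:

import pickle

label_info = {
    "cell type": ["L2/3", "Astro", "L2/3"],  # hypothetical per-cell labels
    "exp": ["batch1", "batch1", "batch2"],   # per-cell condition to regress out
}
with open("label_info.pickle", "wb") as f:
    pickle.dump(label_info, f)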

Output format

The embeddings are stored as a dictionary, which can be retrieved either through embedding = model.embedding_storage or as the return value of model.fetch_cell_embedding described above.

By default it has the following keys: 'embed_raw', 'embed_l2_norm', 'embed_correct_coverage_fh', 'embed_l2_norm_correct_coverage_fh'. These correspond to the raw embeddings, the embeddings after L2 normalization, the embeddings after linear correction of coverage, and the embeddings after linear correction of coverage plus L2 normalization.
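
For example, assuming each stored value is an array of shape (num_cells, final_dim):

cell_embed = embedding['embed_l2_norm']
print(cell_embed.shape)  # (num_cells, final_dim)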

If you additionally run the optional linear correction on a variable 'abc', there will be two additional keys: 'embed_correct_abc' and 'embed_l2_norm_correct_abc'.

So, out of all these embeddings, which one should you use? It is suggested to try 'embed_l2_norm' or 'embed_l2_norm_correct_coverage_fh'.

Although 'embed_l2_norm_correct_coverage_fh' technically offers embeddings that are unbiased with respect to the sequencing depths of cells, for some datasets the cell type information is highly correlated with sequencing depth, and a simple regression would lead to much worse embedding results. This is caused by the non-orthogonality of biological and technical variance, and usually happens when two scHi-C libraries contain different (or even non-overlapping) cell types but also have different sequencing depths. It is therefore suggested to test both embeddings and see which gives more reasonable results.
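
A minimal sketch of such a side-by-side comparison, assuming the umap-learn and matplotlib packages are installed and that embedding comes from fetch_cell_embedding above:

import umap
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, key in zip(axes, ['embed_l2_norm', 'embed_l2_norm_correct_coverage_fh']):
    xy = umap.UMAP().fit_transform(embedding[key])  # 2D projection for visual inspection
    ax.scatter(xy[:, 0], xy[:, 1], s=2)
    ax.set_title(key)
plt.show()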