Configure the parameters - ma-compbio/Higashi GitHub Wiki

All customizable parameters are stored in a JSON config file. The path to this JSON config file will be needed when running the program.

For all parameters below, a parameter marked as Optional can be left out when it is not applicable.

For examples of the configuration JSON file, see the tutorials linked in this wiki.
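As a rough illustration of what such a file might look like, the sketch below writes a config containing the parameters marked Required on this page and reads it back. The values are copied from the "example" columns of the tables below and are placeholders, not recommended settings. The file name `config.json` and the helper names `write_config` and `read_config` are hypothetical and not part of Higashi.

```python
import json
import os
import tempfile

# Placeholder values taken from the "example" column of the tables on this page.
config = {
    "config_name": "sn-m3C-seq-with_meth",
    "data_dir": "/sn-m3C-seq",
    "structured": True,
    "temp_dir": "../Temp/sn-m3C_1Mb",
    "genome_reference_path": "../hg19.chrom.sizes.txt",
    "cytoband_path": "../cytoBand_hg19.txt",
    "chrom_list": ["chr1", "chr2", "chr3", "chr4", "chr5"],
    "resolution": 1000000,
    "resolution_cell": 1000000,
    "local_transfer_range": 1,
    "dimensions": 64,
    "loss_mode": "zinb",
    "rank_thres": 1,
    "embedding_name": "exp1",
    "impute_list": ["chr1"],
    "minimum_distance": 1000000,
    "maximum_distance": -1,
    "neighbor_num": 5,
    "cpu_num": -1,
    "gpu_num": 8,
}


def write_config(cfg, path):
    """Hypothetical helper: serialize the config dict to a JSON file."""
    with open(path, "w") as fh:
        json.dump(cfg, fh, indent=2)


def read_config(path):
    """Hypothetical helper: load a JSON config file back into a dict."""
    with open(path) as fh:
        return json.load(fh)


# Write to a temporary directory so the sketch does not touch real data.
config_path = os.path.join(tempfile.mkdtemp(), "config.json")
write_config(config, config_path)
loaded = read_config(config_path)
```

Note that Python's `True` is written as the JSON literal `true`, which is the form used in the tables on this page.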

Input data related parameters

params	Type	Required/Optional	description	example
config_name	str	Required if you will be using Higashi-vis, otherwise Optional	Name of this configuration; used by the visualization tool	"sn-m3C-seq-with_meth"
data_dir	str	Required	Directory where the data are stored	"/sn-m3C-seq"
input_format	str	Optional	How the data are stored. Can be either "higashi_v1" or "higashi_v2". "higashi_v1" stores the scHi-C dataset as one big table named data.txt. "higashi_v2" stores the contact pairs of each cell as an individual table and lists the paths to these files in filelist.txt	"higashi_v1"
header_included	bool	Required when `input_format`="higashi_v2"	Whether each table includes a header row	true
contact_header	list	Required when `input_format`="higashi_v2" and `header_included` is false	The header of the contact pairs. Must include ["chrom1", "pos1", "chrom2", "pos2"]. When "count" is not included, the program assumes count=1 for all contact pairs	["chrom1", "pos1", "chrom2", "pos2", "count"]
structured	bool	Required	Whether the data.txt file is structured, i.e. the interaction pairs of each cell occupy consecutive rows of the dataframe rather than being randomly placed. If data.txt is already organized this way, processing takes much less memory and time	true
temp_dir	str	Required	Directory where the temporary files will be stored. An empty folder will be created if it doesn't exist.	"../Temp/sn-m3C_1Mb"
genome_reference_path	str	Required	Path of the genome reference file from the UCSC Genome Browser; used to generate bin nodes	"../hg19.chrom.sizes.txt"
cytoband_path	str	Required	Path of the cytoband reference file from the UCSC Genome Browser; used to remove centromere regions	"../cytoBand_hg19.txt"
coassay	bool	Optional	Whether to use co-assayed signals	true
coassay_signal	str	Optional	Name of the co-assayed signals in the hdf5 file to use (can be empty)	"meth_cg-100kb-cg_rate"
batch_id	str	Optional	The name of the batch id information stored in `label_info.pickle`. The corresponding information would be used to remove batch effects	"batch id"
library_id	str	Optional	Similar to batch_id. The difference is that batch_id assumes the cell type composition of different batches is similar, while library_id does not make that assumption (for example, Ramani et al. and 4DN sci-Hi-C).	"batch id"
bulk_path	str	Optional	Path of the bulk Hi-C file (mcool format), can be used when calculating the projection matrix for scA/B	"/bulkHiC/4DNFIYGPDLKF_C28.mcool"

Note: Before using the batch effect removal function of Higashi, it is recommended to first check whether the dataset actually has strong batch effects.
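The conditional requirements in the table above (for example, `contact_header` only being required when `input_format` is "higashi_v2" and `header_included` is false) can be checked before launching a run. The following is a minimal sketch of such a check, based only on the rules stated on this page. The function `check_input_params` is hypothetical and is not part of Higashi, and Higashi's own validation may differ.

```python
REQUIRED_INPUT_KEYS = (
    "data_dir", "structured", "temp_dir",
    "genome_reference_path", "cytoband_path",
)
REQUIRED_CONTACT_COLUMNS = ("chrom1", "pos1", "chrom2", "pos2")


def check_input_params(config):
    """Hypothetical helper: return a list of problems found in the
    input-data parameters, following the rules in the table above."""
    problems = []

    for key in REQUIRED_INPUT_KEYS:
        if key not in config:
            problems.append(f"missing required parameter: {key}")

    fmt = config.get("input_format")
    if fmt is not None and fmt not in ("higashi_v1", "higashi_v2"):
        problems.append(f"unknown input_format: {fmt!r}")

    if fmt == "higashi_v2":
        if "header_included" not in config:
            problems.append("header_included is required for higashi_v2")
        elif config["header_included"] is False:
            header = config.get("contact_header")
            if header is None:
                problems.append(
                    "contact_header is required when header_included is false"
                )
            else:
                missing = [c for c in REQUIRED_CONTACT_COLUMNS if c not in header]
                if missing:
                    problems.append(f"contact_header lacks columns: {missing}")

    # The table states that correct_be_impute=true needs batch_id.
    if config.get("correct_be_impute") and "batch_id" not in config:
        problems.append("batch_id is required when correct_be_impute is true")

    return problems


base = {
    "data_dir": "/sn-m3C-seq",
    "structured": True,
    "temp_dir": "../Temp/sn-m3C_1Mb",
    "genome_reference_path": "../hg19.chrom.sizes.txt",
    "cytoband_path": "../cytoBand_hg19.txt",
}
v2_config = dict(base, input_format="higashi_v2", header_included=False,
                 contact_header=["chrom1", "pos1", "chrom2"])
problems = check_input_params(v2_config)
```

Here `problems` would contain a single entry reporting that `pos2` is absent from `contact_header`.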

Training process related parameters

params	Type	Required/Optional	description	example
chrom_list	list	Required	List of chromosomes to train the model on. The naming convention should be the same as in data.txt and the genome_reference file	["chr1", "chr2", "chr3", "chr4", "chr5"]
resolution	int	Required	Resolution for imputation.	1000000
resolution_cell	int	Required	Resolution for generating the attributes of the cell nodes. We recommend 1Mb (data with lower coverage per cell) or 500Kb (data with higher coverage per cell).	1000000
local_transfer_range	int	Required	Number of neighboring bins in 1D genomic distance to consider during imputation (similar to the window size of linear convolution)	1
dimensions	int	Required	Embedding dimensions	64
loss_mode	str	Required	Loss used to train the model. Can be "classification", "rank", or "zinb" (zero-inflated negative binomial, recommended)	"zinb"
rank_thres	int	Required	Pairs whose ground truth values differ by more than rank_thres are considered to have a stable order.	1
embedding_epoch	int	Optional	Number of epochs to train to generate embeddings. When this parameter is not included, Higashi trains for 60 epochs in this stage by default.	80
no_nbr_epoch	int	Optional	Number of epochs to train Higashi without neighbor information. When this parameter is not included, Higashi trains for 45 epochs in this stage by default.	80
with_nbr_epoch	int	Optional	Number of epochs to train Higashi with neighbor information. When this parameter is not included, Higashi trains for 30 epochs in this stage by default.	60

Note: Higashi needs a different number of epochs to converge on different datasets. All datasets we tested in the paper took fewer than 60 epochs. Higashi also saves the trained embeddings every epoch (the location can be found here). When the embeddings give satisfying results, feel free to stop the Higashi program and then start it again with the option -s 2 (see the detailed explanation of this option in Step 3). Higashi will then load the previously trained model and continue training from there, which saves time.
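To make the defaults for the three optional epoch parameters concrete, here is a small sketch that fills in the documented defaults (60, 45, and 30) when a key is absent from the config. The helper `effective_epochs` is hypothetical. It only mirrors the defaults described in the table and is not how Higashi itself reads the config.

```python
# Defaults as documented in the table above.
DEFAULT_EPOCHS = {
    "embedding_epoch": 60,
    "no_nbr_epoch": 45,
    "with_nbr_epoch": 30,
}


def effective_epochs(config):
    """Hypothetical helper: return the epoch counts that would apply,
    using the documented default for every key missing from the config."""
    return {key: config.get(key, default)
            for key, default in DEFAULT_EPOCHS.items()}


# Only embedding_epoch is overridden; the other two fall back to defaults.
epochs = effective_epochs({"embedding_epoch": 80})
```

With this input, `epochs` keeps the overridden `embedding_epoch` of 80 and uses the defaults of 45 and 30 for the other two stages.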

Output related parameters

params	Type	Required/Optional	description	example
embedding_name	str	Required	Name of embedding vectors to store	"exp1"
impute_list	list	Required	List of chromosomes to impute (each must appear in chrom_list above)	["chr1"]
minimum_distance	int	Required	Minimum genomic distance between a pair of genomic bins to impute (bp)	1000000
maximum_distance	int	Required	Maximum genomic distance between a pair of genomic bins to impute (bp, -1 represents no constraint)	-1
neighbor_num	int	Required	Number of neighboring cells to incorporate when making imputation, the hyperparameter `k` in the manuscript	5
correct_be_impute	bool	Optional	Whether to take batch effects into account and try to remove them when imputing. When set to true, the `batch_id` parameter must be included.	false
impute_verbose	int	Optional	Verbosity level of imputation process. When set as a positive int $n$, the program will print information every $n$ cells. When set as 0 or negative int, the program won't print the information. When not included, it will be set as 10 by default.	10
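The interplay of `minimum_distance`, `maximum_distance`, and `impute_list` can be illustrated with a short sketch. It assumes that bin positions are given in bp and that both distance bounds are inclusive. Neither assumption is stated on this page, and the functions `within_impute_range` and `impute_list_errors` are hypothetical rather than part of Higashi.

```python
def within_impute_range(pos1, pos2, minimum_distance, maximum_distance):
    """Hypothetical helper: decide whether a pair of bins (positions in bp)
    falls inside the imputation range. Bounds are treated as inclusive,
    which is an assumption; -1 for maximum_distance means no upper limit."""
    distance = abs(pos2 - pos1)
    if distance < minimum_distance:
        return False
    if maximum_distance != -1 and distance > maximum_distance:
        return False
    return True


def impute_list_errors(config):
    """Hypothetical helper: list chromosomes in impute_list that do not
    appear in chrom_list, as the table above requires."""
    allowed = set(config["chrom_list"])
    return [chrom for chrom in config["impute_list"] if chrom not in allowed]


config = {
    "chrom_list": ["chr1", "chr2", "chr3", "chr4", "chr5"],
    "impute_list": ["chr1", "chrX"],
    "minimum_distance": 1000000,
    "maximum_distance": -1,
}

near = within_impute_range(0, 500000, config["minimum_distance"],
                           config["maximum_distance"])
far = within_impute_range(0, 50000000, config["minimum_distance"],
                          config["maximum_distance"])
bad_chroms = impute_list_errors(config)
```

In this example the pair 500 kb apart is excluded by `minimum_distance`, the pair 50 Mb apart is kept because `maximum_distance` is -1, and `chrX` is flagged because it is not in `chrom_list`.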

Computational resources related parameters

params	Type	Required/Optional	description	example
cpu_num	int	Required	Higashi is optimized for multiprocessing. Use this parameter to limit the number of cores to use. -1 means use all available CPUs.	-1
gpu_num	int	Required	Higashi is optimized to use multiple GPUs for computational efficiency, but it does not use all of them all the time. For co-assayed data, it uses multiple GPUs in the processing step. For all data, Higashi trains and imputes scHi-C on different GPUs. This parameter should be non-negative.	8

Note: cpu_num and gpu_num do not necessarily correspond to the physical number of CPU cores or GPU cards. They refer to how many parallel threads are used.
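The following sketch shows one plausible way to interpret these two values, treating `cpu_num = -1` as "use whatever the machine reports" and rejecting a negative `gpu_num`. How Higashi resolves -1 internally is not stated on this page, so the use of `os.cpu_count()` is an assumption. The function `resolve_resources` is hypothetical.

```python
import os


def resolve_resources(cpu_num, gpu_num):
    """Hypothetical helper: turn the configured values into concrete
    worker counts. Using os.cpu_count() for -1 is an assumption, not
    necessarily what Higashi does."""
    if cpu_num == -1:
        # os.cpu_count() may return None on unusual platforms.
        cpu_workers = os.cpu_count() or 1
    elif cpu_num >= 1:
        cpu_workers = cpu_num
    else:
        raise ValueError("cpu_num must be -1 or a positive integer")

    if gpu_num < 0:
        raise ValueError("gpu_num should be non-negative")

    return cpu_workers, gpu_num


cpu_workers, gpu_workers = resolve_resources(cpu_num=-1, gpu_num=8)
```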

Visualization related parameters

params	Type	Required/Optional	description	example
UMAP_params	dict	Optional	Parameters that will be passed to Higashi-vis. Higashi-vis will use these parameters when calculating the UMAP visualization. Follow the naming convention of the package umap	{"n_neighbors": 30, "min_dist": 0.3}
TSNE_params	dict	Optional	Similar to UMAP_params. Follow the naming convention of tsne in sklearn	{"n_neighbors": 15}
random_walk	bool	Optional	Whether to run linear_convolution and randomwalk-with-restart in the processing part for visualization. Code adapted from scHiCluster. Not recommended when the resolution is higher than 100Kb. When not included, it is set to false by default.	false
vis_palette	dict	Optional	Custom palette for a specific label_info.	{"cluster label": {"L23": "#e51f4e", "L4": "#45af4b", "L5": "#ffe011", "L6": "#0081cc", "Ndnf": "#ff7f35", "Vip": "#951eb7", "Pvalb": "#4febee"}}
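The `vis_palette` example is a nested mapping: the outer key names a label_info entry and the inner dict maps each label value to a colour. The sketch below shows how such a structure could be used to look up colours for a list of cell labels. The function `colors_for` and the grey fallback colour are hypothetical. How Higashi-vis actually handles labels missing from the palette is not described here.

```python
# Palette copied from the example column above.
vis_palette = {
    "cluster label": {
        "L23": "#e51f4e", "L4": "#45af4b", "L5": "#ffe011",
        "L6": "#0081cc", "Ndnf": "#ff7f35", "Vip": "#951eb7",
        "Pvalb": "#4febee",
    }
}


def colors_for(labels, palette, label_name, fallback="#999999"):
    """Hypothetical helper: map each cell label to its palette colour.
    Labels without an entry get the fallback colour (an assumption)."""
    mapping = palette.get(label_name, {})
    return [mapping.get(label, fallback) for label in labels]


cell_labels = ["L23", "Vip", "unlabelled"]
cell_colors = colors_for(cell_labels, vis_palette, "cluster label")
```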