Output of Higashi main - ma-compbio/Higashi GitHub Wiki

The output path, naming convention, etc., are defined in the configuration file (see "configuration of parameters" for details). For instance, all Higashi output is stored in the temp_dir specified in the configuration file. See the tutorials for examples of how to use these output files.

Embedding vectors

The single-cell embeddings are saved with the name /embed/{embedding_name}_0_origin.npy, where the row order is consistent with the cell_id column of the input file data.txt.

Besides the cell embeddings, the embeddings for the genomic bins are also saved with the name {embedding_name}_{id}_origin.npy.

  • The {embedding_name} is the parameter in the configuration file.
  • The {id} runs from 0 to the number of chromosomes contained in the training data. 0 corresponds to the cell embeddings; 1 and above correspond to the embeddings of the bins of each chromosome.

The embeddings can be loaded with the standard np.load(xxx).
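As a minimal sketch of the naming scheme described above, the helpers below build the file paths and load the cell and bin embeddings. The function names `embedding_path` and `load_embeddings` are hypothetical, and the sketch assumes that the bin embeddings sit in the same embed/ subfolder of temp_dir as the cell embeddings and that the caller knows the number of chromosomes.

```python
import os

import numpy as np


def embedding_path(temp_dir, embedding_name, idx):
    """Path of the idx-th embedding file (0 = cells, 1 and above = bins)."""
    # Assumption: all embedding files are stored under temp_dir/embed/.
    return os.path.join(temp_dir, "embed", "%s_%d_origin.npy" % (embedding_name, idx))


def load_embeddings(temp_dir, embedding_name, n_chrom):
    """Return (cell_embeddings, list of bin embeddings, one per chromosome)."""
    cell_embed = np.load(embedding_path(temp_dir, embedding_name, 0))
    bin_embeds = [
        np.load(embedding_path(temp_dir, embedding_name, i))
        for i in range(1, n_chrom + 1)
    ]
    return cell_embed, bin_embeds
```

The rows of the returned cell embedding array follow the cell_id order of data.txt, so they can be joined directly with per-cell metadata.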

Imputed contact maps

[Figure: imputation showcase (figs/imputation_showcase.png)]

The imputed matrices are saved with the name {chrom_name}_{embedding_name}_nbr_{k}_impute.hdf5.

  • The {chrom_name} is the chromosome name of the imputed maps.
  • The {embedding_name} is the parameter in the configuration file.
  • The {k} can be either 0 or the {neighbor_num} parameter specified in the configuration file. When {k}=0, it represents the imputation results without using any neighboring cell information.
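The naming rule above can be written as a small helper. This is only a sketch: `impute_path` is a hypothetical name, and it assumes the files sit directly in temp_dir.

```python
import os


def impute_path(temp_dir, chrom_name, embedding_name, k):
    """Path of the imputed-map file for one chromosome.

    k is 0 (no neighboring cells used) or the neighbor_num
    value from the configuration file.
    """
    file_name = "%s_%s_nbr_%d_impute.hdf5" % (chrom_name, embedding_name, k)
    return os.path.join(temp_dir, file_name)


# Example with made-up values for the configuration parameters.
print(impute_path("./tmp", "chr1", "demo", 0))
```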

The format of the imputed matrix is an hdf5 file with the structure

.
├── coordinates (array of size k x 2)
├── cell_0 (vector of size k)
├── cell_1
├── ...
└── cell_N

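The matrix of a cell can be reconstructed by writing that cell's vector into the entries listed in coordinates. Note that k in the structure above is the number of stored entries per cell, not the {k} of the file name. For instance: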

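import os

import h5py
import numpy as np
from tqdm import trange

# temp_dir and embedding_name are taken from the configuration file; chrom is a chromosome name
path = os.path.join(temp_dir, "%s_%s_nbr_0_impute.hdf5" % (chrom, embedding_name))
with h5py.File(path, "r") as impute_f:
    # (row, column) bin indices shared by all cells
    coordinates = np.array(impute_f['coordinates']).astype('int')
    xs, ys = coordinates[:, 0], coordinates[:, 1]
    size = int(np.max(ys)) + 1
    # every key except 'coordinates' is one cell
    n_cells = len(impute_f.keys()) - 1
    m1 = np.zeros((size, size))
    for i in trange(n_cells):
        m1 *= 0.0  # reset the buffer before filling in the next cell
        proba = np.array(impute_f["cell_%d" % i])
        m1[xs, ys] += proba
        m1 = m1 + m1.T  # mirror the stored entries across the diagonal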
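To make the reconstruction step easier to follow without an hdf5 file at hand, here is a toy sketch of the same operation on hand-written data. The function name `vector_to_matrix` and the example values are made up for illustration; the logic mirrors the snippet above and, like it, assumes that the stored coordinates cover one triangle of the matrix (any diagonal entries would be doubled by the symmetrization).

```python
import numpy as np


def vector_to_matrix(coordinates, values):
    """Scatter a per-cell vector into a dense symmetric contact matrix."""
    coordinates = np.asarray(coordinates).astype(int)
    xs, ys = coordinates[:, 0], coordinates[:, 1]
    # Matrix size is inferred from the largest column index, as in the snippet above.
    size = int(np.max(ys)) + 1
    m = np.zeros((size, size))
    m[xs, ys] += np.asarray(values, dtype=float)
    # Mirror the filled entries across the diagonal.
    return m + m.T


# Three stored entries for a 3 x 3 map: (0,1), (0,2), (1,2).
toy = vector_to_matrix([[0, 1], [0, 2], [1, 2]], [0.5, 0.2, 0.9])
print(toy)
```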

Transform imputation results to .cool files.

We provide a script that selects groups of cells, merges their imputation results, and saves them as .cool files. Detailed documentation of the .cool format can be found at https://cooler.readthedocs.io/en/latest/schema.html.

To do so, run the following command:

python Merge2Cool.py [-c CONFIG] [-o OUTPUT] [-l LIST_PATH] [-t LIST_TYPE] [-n] 

optional arguments:
-n, --neighbor        Create .cool files for imputed maps with neighboring cell information utilized.

required arguments:
-c CONFIG             The path to the configuration JSON file that you created in the step.
-o OUTPUT             The path and prefix of the output cool names. (example: ./output/test)
-l LIST_PATH          The path to a list. The file format for this list can be either .txt or .npy. You
                      can specify what groups of cells you want to merge and output in two ways:
                      1. The list contains the `cell_id` of interest, e.g. [1,2,10,20,135,..,]. When
                      doing so, the imputation results of these cells would be selected, merged and saved
                      in a file named {OUTPUT}.cool.
                      2. The list contains the group information, e.g. [GM12878, K562, ..., K562].
                      When doing so, the imputation results of cells from each group would be selected,
                      merged and saved in files such as {OUTPUT}_GM12878.cool, etc.
                      If this parameter is not passed, the program will create the merged imputed
                      contact maps of all cells by default.
-t {selected, group}  `selected` represents the first way of specifying cells of interest, while `group`
                      represents the second way.
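The options above can also be assembled from Python, for instance when the script is called from a pipeline. The following is a sketch only: `merge2cool_argv` is a hypothetical helper, it uses just the flags listed in the usage line, and it assumes that -l and -t are always passed together.

```python
def merge2cool_argv(config, output, list_path=None, list_type=None, neighbor=False):
    """Build the argument list for a Merge2Cool.py call."""
    argv = ["python", "Merge2Cool.py", "-c", config, "-o", output]
    if list_path is not None:
        # Assumption: a list is only meaningful together with its type.
        if list_type not in ("selected", "group"):
            raise ValueError("list_type must be 'selected' or 'group'")
        argv += ["-l", list_path, "-t", list_type]
    if neighbor:
        argv.append("-n")
    return argv


# Merge cells per group, using neighbor-based imputation (file names are made up).
print(" ".join(merge2cool_argv("config.json", "./output/test",
                               "groups.npy", "group", neighbor=True)))
```

The returned list can be passed to `subprocess.run(argv, check=True)` from the directory that contains Merge2Cool.py.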

With this script, one can create an individual .cool file for each single cell by inputting a list of the form [cell_0, cell_1, cell_2, ..., cell_N] and specifying the {LIST_TYPE} as group.
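Such a per-cell list could be generated as sketched below. The helper names and the output file name are hypothetical, and the sketch assumes that the list holds one label per cell, in the same order as the cells in the input data.

```python
import numpy as np


def per_cell_labels(n_cells):
    """One distinct group label per cell: cell_0, cell_1, ..."""
    return np.array(["cell_%d" % i for i in range(n_cells)])


def write_per_cell_list(path, n_cells):
    """Save the labels as a .npy list to pass via -l with -t group."""
    labels = per_cell_labels(n_cells)
    np.save(path, labels)
    return labels


print(per_cell_labels(3))
```

Because every cell forms its own group, running the script with this list and `-t group` should write one file per cell, named after the {OUTPUT}_{group}.cool pattern described above.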