Input Files - ma-compbio/Higashi GitHub Wiki

Both Higashi and Fast-Higashi support the same input file format. All input files should be placed in the same directory. The path to this directory is needed when configuring the input files (see details in Usage).

  1. scHi-C dataset. The dataset can be provided in either of the following two formats; specify which one is used in the configuration file (see details in Usage).
  • "higashi_v1" format: data.txt, a tab-separated file with the following columns: ['cell_id', 'chrom1', 'pos1', 'chrom2', 'pos2', 'count'] (the header should be included in the first row). The meanings of these columns are:
    1. cell_id: (int), ID of each cell, starting from 0. IDs must be contiguous (0, 1, 2, ..., #cell - 1). Rows do not need to be sorted by cell_id.
    2. chrom1: (str), chromosome name for fragment 1
    3. pos1: (int), location for fragment 1
    4. chrom2: (str), chromosome name for fragment 2
    5. pos2: (int), location for fragment 2. Note that only intra-chromosomal reads are used in Higashi. Fragment 2 does not need to have a larger coordinate than fragment 1; the downstream processing code takes care of the ordering.
    6. count: (int or float), count number or normalized weight for the interaction.

The old cell_name column is deprecated because it is redundant. The corresponding information can be included in label_info.pickle.
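The "higashi_v1" layout described above can be sketched with the standard library alone. This is a minimal illustration, not code from Higashi: the function name `write_higashi_v1` and the example contacts are made up, and the contiguity check only mirrors the rule stated for cell_id.

```python
import csv
from pathlib import Path

# Column order taken from the "higashi_v1" description.
COLUMNS = ["cell_id", "chrom1", "pos1", "chrom2", "pos2", "count"]


def write_higashi_v1(contacts, out_dir):
    """Write contacts to <out_dir>/data.txt as a tab-separated file with a header row.

    `contacts` is an iterable of (cell_id, chrom1, pos1, chrom2, pos2, count) tuples.
    (Hypothetical helper, not part of Higashi.)
    """
    contacts = list(contacts)
    # cell_id values must cover 0 .. #cell - 1 with no gaps.
    cell_ids = sorted({row[0] for row in contacts})
    if cell_ids != list(range(len(cell_ids))):
        raise ValueError("cell_id values must be contiguous and start from 0")
    out_path = Path(out_dir) / "data.txt"
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(COLUMNS)
        writer.writerows(contacts)
    return out_path


# Made-up contacts for two cells.
example_contacts = [
    (0, "chr1", 1_000_000, "chr1", 3_500_000, 1),
    (1, "chr2", 250_000, "chr2", 120_000, 2),      # pos2 < pos1 is allowed
    (0, "chr1", 4_200_000, "chr1", 4_900_000, 1),  # rows need not be sorted by cell_id
]
```

Calling `write_higashi_v1(example_contacts, "path/to/data_dir")` would place data.txt in the directory that holds the other input files.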

  • "higashi_v2" format (recommended): filelist.txt, a list of the contact pair files. Each contact pair file corresponds to a single cell and is stored in tab-delimited format. Each file must include the following columns: ['chrom1', 'pos1', 'chrom2', 'pos2']. The 'count' column is optional; when it is not provided, the count is assumed to be 1. A header row in each file is optional, and whether one is present must be specified in the configuration file (see details in Usage).
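A minimal sketch of producing "higashi_v2" input follows. Several details are assumptions rather than documented behavior: that filelist.txt holds one file path per line, the `cell_<i>.txt` naming scheme, and the helper name `write_higashi_v2`. The 'count' column is left out, so each contact would be treated as a count of 1.

```python
import csv
from pathlib import Path


def write_higashi_v2(cells, out_dir, with_header=True):
    """Write one contact pair file per cell, plus filelist.txt listing them.

    `cells[i]` is a list of (chrom1, pos1, chrom2, pos2) tuples for cell i.
    (Hypothetical helper, not part of Higashi.)
    """
    out_dir = Path(out_dir)
    paths = []
    for i, contacts in enumerate(cells):
        cell_path = out_dir / f"cell_{i}.txt"  # hypothetical file naming
        with open(cell_path, "w", newline="") as handle:
            writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
            if with_header:
                writer.writerow(["chrom1", "pos1", "chrom2", "pos2"])
            writer.writerows(contacts)
        paths.append(cell_path)
    # Assumption: one path per line, in cell order.
    filelist = out_dir / "filelist.txt"
    filelist.write_text("".join(f"{p}\n" for p in paths))
    return filelist


# Made-up contacts: two cells, one list of contacts each.
example_cells = [
    [("chr1", 1_000_000, "chr1", 3_500_000), ("chr1", 4_200_000, "chr1", 4_900_000)],
    [("chr2", 250_000, "chr2", 120_000)],
]
```

Whichever value of `with_header` is used, the matching header setting has to be given in the configuration file.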

  2. label_info.pickle, a Python pickle file of a dictionary storing label information for the cells. If there is no label information, create an empty dictionary in Python and save it as a pickle. An example of the dictionary structure and how to save it as a pickle:
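import pickle

label_info = {
  # Each vector has one entry per cell; "..." stands for the elided entries.
  'cell type': ['GM12878', 'K562', 'NHEK', ..., 'GM12878'],
  'coverage': [12000, 14000, ..., 15000],
  'batch': ['batch_1', 'batch_1', ..., 'batch_2'],
  # ... further label vectors
}
# The with-block closes the file once the pickle has been written.
with open("label_info.pickle", "wb") as output_label_file:
    pickle.dump(label_info, output_label_file)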

The order of each label vector should be consistent with the cell_id column of data.txt or with the order of the files in filelist.txt.
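Because every label vector is matched to cells by position, a length check before saving can catch mismatches early. The sketch below is an illustration only; `save_label_info` is a hypothetical helper, and it cannot verify that the ordering itself is right, only that each vector has one entry per cell.

```python
import pickle
from pathlib import Path


def save_label_info(label_info, n_cells, out_dir):
    """Check that every label vector has n_cells entries, then pickle the dictionary.

    (Hypothetical helper, not part of Higashi.)
    """
    for key, values in label_info.items():
        if len(values) != n_cells:
            raise ValueError(
                f"label {key!r} has {len(values)} entries, expected {n_cells}"
            )
    out_path = Path(out_dir) / "label_info.pickle"
    with open(out_path, "wb") as handle:
        pickle.dump(label_info, handle)
    return out_path


# Made-up labels for three cells.
example_labels = {
    "cell type": ["GM12878", "K562", "NHEK"],
    "batch": ["batch_1", "batch_1", "batch_2"],
}
```

With no label information, `save_label_info({}, n_cells, out_dir)` writes the empty dictionary that the description above asks for.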

If you have a pandas DataFrame, you can convert it into the label_info pickle with:

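import pickle
import numpy as np

# df is an existing pandas DataFrame with one row per cell, in cell_id order.
label_info = {k: np.asarray(df[k]) for k in df.columns}
with open("label_info.pickle", "wb") as output_label_file:
    pickle.dump(label_info, output_label_file)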

Note: the key name cell_name_higashi is reserved for associating each cell_id with its cell name. This information is now displayed in Higashi_vis (see the demo on the right-hand side).

  3. (Optional) sc_signal.hdf5, an HDF5 file storing the co-assayed signals. The structure of the HDF5 file:
.
├── signal1
│   ├── bin (information about the entries in the signal 1 file. If signal 1 is not based on genomic coordinates, leave this group out)
│   │   ├── chrom
│   │   ├── start
│   │   └── end
│   ├── 0 (signals, size should be the same as signal1/bin/chrom)
│   ├── 1
│   └── 2
├── signal2
│   ├── ...
└── ...
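A file with this layout could be built with h5py roughly as follows. This is a sketch under stated assumptions: reading the numbered entries (0, 1, 2) as one dataset per cell_id is an interpretation of the tree, storing chromosome names as fixed-length byte strings is a choice made here, and `write_sc_signal` is a hypothetical helper rather than part of Higashi.

```python
import h5py
import numpy as np


def write_sc_signal(path, signal_name, chrom, start, end, per_cell_signals):
    """Add one signal group (bin info plus one dataset per cell) to an HDF5 file.

    `per_cell_signals[i]` is stored as dataset str(i) and must have one value
    per bin. (Hypothetical helper, not part of Higashi.)
    """
    n_bins = len(chrom)
    arrays = [np.asarray(v, dtype=np.float32) for v in per_cell_signals]
    # Validate before touching the file so a bad input leaves it unchanged.
    for i, values in enumerate(arrays):
        if values.shape != (n_bins,):
            raise ValueError(f"signal for cell {i} must have {n_bins} values")
    with h5py.File(path, "a") as f:
        grp = f.create_group(signal_name)
        bin_grp = grp.create_group("bin")
        # Assumption: chromosome names stored as fixed-length byte strings.
        bin_grp.create_dataset("chrom", data=np.asarray(chrom, dtype="S"))
        bin_grp.create_dataset("start", data=np.asarray(start, dtype=np.int64))
        bin_grp.create_dataset("end", data=np.asarray(end, dtype=np.int64))
        for i, values in enumerate(arrays):
            grp.create_dataset(str(i), data=values)


# Made-up bins and signals for two cells.
example_chrom = ["chr1", "chr1", "chr2"]
example_start = [0, 1_000_000, 0]
example_end = [1_000_000, 2_000_000, 1_000_000]
example_signals = [[0.5, 1.0, 0.0], [2.0, 0.0, 3.5]]
```

For a signal that is not based on genomic coordinates, the bin group would be left out; the helper above always writes it, so it would need adapting for that case.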