Higashi Usage - ma-compbio/Higashi GitHub Wiki

Data processing

Run the following commands to process the input data.

cd higashi
python Process.py [-c CONFIG]

```
required arguments:
-c CONFIG             The path to the configuration JSON file that you created in step 2
```

This script performs the following tasks:

  • generate a dictionary that maps genomic bin loci to node ids
  • extract data from data.txt and convert it into hyperedges (triplets)
  • create contact maps from the sparse scHi-C data for visualization, the baseline model, and node attribute generation
  • run linear convolution + random walk with restart (scHiCluster) to impute the contact maps for the baseline and for visualization
  • generate node attributes
  • (Optional) process co-assayed signals
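The first two tasks can be illustrated with a minimal sketch. It is not Higashi's actual code: the function names, the toy chromosome sizes, the 1 Mb resolution, and the `(cell_id, chrom, pos1, pos2)` contact layout are all assumptions made for illustration, and the real layout of data.txt may differ.

```python
from typing import Dict, List, Tuple

Bin = Tuple[str, int]  # (chromosome name, bin index along that chromosome)


def build_bin_to_node(chrom_sizes: Dict[str, int], resolution: int) -> Dict[Bin, int]:
    """Assign one consecutive integer node id to every genomic bin (hypothetical helper)."""
    bin_to_node: Dict[Bin, int] = {}
    next_id = 0
    for chrom, size in chrom_sizes.items():
        n_bins = -(-size // resolution)  # ceiling division: a partial last bin still counts
        for bin_index in range(n_bins):
            bin_to_node[(chrom, bin_index)] = next_id
            next_id += 1
    return bin_to_node


def contacts_to_triplets(
    contacts: List[Tuple[int, str, int, int]],
    bin_to_node: Dict[Bin, int],
    resolution: int,
) -> List[Tuple[int, int, int]]:
    """Turn (cell_id, chrom, pos1, pos2) contacts into (cell, node, node) triplets."""
    triplets = []
    for cell_id, chrom, pos1, pos2 in contacts:
        node1 = bin_to_node[(chrom, pos1 // resolution)]
        node2 = bin_to_node[(chrom, pos2 // resolution)]
        # Store the two bin nodes in sorted order so each pair has one canonical form.
        triplets.append((cell_id, min(node1, node2), max(node1, node2)))
    return triplets


# Toy inputs (assumed values, not taken from any real dataset).
chrom_sizes = {"chr1": 2_500_000, "chr2": 1_000_000}
resolution = 1_000_000
bin_to_node = build_bin_to_node(chrom_sizes, resolution)

contacts = [
    (0, "chr1", 100_000, 2_100_000),  # cell 0, bins 0 and 2 of chr1
    (1, "chr2", 500_000, 900_000),    # cell 1, both ends fall in bin 0 of chr2
]
triplets = contacts_to_triplets(contacts, bin_to_node, resolution)
print(bin_to_node)
print(triplets)
```

Keeping the dictionary keyed by `(chromosome, bin index)` makes it easy to translate node ids back to genomic coordinates later.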

Before each step runs, the script prints a message indicating its progress, which helps with debugging.
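The random-walk-with-restart imputation listed among the tasks above can be sketched on a single toy contact map. This is a generic illustration of the technique, not the scHiCluster or Higashi implementation: the function name, the restart probability of 0.5, the tolerance, and the toy matrix are assumptions, and the linear convolution step is omitted.

```python
import numpy as np


def random_walk_with_restart(contact_map, restart_prob=0.5, tol=1e-6, max_iter=100):
    """Smooth a symmetric contact map by iterating a random walk with restart."""
    A = np.asarray(contact_map, dtype=float)
    row_sums = A.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0           # avoid dividing by zero for empty rows
    P = A / row_sums                        # row-normalized transition matrix
    identity = np.eye(A.shape[0])
    Q = identity.copy()
    for _ in range(max_iter):
        # Take one walk step with probability (1 - restart_prob), otherwise restart.
        Q_next = (1.0 - restart_prob) * Q @ P + restart_prob * identity
        if np.linalg.norm(Q_next - Q) < tol:
            Q = Q_next
            break
        Q = Q_next
    return Q


# Toy sparse contact map for four bins (assumed values).
sparse_map = np.array(
    [
        [0, 2, 0, 0],
        [2, 0, 1, 0],
        [0, 1, 0, 3],
        [0, 0, 3, 0],
    ]
)
imputed = random_walk_with_restart(sparse_map)
print(np.round(imputed, 3))
```

After the walk, bin pairs with no observed contact (such as bins 0 and 3 here) receive a small nonzero score because they are connected through intermediate bins.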

Train the Higashi model

python main_cell.py [-c CONFIG] [-s START]

```
optional arguments:
-s {1,2,3}            The step at which the Higashi program starts. Can be used to resume 
                      training after an interruption. 1, 2, and 3 stand for the following steps: 
                      1. Train Higashi without the cell-dependent GNN, forcing the self-attention 
                      layers to capture the heterogeneity of chromatin structures
                      2. Train Higashi with the cell-dependent GNN, but with k=0
                      3. Train Higashi with the cell-dependent GNN, with k=`neighbor_num` from the 
                      config JSON
                      When set to 1, the program runs steps 1, 2, 3 sequentially. 
                      When set to 2, the program runs steps 2, 3 sequentially. (default: 1)

required arguments:
-c CONFIG             The path to the configuration JSON file that you created in step 2
```
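The start-step behavior described above can be sketched with `argparse`. This is a hypothetical reconstruction for illustration, not the contents of main_cell.py: the long option names `--config` and `--start`, the `STEPS` table, and the `steps_to_run` helper are assumptions.

```python
import argparse

# Short labels for the three training steps described in the help text.
STEPS = {
    1: "train without cell-dependent GNN",
    2: "train with cell-dependent GNN, k=0",
    3: "train with cell-dependent GNN, k=neighbor_num",
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of a resumable three-step training CLI")
    parser.add_argument("-c", "--config", required=True,
                        help="path to the configuration JSON file")
    parser.add_argument("-s", "--start", type=int, choices=[1, 2, 3], default=1,
                        help="step to start from (default: 1)")
    return parser


def steps_to_run(start: int) -> list:
    """Return the start step and every later step, in order."""
    return [step for step in sorted(STEPS) if step >= start]


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["-c", "config.json", "-s", "2"])
for step in steps_to_run(args.start):
    print(f"step {step}: {STEPS[step]}")
```

With `-s 2`, the sketch skips step 1 and runs steps 2 and 3, matching the behavior the help text describes.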

**Extra notes:** Higashi saves the model parameters and embeddings every 5 epochs, so the user can check whether the embeddings look good while training is still in progress. For instance, suppose the user is not sure how many epochs Higashi needs to converge on a new dataset and sets embedding_epoch to 120 to be on the safe side. During training, the user finds that the embeddings converge at around epoch 58. Instead of waiting for all 120 epochs to finish, the user can wait until the model finishes epoch 60 (because parameters are saved every 5 epochs) and then interrupt the Higashi program. The user can then restart Higashi with the option -s 2 to load the pre-trained model and skip the first training stage, which generates the embeddings.
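The arithmetic in this example can be written as a small helper. The function name is hypothetical, and it assumes checkpoints are written exactly at multiples of the save interval, as the example above implies.

```python
def next_checkpoint_epoch(converged_epoch: int, save_every: int = 5) -> int:
    """Return the first checkpoint epoch at or after the epoch where embeddings converged."""
    # Integer ceiling division avoids floating-point rounding.
    return -(-converged_epoch // save_every) * save_every


embedding_epoch = 120  # value set in the config in the example above
stop_at = next_checkpoint_epoch(58)
print(f"interrupt after epoch {stop_at}, skipping {embedding_epoch - stop_at} epochs")
```

If convergence happens exactly on a checkpoint epoch (for example epoch 60), the helper returns that same epoch.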