4.2.1 CNN K15 - WangLabTHU/GPro GitHub Wiki

hcwang and qxdu edited on Aug 4, 2023, 1 version

CNN K15 Model Architecture

CNN K15 is a predictive model for the k1.5 virus promoter, trained on a mutation set of wild-type sequences with a sequence length of 31 bp. The model is particularly well suited to short sequences (~50 bp) and performs well on them. We will soon upload the related paper to bioRxiv.

To help users understand how the model processes biological sequences, we provide a more detailed operational pipeline here.

Input Parameters

We have encapsulated the source code so that all predictive models share unified input and output parameters. There are two types of parameters: those defined during the initialization phase (Initialization), and those defined during the training/prediction phase (Training/Predicting).

Initialization params

| params | description | default value |
| --- | --- | --- |
| batch_size | training batch size | 64 |
| length | sequence length of the training dataset | 50 |
| model_name | controls the saving path under "./checkpoints" | cnn_k15 |
| epoch | number of training epochs | 200 |
| patience | early-stopping patience; training stops when the indicators no longer improve | 50 |
| log_steps | interval, in epochs, at which the model output/metrics are logged | 10 |
| save_steps | interval, in epochs, at which the model results are saved | 20 |
| exp_mode | processing mode for the expression input | log2 |
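
To illustrate how `epoch` and `patience` usually interact, here is a minimal early-stopping sketch. It is not GPro's implementation: the function name `train_with_early_stopping` is hypothetical, and using a per-epoch validation loss as the monitored indicator is an assumption.

```python
def train_with_early_stopping(val_losses, epoch=200, patience=50):
    """Return (epochs_run, best_epoch) for a list of per-epoch validation losses.

    Assumption: the monitored indicator is a loss, so lower is better.
    """
    best_loss = float("inf")
    best_epoch = -1
    stale = 0        # epochs since the last improvement
    epochs_run = 0
    for i, loss in enumerate(val_losses[:epoch]):
        epochs_run = i + 1
        if loss < best_loss:
            # Improvement: remember it and reset the counter.
            best_loss, best_epoch, stale = loss, i, 0
        else:
            stale += 1
            if stale >= patience:
                break  # no improvement for `patience` epochs in a row
    return epochs_run, best_epoch


losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.60]
epochs_run, best_epoch = train_with_early_stopping(losses, epoch=200, patience=3)
print(epochs_run, best_epoch)
```

With a small `patience`, training can stop before a later improvement is reached (the final loss of 0.60 above is never seen), so a value that is too small may end training prematurely.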

Training/Predicting params

| params | description | default value | flexible stage |
| --- | --- | --- | --- |
| dataset | path to the training sequences, fasta file | None | train() |
| labels | path to the training expression values, txt file; each line holds the expression of the corresponding sequence in dataset | None | train() |
| savepath | directory in which the final model is saved | None | train() |
| model_path | path from which the model is loaded | None | predict()/predict_input() |
| data_path | dataset to be predicted, fasta file | None | predict() |
| inputs | data for predict_input; can be a data path, a sequence list, or one-hot encoded data | None | predict_input() |
| mode | input mode for predict_input; can be "path", "data" or "onehot" | "path" | predict_input() |
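
Because each line of `labels` must correspond to one sequence in `dataset`, it can help to check the two files before calling train(). The sketch below is not part of GPro; the helpers `read_fasta`, `read_labels` and `check_pair` are hypothetical names, and the file contents are made up for illustration.

```python
import os
import tempfile


def read_fasta(path):
    """Return the sequences in a FASTA file, joining wrapped lines."""
    seqs, current = [], []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts: flush the previous one, if any.
                if current:
                    seqs.append("".join(current))
                    current = []
            else:
                current.append(line.upper())
    if current:
        seqs.append("".join(current))
    return seqs


def read_labels(path):
    """Return one float per non-empty line."""
    with open(path) as fh:
        return [float(line) for line in fh if line.strip()]


def check_pair(dataset, labels):
    """Raise ValueError if the number of sequences and labels differ."""
    seqs, exps = read_fasta(dataset), read_labels(labels)
    if len(seqs) != len(exps):
        raise ValueError(f"{len(seqs)} sequences but {len(exps)} labels")
    return seqs, exps


# Tiny made-up example written to a temporary directory.
workdir = tempfile.mkdtemp()
dataset = os.path.join(workdir, "seq.fasta")
labels = os.path.join(workdir, "exp.txt")
with open(dataset, "w") as fh:
    fh.write(">seq1\nACGT\nACGT\n>seq2\nTTGACA\n")
with open(labels, "w") as fh:
    fh.write("1.5\n0.25\n")

seqs, exps = check_pair(dataset, labels)
print(len(seqs), len(exps))
```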

Caution: the predict() function writes its output directly to a file under the checkpoint path, whereas predict_input() returns its result and does not generate a file automatically.
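
If the result of predict_input() needs to be kept, it has to be written out by hand. The following is a minimal sketch, assuming the result can be iterated as a sequence of numbers (this page does not specify its exact type); `save_predictions` is a hypothetical helper and the values in `res` are placeholders, not real predictions.

```python
import os
import tempfile


def save_predictions(values, path):
    """Write one predicted value per line, mirroring the labels file layout."""
    with open(path, "w") as fh:
        for v in values:
            fh.write(f"{float(v)}\n")
    return path


# Placeholder standing in for the value returned by predict_input().
res = [0.12, 1.5, 2.75]
out_path = save_predictions(res, os.path.join(tempfile.mkdtemp(), "preds.txt"))

# Read the file back to confirm the round trip.
with open(out_path) as fh:
    saved = [float(line) for line in fh]
print(saved)
```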

Demo

A demo for model training/predicting is described below:

import os

from gpro.predictor.cnn_k15.cnn_k15 import CNN_K15_language

# Initialization parameters are set once, when the model object is created
model = CNN_K15_language(length=50, epoch=400, patience=10)

# Train
default_root = "your working directory"
dataset = os.path.join(default_root, 'data/seq.txt')    # training sequences
labels = os.path.join(default_root, 'data/exp.txt')     # one expression value per sequence
save_path = os.path.join(default_root, 'checkpoints/')  # where the trained model is saved
model.train(dataset=dataset, labels=labels, savepath=save_path)

# Predict: the output is written to a file under the checkpoint path
model_path = os.path.join(default_root, "checkpoints/cnn_k15/checkpoint.pth")
data_path = os.path.join(default_root, "data/example.txt")
model.predict(model_path=model_path, data_path=data_path)

# Predict input: the result is returned and no file is written (default mode is "path")
res = model.predict_input(model_path=model_path, inputs=data_path)
print(res)
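
For readers unfamiliar with the "onehot" input mode of predict_input, the sketch below shows what one-hot encoding of a DNA sequence means. It is only an illustration: the channel order A, C, G, T, the all-zero row for unknown bases, and the helper name `one_hot` are assumptions, and the exact array layout GPro expects is not documented on this page.

```python
ALPHABET = "ACGT"  # assumed channel order, not confirmed by GPro


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows, one row per base.

    Bases outside ALPHABET (for example N) become all-zero rows.
    """
    return [[1 if base == letter else 0 for letter in ALPHABET]
            for base in seq.upper()]


encoded = one_hot("acgtn")
for row in encoded:
    print(row)
```

Each row has exactly one non-zero entry for a known base, so a sequence of length L becomes an L x 4 matrix.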
