4.2.7 DeepSTARR2 - WangLabTHU/GPro GitHub Wiki


DeepSTARR2 Model Architecture

The DeepSTARR2 model [2] is a CNN regression model for predicting enhancer activity, extending the DeepSTARR architecture proposed in paper [1]. It predicts enhancer activity quantitatively for new sequences and reveals distinct biological grammars, i.e., flanking sequences and motif interactions, enabling the de novo design of synthetic enhancers with desired activity levels.

Here, to help users understand how the model handles biological sequences, we provide a more detailed operational pipeline. There may be minor variations in a few parameters.


In paper [2], a pretrained DeepSTARR2 model was used for transfer learning, enabling binary classification of tissue-specific enhancers. Accordingly, we introduce a binary classifier built on the pretrained regression model, so users can either start the classifier from a pretrained DeepSTARR2 model or train it from scratch. A demo for transfer learning is provided in demo6.

We have encapsulated the source code so that all predictive models share unified input and output parameters. These parameters fall into two groups: those defined during the initialization phase (Initialization) and those defined during the training/sampling phase (Training/Predicting).

Params for DeepSTARR2 regression model

Initialization params

| params | description | default value |
| --- | --- | --- |
| batch_size | training batch size | 64 |
| length | sequence length of the training dataset | 1001 |
| model_name | controls the saving path under "./checkpoints" | deepstarr2 |
| epoch | number of training epochs | 200 |
| patience | early stopping: training stops when the monitored metric has not improved for this many epochs | 50 |
| log_steps | log the model's outputs/criteria every log_steps epochs | 10 |
| save_steps | save the model every save_steps epochs | 20 |
| exp_mode | processing mode for the expression input | direct |
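These initialization parameters map onto the predictor's constructor keyword arguments. The sketch below is only an illustration with arbitrary values; the demos further down confirm length, epoch, patience and model_name as keyword arguments, and we assume the remaining parameters are passed the same way.

```python
from gpro.predictor.deepstarr2.deepstarr2 import DeepSTARR2_language

# Minimal sketch: all values are arbitrary, and passing batch_size, log_steps,
# save_steps and exp_mode as keyword arguments is an assumption.
model = DeepSTARR2_language(
    batch_size=64,            # training batch size
    length=1001,              # sequence length of the training dataset
    model_name="deepstarr2",  # checkpoints are saved under ./checkpoints/deepstarr2
    epoch=200,                # number of training epochs
    patience=50,              # early-stopping patience
    log_steps=10,             # log metrics every 10 epochs
    save_steps=20,            # save the model every 20 epochs
    exp_mode="direct",        # processing mode for the expression input
)
```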

Training/Predicting params

| params | description | default value | flexible stage |
| --- | --- | --- | --- |
| dataset | path to the training sequences, FASTA file | None | train() |
| labels | path to the training expression values, txt file with one expression value per line, corresponding to dataset | None | train() |
| savepath | directory where the final model is saved | None | train() |
| model_path | directory to load the model from | None | predict()/predict_input() |
| data_path | path to the dataset to be predicted, FASTA file | None | predict() |
| inputs | input for predict_input(); can be a data path, a list of sequences, or one-hot encoded data | None | predict_input() |
| mode | input mode for predict_input(); one of "path", "data" or "onehot" | "path" | predict_input() |

Caution: the predict() function writes its predictions directly under the checkpoint path, whereas predict_input() returns the predictions without writing a file.
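For example, assuming model is a DeepSTARR2_language instance initialized as above, predict_input() can score sequences either from a file or from an in-memory list. The checkpoint path, file path and toy sequences below are placeholders.

```python
# Minimal sketch of predict_input(); the paths and sequences are placeholders.
model_path = "./checkpoints/deepstarr2/checkpoint.pth"

# mode="path" (default): read the sequences from a file
scores = model.predict_input(model_path=model_path, inputs="./data/example.txt", mode="path")

# mode="data": pass a list of sequences directly (1001 bp each for this model);
# mode="onehot" similarly accepts already one-hot encoded data.
seqs = ["ACGT" * 250 + "A", "GGCC" * 250 + "A"]
scores = model.predict_input(model_path=model_path, inputs=seqs, mode="data")
print(scores)
```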

Special Function: train_with_valid()

This function is designed for the demo and allows a predetermined validation set to be used for validation without splitting the training dataset (see the Demos section below).

| params | description |
| --- | --- |
| train_dataset | path to the training sequences, FASTA file |
| train_labels | path to the training expression values, txt file with one expression value per line, corresponding to train_dataset |
| valid_dataset | path to the validation sequences, FASTA file |
| valid_labels | path to the validation expression values, txt file with one expression value per line, corresponding to valid_dataset |
| savepath | directory where the final model is saved |

Params for DeepSTARR2_binary classification model

DeepSTARR2_binary consists of the first N-1 layers of the DeepSTARR2 architecture, followed by a new linear layer and a sigmoid normalizer. For transfer learning, users can provide a pretrained DeepSTARR2 model as the starting point for the model parameters.
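Conceptually this is the standard transfer-learning pattern: reuse the pretrained feature extractor and replace the regression head with one linear unit followed by a sigmoid. The PyTorch sketch below only illustrates the idea; the class and attribute names are hypothetical and not part of the GPro API.

```python
import torch.nn as nn

class BinaryHead(nn.Module):
    """Illustration only: wrap a pretrained backbone (the first N-1 layers of
    DeepSTARR2) with a fresh linear layer and a sigmoid normalizer."""
    def __init__(self, backbone: nn.Module, feature_dim: int):
        super().__init__()
        self.backbone = backbone                      # pretrained layers, reused as-is
        self.classifier = nn.Linear(feature_dim, 1)   # newly initialized layer
        self.sigmoid = nn.Sigmoid()                   # squashes the output into [0, 1]

    def forward(self, x):
        features = self.backbone(x)                   # shared representation
        return self.sigmoid(self.classifier(features))
```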

Special Function: train_with_valid()

| params | description |
| --- | --- |
| train_dataset | path to the training sequences, FASTA file |
| train_labels | path to the training expression values, txt file with one value per line, corresponding to train_dataset |
| valid_dataset | path to the validation sequences, FASTA file |
| valid_labels | path to the validation expression values, txt file with one value per line, corresponding to valid_dataset |
| savepath | directory where the final model is saved |
| transfer | False by default; if set to True, training starts from the pretrained model given in modelpath (transfer learning) |
| modelpath | None by default; path of the pretrained model used for transfer learning |

Special Function: predict_without_model()

This function directly uses the pretrained DeepSTARR2 backbone with randomly initialized new layers for classification, without any fine-tuning; it serves as an ablation baseline.

| params | description |
| --- | --- |
| modelpath | path of the pretrained model used for transfer learning |
| inputs | input data; can be a data path, a list of sequences, or one-hot encoded data |
| mode | input mode; one of "path", "data" or "onehot" |
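Following the table above, a call might look like the sketch below; the checkpoint path and input file are placeholders, and we assume the parameters are passed as keyword arguments.

```python
from gpro.predictor.deepstarr2.deepstarr2_binary import DeepSTARR2_binary_language

# Minimal sketch: score sequences with the pretrained DeepSTARR2 backbone and a
# randomly initialized classification head, without any fine-tuning.
model = DeepSTARR2_binary_language(length=1001)
probs = model.predict_without_model(
    modelpath="./checkpoints/deepstarr2/checkpoint.pth",  # pretrained regression model
    inputs="./data/example.txt",                          # placeholder input file
    mode="path",
)
print(probs)
```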

Demos

Demos for DeepSTARR2

A demo for model training and prediction is shown below:

```python
import os
from gpro.predictor.deepstarr2.deepstarr2 import DeepSTARR2_language

model = DeepSTARR2_language(length=1001, epoch=400, patience=10)

# Train
default_root = "your working directory"
train_dataset = os.path.join(default_root, 'data/Accessibility_models_training_data/Train_seq.txt')
train_labels  = os.path.join(default_root, 'data/Accessibility_models_training_data/Train_exp.txt')
valid_dataset = os.path.join(default_root, 'data/Accessibility_models_training_data/Val_seq.txt')
valid_labels  = os.path.join(default_root, 'data/Accessibility_models_training_data/Val_exp.txt')
save_path = os.path.join(default_root, 'checkpoints/')
model.train_with_valid(train_dataset=train_dataset, train_labels=train_labels,
                       valid_dataset=valid_dataset, valid_labels=valid_labels,
                       savepath=save_path)

# Predict: writes the predictions under the checkpoint path
model_path = os.path.join(default_root, "checkpoints/deepstarr2/checkpoint.pth")
data_path  = os.path.join(default_root, "data/example.txt")
model.predict(model_path=model_path, data_path=data_path)

# Predict input: returns the predictions directly instead of writing a file
res = model.predict_input(model_path=model_path, inputs=data_path)
print(res)
```

Demos for DeepSTARR2_binary

```python
import os
from gpro.predictor.deepstarr2.deepstarr2_binary import DeepSTARR2_binary_language

# Train (transfer learning): start from the pretrained DeepSTARR2 regression model
model = DeepSTARR2_binary_language(length=1001, epoch=200, patience=20, model_name="deepstarr2_binary")
default_root = "your working directory"
train_dataset = os.path.join(default_root, 'data/EnhancerActivity_models_training_data/Train_seq.txt')
train_labels  = os.path.join(default_root, 'data/EnhancerActivity_models_training_data/Train_exp.txt')
valid_dataset = os.path.join(default_root, 'data/EnhancerActivity_models_training_data/Val_seq.txt')
valid_labels  = os.path.join(default_root, 'data/EnhancerActivity_models_training_data/Val_exp.txt')
save_path  = os.path.join(default_root, 'checkpoints/')
model_path = os.path.join(default_root, 'checkpoints/deepstarr2/checkpoint.pth')

model.train_with_valid(train_dataset=train_dataset, train_labels=train_labels,
                       valid_dataset=valid_dataset, valid_labels=valid_labels,
                       savepath=save_path, transfer=True, modelpath=model_path)

# Train (random initialization): train the classifier from scratch
model = DeepSTARR2_binary_language(length=1001, epoch=200, patience=20, model_name="deepstarr2_random")
model.train_with_valid(train_dataset=train_dataset, train_labels=train_labels,
                       valid_dataset=valid_dataset, valid_labels=valid_labels,
                       savepath=save_path)

# Predict: writes the predictions under the checkpoint path
model_path = os.path.join(default_root, "checkpoints/deepstarr2_binary/checkpoint.pth")
data_path  = os.path.join(default_root, "data/example.txt")
model.predict(model_path=model_path, data_path=data_path)

# Predict input: returns the predicted probabilities directly
res = model.predict_input(model_path=model_path, inputs=data_path)
print(res)
```

Caution: the prediction value is a probability in [0, 1]. We also provide a simple helper, proba_to_label, in deepstarr2_binary.py, which maps values greater than 0.5 to 1 and values less than 0.5 to 0.
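As an illustration of that thresholding (this snippet mimics the behaviour rather than calling the library function):

```python
# Probabilities above 0.5 map to label 1, probabilities below 0.5 map to label 0.
probs = [0.12, 0.61, 0.97, 0.33]
labels = [1 if p > 0.5 else 0 for p in probs]
print(labels)  # [0, 1, 1, 0]
```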

Citations

[1] de Almeida B P, Reiter F, Pagani M, et al. DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers[J]. Nature Genetics, 2022, 54(5): 613-624.
[2] de Almeida B P, Schaub C, Pagani M, et al. Targeted design of synthetic enhancers for selected tissues in the Drosophila embryo[J]. Nature, 2023: 1-2.