9. Customization - WangLabTHU/GPro GitHub Wiki
hcwang and qxdu edited on Aug 4, 2023, 1 version
If you wish to support our work, you can submit a pull request, or you can design your own model within the gpro source code. Here we provide a universal interface for the predictor (regression), assuming the new model is named MyModel.
Design your own predictor
Create your own predictor in gpro/predictor/mymodel/mymodel.py, inside the predictor folder (https://github.com/WangLabTHU/GPro/tree/main/gpro/predictor). The initial code block should be kept consistent with the following:
import os
import sys
import functools
import numpy as np
import torch.nn.functional as F
from collections import OrderedDict
import torch
from torch import nn
from tqdm import tqdm
from scipy.stats import pearsonr
from torch.utils.data import DataLoader, Dataset
from ...utils.utils_predictor import EarlyStopping, seq2onehot, open_fa, open_exp
class SequenceData(Dataset):
    # Labelled dataset used for training and validation
    def __init__(self, data, label):
        self.data = data
        self.target = label
    def __getitem__(self, index):
        return self.data[index], self.target[index]
    def __len__(self):
        return self.data.size(0)
    def __getdata__(self):
        return self.data, self.target
class TestData(Dataset):
    # Unlabelled dataset used for prediction
    def __init__(self, data):
        self.data = data
    def __getitem__(self, index):
        return self.data[index]
    def __len__(self):
        return self.data.size(0)
    def __getdata__(self):
        return self.data
Then define your own MyModel class. Our default input tensor shape is [batch_size, 4, sequence_length]. You can define your own input tensor shape, but you must then adjust the permute(0,2,1) calls in the functions below to match.
class MyModel(nn.Module):
    def __init__(self, ...):
        super(MyModel, self).__init__()
        ...
    def forward(self, x):
        return output
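As a purely illustrative sketch (not a model shipped with gpro), a small 1D convolutional regressor that accepts the default [batch_size, 4, sequence_length] input might look like the following. The layer sizes, the kernel width and the `input_length` argument are assumptions chosen for brevity.

```python
import torch
from torch import nn


class MyModel(nn.Module):
    """Toy CNN regressor; architecture and hyperparameters are illustrative assumptions."""

    def __init__(self, input_length, channels=32, kernel_size=7):
        super().__init__()
        # With an odd kernel, padding=kernel_size // 2 keeps the sequence length unchanged.
        self.conv = nn.Sequential(
            nn.Conv1d(4, channels, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # One pooling step halves the length (floor division).
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * (input_length // 2), 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        # x: [batch_size, 4, sequence_length] -> output: [batch_size, 1]
        return self.head(self.conv(x))


model = MyModel(input_length=50)
dummy = torch.zeros(8, 4, 50)
print(model(dummy).shape)  # torch.Size([8, 1])
```

The single output column is compatible with the outputs.flatten() calls used in the training and prediction functions.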
Then define your own MyModel_language as the final class for training and predicting. You can customize the default parameters for initialization.
class MyModel_language:
    def __init__(self,
                 length,
                 batch_size = 64,
                 model_name = "mymodel",
                 epoch = 200,
                 patience = 50,
                 log_steps = 10,
                 save_steps = 20,
                 exp_mode = "log2"
                 ):
        # Instantiate your own model here; pass whatever arguments MyModel.__init__ expects
        self.model = MyModel(input_length=length)
        self.model_name = model_name
        self.batch_size = batch_size
        self.epoch = epoch
        self.patience = patience
        self.seq_len = length
        self.log_steps = log_steps
        self.save_steps = save_steps
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.exp_mode = exp_mode
We provide default functions train(), predict() and predict_input(). You can copy these functions directly into your class without any modifications:
train():
def train(self, dataset, labels, savepath):
    self.dataset = dataset
    self.labels = labels
    self.checkpoint_root = savepath
    # All results are written to <savepath>/<model_name>/
    filename_sim = os.path.join(self.checkpoint_root, self.model_name)
    os.makedirs(filename_sim, exist_ok=True)
    # stop_order='max': the monitored value (Pearson correlation) should increase
    early_stopping = EarlyStopping(patience=self.patience, verbose=True,
                                   path=os.path.join(filename_sim, 'checkpoint.pth'), stop_order='max')
    total_feature = open_fa(self.dataset)
    total_feature = seq2onehot(total_feature, self.seq_len)
    total_label = open_exp(self.labels, operator=self.exp_mode)
    total_feature = torch.tensor(total_feature, dtype=float) # (sample num,length,4)
    total_label = torch.tensor(total_label, dtype=float) # (sample num)
    # First 70% of the samples for training, remaining 30% for validation
    total_length = int(total_feature.shape[0])
    r = int(total_length*0.7)
    train_feature = total_feature[0:r,:,:]
    train_label = total_label[0:r]
    valid_feature = total_feature[r:total_length,:,:]
    valid_label = total_label[r:total_length]
    train_dataset = SequenceData(train_feature, train_label)
    train_dataloader = DataLoader(dataset=train_dataset,
                                  batch_size=self.batch_size, shuffle=True)
    valid_dataset = SequenceData(valid_feature, valid_label)
    valid_dataloader = DataLoader(dataset=valid_dataset,
                                  batch_size=self.batch_size, shuffle=True)
    train_log_filename = os.path.join(filename_sim, "train_log.txt")
    train_model_filename = os.path.join(filename_sim, "checkpoint.pth")
    print("results saved in: ", filename_sim)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = self.model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
    criterion = torch.nn.HuberLoss(reduction='mean')
    for epoch in tqdm(range(0,self.epoch)):
        ## Training Step
        model.train()
        train_epoch_loss = []
        for idx,(feature,label) in enumerate(train_dataloader,0):
            # (batch, length, 4) -> (batch, 4, length)
            feature = feature.to(torch.float32).to(device).permute(0,2,1)
            label = label.to(torch.float32).to(device)
            outputs = model(feature)
            optimizer.zero_grad()
            # nn.HuberLoss expects (input, target), i.e. (prediction, label)
            loss = criterion(outputs.flatten(), label)
            loss.backward()
            optimizer.step()
            train_epoch_loss.append(loss.item())
        ## Validation Step
        model.eval()
        valid_exp_real = []
        valid_exp_pred = []
        with torch.no_grad():
            for idx,(feature,label) in enumerate(valid_dataloader,0):
                feature = feature.to(torch.float32).to(device).permute(0,2,1)
                label = label.to(torch.float32).to(device)
                outputs = model(feature)
                valid_exp_real += label.float().tolist()
                valid_exp_pred += outputs.flatten().tolist()
        test_coefs = np.corrcoef(valid_exp_real,valid_exp_pred)[0, 1]
        print("real expression samples: ", valid_exp_real[0:5])
        print("pred expression samples: ", valid_exp_pred[0:5])
        print("current coeffs: ", test_coefs)
        cor_pearsonr = pearsonr(valid_exp_real, valid_exp_pred)
        print("current pearsons: ",cor_pearsonr)
        ## Early Stopping Step
        early_stopping(val_loss=test_coefs, model=self.model)
        if early_stopping.early_stop:
            print('Early Stopping......')
            break
        ## Logging and periodic checkpointing
        if (epoch%self.log_steps == 0):
            to_write = "epoch={}, loss={}\n".format(epoch, np.average(train_epoch_loss))
            with open(train_log_filename, "a") as f:
                f.write(to_write)
        if (epoch%self.save_steps == 0):
            torch.save(model.state_dict(), train_model_filename)
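The loop above stops once the validation correlation has failed to improve for `patience` consecutive epochs. The class below is a simplified, hypothetical stand-in for that logic. It is not gpro's `EarlyStopping` (which is also given a checkpoint path), and the name `MaxEarlyStopper` and its attributes are assumptions made for illustration.

```python
class MaxEarlyStopper:
    """Stop when a metric that should increase has stalled for `patience` checks."""

    def __init__(self, patience):
        self.patience = patience
        self.best = None
        self.counter = 0
        self.early_stop = False

    def __call__(self, metric):
        if self.best is None or metric > self.best:
            # New best value: remember it and reset the stall counter.
            self.best = metric
            self.counter = 0
        else:
            self.counter += 1
            if self.counter >= self.patience:
                self.early_stop = True
        return self.early_stop


stopper = MaxEarlyStopper(patience=3)
history = [0.10, 0.35, 0.50, 0.48, 0.49, 0.47, 0.60]  # made-up validation correlations
for epoch, pearson in enumerate(history):
    if stopper(pearson):
        print("stopped at epoch", epoch, "best", stopper.best)
        break
```

In this made-up history the run stops at epoch 5, so the later value 0.60 is never reached. That is the usual trade-off when choosing a small patience.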
predict():
def predict(self, model_path, data_path):
    # model_path points to the checkpoint file; outputs are saved in the same folder
    model_path = os.path.dirname(model_path)
    path_check = '{}/checkpoint.pth'.format(model_path)
    path_seq_save = '{}/seqs.txt'.format(model_path)
    path_pred_save = '{}/preds.txt'.format(model_path)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = self.model.to(device)
    # map_location lets a checkpoint trained on GPU be loaded on a CPU-only machine
    model.load_state_dict(torch.load(path_check, map_location=device))
    model.eval()
    seq_len = self.seq_len
    test_feature = open_fa(data_path)
    test_seqs = test_feature
    test_feature = seq2onehot(test_feature, seq_len)
    test_feature = torch.tensor(test_feature, dtype=float)
    test_dataset = TestData(test_feature)
    test_dataloader = DataLoader(dataset=test_dataset, batch_size = 128, shuffle=False)
    test_exp_pred = []
    with torch.no_grad():
        for idx,feature in enumerate(test_dataloader,0):
            feature = feature.to(torch.float32).to(device).permute(0,2,1)
            outputs = model(feature)
            pred = outputs.flatten().tolist()
            test_exp_pred += pred
    ## Saving Seqs
    with open(path_seq_save,'w') as f:
        for i, seq in enumerate(test_seqs):
            f.write('>' + str(i) + '\n')
            f.write(seq + '\n')
    ## Saving pred exps
    with open(path_pred_save,'w') as f:
        for pred in test_exp_pred:
            f.write(str(np.round(pred,2)) + '\n')
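predict() writes two files next to the checkpoint: seqs.txt (a `>index` header line followed by the sequence) and preds.txt (one prediction per line, rounded to two decimals). A minimal sketch for pairing them again is shown here. The helper name `load_predictions` is hypothetical, and the demo files are created only to keep the example self-contained.

```python
import os
import tempfile


def load_predictions(result_dir):
    """Pair each sequence in seqs.txt with the value on the matching row of preds.txt."""
    with open(os.path.join(result_dir, "seqs.txt")) as f:
        # Skip the '>index' header lines and keep only the sequences.
        seqs = [line.strip() for line in f if line.strip() and not line.startswith(">")]
    with open(os.path.join(result_dir, "preds.txt")) as f:
        preds = [float(line) for line in f if line.strip()]
    if len(seqs) != len(preds):
        raise ValueError("seqs.txt and preds.txt have different numbers of entries")
    return list(zip(seqs, preds))


# Demo files written in the same layout that predict() uses.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "seqs.txt"), "w") as f:
    f.write(">0\nACGT\n>1\nTTGA\n")
with open(os.path.join(demo_dir, "preds.txt"), "w") as f:
    f.write("1.25\n-0.4\n")

pairs = load_predictions(demo_dir)
print(pairs)  # [('ACGT', 1.25), ('TTGA', -0.4)]
```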
predict_input():
def predict_input(self, model_path, inputs, mode="path"):
    model_path = os.path.dirname(model_path)
    path_check = '{}/checkpoint.pth'.format(model_path)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = self.model.to(device)
    # map_location lets a checkpoint trained on GPU be loaded on a CPU-only machine
    model.load_state_dict(torch.load(path_check, map_location=device))
    model.eval()
    seq_len = self.seq_len
    if mode=="path":
        # inputs is the path of a FASTA file
        test_feature = open_fa(inputs)
        test_feature = seq2onehot(test_feature, seq_len)
    elif mode=="data":
        # inputs is a list of sequences
        test_feature = seq2onehot(inputs, seq_len)
    elif mode=="onehot":
        # inputs is already one-hot encoded, shape (sample num,length,4)
        test_feature = inputs
    else:
        raise ValueError("mode must be 'path', 'data' or 'onehot'")
    test_feature = torch.tensor(test_feature, dtype=float)
    test_dataset = TestData(test_feature)
    test_dataloader = DataLoader(dataset=test_dataset, batch_size = 128, shuffle=False)
    exp = []
    with torch.no_grad():
        for idx,feature in enumerate(test_dataloader,0):
            feature = feature.to(torch.float32).to(device).permute(0,2,1)
            outputs = model(feature)
            pred = outputs.flatten().tolist()
            exp += pred
    return exp
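The three modes differ only in how far the input has already been processed: a FASTA path, a list of sequence strings, or an array that is already encoded with shape (sample num, length, 4). The pure-Python sketch below illustrates that last shape. The function `onehot_encode`, the A/C/G/T channel order, and the zero-padding and truncation behavior are assumptions for illustration and may differ from gpro's `seq2onehot`.

```python
def onehot_encode(seqs, length, alphabet="ACGT"):
    """Encode sequences as nested lists of shape (len(seqs), length, 4)."""
    index = {base: i for i, base in enumerate(alphabet)}
    encoded = []
    for seq in seqs:
        rows = []
        # Only the first `length` positions are used, so longer sequences are truncated.
        for pos in range(length):
            row = [0, 0, 0, 0]
            # Positions past the end of the sequence and unknown bases stay all-zero.
            if pos < len(seq) and seq[pos].upper() in index:
                row[index[seq[pos].upper()]] = 1
            rows.append(row)
        encoded.append(rows)
    return encoded


onehot = onehot_encode(["ACGT", "TTN"], length=4)
print(len(onehot), len(onehot[0]), len(onehot[0][0]))  # 2 4 4
print(onehot[1])  # [[0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0]]
```

An array with this layout is what mode="onehot" receives. predict_input() then permutes it to [batch_size, 4, sequence_length] before calling the model.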
Finally, you can train and use your new model in the same way as the other prediction models.
Design your own metrics
Different researchers may require specific evaluation metrics for different promoter contexts. Users can create their own functions under the folder gpro/evaluator/ (https://github.com/WangLabTHU/GPro/tree/main/gpro/evaluator). Here, we assume that you create a new file myfunction.py and that you want to evaluate the final designs in sample.txt with a new metric newmetric. Then myfunction.py should look like:
... # import steps
from ..utils.utils_evaluator import read_fa, seq2onehot
def newmetric(datapath=".../sample.txt", ...):
    samples = read_fa(datapath)
    ... # evaluation steps
    return metric
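As a concrete but hypothetical instance of this template, the sketch below computes the mean GC content of the sequences in a FASTA-style file. It uses a small local reader (`read_sequences`) instead of gpro's `read_fa` so that it runs on its own. The metric itself and both function names are illustrative assumptions.

```python
import os
import tempfile


def read_sequences(datapath):
    """Return the non-header lines of a FASTA-style file (one sequence per line)."""
    with open(datapath) as f:
        return [line.strip() for line in f if line.strip() and not line.startswith(">")]


def gc_content_metric(datapath):
    """Mean fraction of G/C bases across all sequences in the file."""
    samples = read_sequences(datapath)
    if not samples:
        raise ValueError("no sequences found in " + datapath)
    fractions = [sum(base in "GCgc" for base in seq) / len(seq) for seq in samples]
    return sum(fractions) / len(fractions)


# Demo file standing in for sample.txt.
demo_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(demo_path, "w") as f:
    f.write(">0\nGGCC\n>1\nATAT\n>2\nACGT\n")

print(gc_content_metric(demo_path))  # 0.5
```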
Then reinstall the gpro package and import the new metric as follows:
from gpro.evaluator.myfunction import newmetric
This lets you conveniently reuse your new metrics. Additionally, if you have broadly applicable metrics, or influential metrics that we have not covered yet, we welcome you to contribute them through the following steps.
Pull request
You can also submit your own encapsulated generator/predictor/optimizer/evaluator to be merged into gpro via a pull request, which will help others replicate your work. Here are the requirements:
- If the new model has a specific paper reference, please provide a DOI for us to cite.
- Please provide at least one model architecture diagram or a document similar to our wiki.
- The new model must be implemented in PyTorch.
We sincerely appreciate your contribution.