4.3.1 Filter - WangLabTHU/GPro GitHub Wiki
hcwang and qxdu edited on Aug 4, 2023, 1 version
The `Filter()` class implements the most traditional sequence optimization strategy. After training a generator and a predictor, users generate a large number of new sequences with the generator and then use the predictor to select the sequences predicted to perform best. The accuracy of this method has been verified. The following figure shows how the sequences generated by the diffusion model and the WGAN model differ from natural sequences:
We then used MEME to analyze what the predictor focuses on relative to natural sequences. The predictor's preferences are consistent with the known biology of natural Escherichia coli sequences, which suggests that a good predictor can guide us toward sequences with better performance.
A designer can pair any generative model with any prediction model described above. A designer can run pipelines such as:
generator/predictor training -> sampling -> optimizing -> feedback or output
params | description | default value |
---|---|---|
generator | generator model class | None |
predictor | predictor model class | None |
generator_modelpath | trained model path of generator | None |
predictor_modelpath | trained model path of predictor | None |
natural_datapath | natural sequences datapath | None |
sample_number | number of sequences sampled at each epoch | None |
savepath | final results saving directory | None |
params | description | default value |
---|---|---|
MaxEpoch | number of epochs; sampling of sample_number sequences is repeated MaxEpoch times | 10 |
MaxPoolsize | number of sequences kept in the final selection | 2000 |
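
The interaction between `sample_number`, `MaxEpoch`, and `MaxPoolsize` can be illustrated with a minimal sketch of a generate-then-filter loop. This is not GPro's implementation: `filter_sequences`, `toy_generate`, and `toy_predict` are hypothetical stand-ins, and the toy predictor simply scores GC content. The sketch assumes that each epoch samples `sample_number` sequences and that only the `MaxPoolsize` best-scoring sequences are kept.

```python
import heapq
import random


def filter_sequences(generate, predict, sample_number, max_epoch=10, max_poolsize=2000):
    """Sample `sample_number` sequences per epoch and keep the best-scoring ones.

    `generate(n)` returns n sequences; `predict(seq)` returns a score.
    Both are hypothetical callables, not GPro classes.
    """
    pool = {}  # sequence -> predicted score; the dict also removes duplicates
    for _ in range(max_epoch):
        for seq in generate(sample_number):
            if seq not in pool:
                pool[seq] = predict(seq)
        # Trim the pool so that it never holds more than max_poolsize entries.
        if len(pool) > max_poolsize:
            best = heapq.nlargest(max_poolsize, pool.items(), key=lambda kv: kv[1])
            pool = dict(best)
    # Return (sequence, score) pairs, highest score first.
    return sorted(pool.items(), key=lambda kv: kv[1], reverse=True)


# Toy stand-ins for a trained generator and predictor (assumptions for illustration).
rng = random.Random(0)


def toy_generate(n, length=50):
    return ["".join(rng.choice("ACGT") for _ in range(length)) for _ in range(n)]


def toy_predict(seq):
    # GC fraction used as a placeholder "expression" score.
    return (seq.count("G") + seq.count("C")) / len(seq)


selected = filter_sequences(toy_generate, toy_predict,
                            sample_number=100, max_epoch=3, max_poolsize=20)
```

Under these assumptions, 300 candidates are scored in total and the 20 with the highest scores are returned, which mirrors the roles the table gives to the three parameters.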
Before running the optimizer, you must have trained a generator and a predictor.
A simple demo looks like this:
import os
from gpro.optimizer.model_driven.filter import Filter
# (1) define the generator
from gpro.generator.diffusion.diffusion import Diffusion_language
default_root = "your working directory"
generator = Diffusion_language(length=50)
generator_modelpath = os.path.join(default_root, 'checkpoints/diffusion/')
# (2) define the predictor
from gpro.predictor.cnn_k15.cnnk15 import CNN_K15_language
predictor = CNN_K15_language(length=50)
predictor_modelpath = os.path.join(default_root, 'checkpoints/cnn_k15/checkpoint.pth')
# (3) select the highly expressed sequences
natural_datapath = os.path.join(default_root, 'data/diffusion_prediction/seq.txt')
tmp = Filter(generator=generator, predictor=predictor, generator_modelpath=generator_modelpath, predictor_modelpath=predictor_modelpath,
             natural_datapath=natural_datapath, savepath="./optimization/Filter")
tmp.run()
This program returns the top sample_number sequences generated by the diffusion model, as ranked by CNN K15, together with their predicted expression values.
The resulting files consist of `compared_with_natural.pdf`, `ExpIter.txt`, and `ExpIter.csv`.
files | description |
---|---|
compared_with_natural.pdf | Box plot comparing model-generated results with natural results |
ExpIter.txt | FASTA file of the final selected sequences |
ExpIter.csv | Final selected sequences and their predicted values |
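
Since `ExpIter.txt` is described as a FASTA file, it can be loaded without extra dependencies. The following is a minimal sketch, assuming standard FASTA layout (a `>` header line followed by one or more sequence lines). The helper `read_fasta` and the demo records are hypothetical and only illustrate the format; the header names GPro actually writes are not specified here.

```python
import os
import tempfile


def read_fasta(path):
    """Return a list of (header, sequence) tuples from a FASTA file."""
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            else:
                # Sequence lines may wrap; join them into one string.
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Demo on a small made-up file standing in for ExpIter.txt.
demo_path = os.path.join(tempfile.mkdtemp(), "ExpIter.txt")
with open(demo_path, "w") as handle:
    handle.write(">seq_0\nACGTACGT\n>seq_1\nTTGACA\nTATAAT\n")

records = read_fasta(demo_path)
```

For the real output, `read_fasta` would be pointed at the `ExpIter.txt` file inside the directory passed as `savepath`.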
A box plot is shown below.