4.3.1 Filter - WangLabTHU/GPro GitHub Wiki

hcwang and qxdu edited on Aug 4, 2023, 1 version

Introduction

The Filter() class implements the most traditional sequence optimization approach. After training a generator and a predictor, users sample a large number of new sequences from the generator and then use the predictor to select the sequences predicted to perform best. The accuracy of this method has been verified. The following figure shows how the sequences generated by the diffusion model and the WGAN model differ from natural sequences:

We then used MEME to analyze what new information the predictor focuses on relative to natural sequences. The predictor's preferences are consistent with the known biology of natural Escherichia coli sequences. A good predictor can therefore guide us toward sequences with better performance.


A designer can pair any generative model with any prediction model mentioned above. A designer can be used to run pipelines such as:

generator/predictor training -> sampling -> optimizing -> feedback or output
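The generate-then-filter idea can be sketched in a few lines of plain Python. This is a minimal illustration, not GPro's implementation: `toy_generate` and `toy_predict` are hypothetical stand-ins (random DNA sequences and a GC-fraction score) for a trained generator and predictor, and `generate_then_filter` is an illustrative name.

```python
import random

def toy_generate(n, length=50, seed=0):
    """Hypothetical generator: n random DNA sequences of a fixed length."""
    rng = random.Random(seed)
    return ["".join(rng.choice("ACGT") for _ in range(length)) for _ in range(n)]

def toy_predict(seqs):
    """Hypothetical predictor: GC fraction as a stand-in for predicted expression."""
    return [(s.count("G") + s.count("C")) / len(s) for s in seqs]

def generate_then_filter(generate, predict, n_samples, top_k):
    """Sample n_samples sequences, score them, and keep the top_k (sequence, score) pairs."""
    seqs = generate(n_samples)
    scores = predict(seqs)
    ranked = sorted(zip(seqs, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]

# Sampling -> scoring -> selecting, with the toy stand-ins above.
best = generate_then_filter(toy_generate, toy_predict, n_samples=1000, top_k=10)
for seq, score in best[:3]:
    print(f"{score:.2f}\t{seq}")
```

Passing the generator and predictor as callables mirrors the point that any generative model can be paired with any prediction model.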

Input Parameters

Initialization params

| params | description | default value |
| --- | --- | --- |
| generator | generator model class | None |
| predictor | predictor model class | None |
| generator_modelpath | trained model path of generator | None |
| predictor_modelpath | trained model path of predictor | None |
| natural_datapath | natural sequences datapath | None |
| sample_number | default sampling scale at each epoch | None |
| savepath | final results saving directory | None |

Running params

| params | description | default value |
| --- | --- | --- |
| MaxEpoch | number of sampling rounds; sampling of sample_number sequences is repeated MaxEpoch times | 10 |
| MaxPoolsize | number of sequences kept in the final selection | 2000 |
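One way the two running parameters could interact is sketched below: each round samples a fresh batch, merges it into a deduplicated pool, and trims the pool to the best entries. This is an assumption about the general pattern, not a description of GPro's internals; `sample_batch`, `score`, and `run_filter` are hypothetical names, and the GC-fraction score is only a placeholder for a real predictor.

```python
import heapq
import random

def sample_batch(rng, n, length=50):
    """Hypothetical sampler: n random DNA sequences."""
    return ["".join(rng.choice("ACGT") for _ in range(length)) for _ in range(n)]

def score(seq):
    """Placeholder predictor: GC fraction of the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def run_filter(sample_number, max_epoch=10, max_poolsize=2000, seed=0):
    """Repeat sampling max_epoch times and keep at most max_poolsize best sequences."""
    rng = random.Random(seed)
    pool = {}  # sequence -> score; a dict removes duplicate sequences
    for _ in range(max_epoch):
        for seq in sample_batch(rng, sample_number):
            pool[seq] = score(seq)
        # Trim the pool to the highest-scoring entries after every round.
        pool = dict(heapq.nlargest(max_poolsize, pool.items(), key=lambda kv: kv[1]))
    # Return (sequence, score) pairs, best first.
    return sorted(pool.items(), key=lambda kv: kv[1], reverse=True)

final = run_filter(sample_number=100, max_epoch=5, max_poolsize=20)
print(len(final), final[0])
```

Trimming after every round keeps memory bounded by roughly `max_poolsize + sample_number` entries regardless of how many rounds are run.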

Demo

Before running the optimizer, you must have trained a generator and a predictor.

A simple demo will work like:

import os

from gpro.optimizer.model_driven.filter import Filter

# (1) define the generator
from gpro.generator.diffusion.diffusion import Diffusion_language
default_root = "your working directory"
generator = Diffusion_language(length=50)
generator_modelpath = os.path.join(default_root, 'checkpoints/diffusion/')

# (2) define the predictor
from gpro.predictor.cnn_k15.cnnk15 import CNN_K15_language
predictor = CNN_K15_language(length=50)
predictor_modelpath = os.path.join(default_root, 'checkpoints/cnn_k15/checkpoint.pth')

# (3) select the highly-expressed sequences
natural_datapath = os.path.join(default_root, 'data/diffusion_prediction/seq.txt')

tmp = Filter(generator=generator, predictor=predictor,
             generator_modelpath=generator_modelpath, predictor_modelpath=predictor_modelpath,
             natural_datapath=natural_datapath, savepath="./optimization/Filter")

# run the optimizer with the default running params
tmp.run()

This program yields the top sample_number sequences generated by the diffusion model and selected by CNN K15, together with their predicted expression values.

Results

The output files are compared_with_natural.pdf, ExpIter.txt, and ExpIter.csv.

| files | description |
| --- | --- |
| compared_with_natural.pdf | Box plot comparing model-generated results with natural results |
| ExpIter.txt | Final selected sequences in FASTA format |
| ExpIter.csv | Final selected sequences with their predicted values |
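Since ExpIter.txt is described as a FASTA file, it can be read back with a small parser. The sketch below is a generic FASTA reader written for illustration; `parse_fasta` is a hypothetical helper, and the record names in the demo are made up rather than taken from actual GPro output.

```python
def parse_fasta(lines):
    """Parse FASTA-formatted lines into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may span several lines; text before the first header is ignored.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Made-up records, only to show the expected input shape.
demo = [">seq_0", "ACGTACGT", "TTGACA", ">seq_1", "TATAAT"]
records = parse_fasta(demo)
print(records)
```

To read a real result file, pass an open file handle instead of a list, for example `with open(path) as handle: records = parse_fasta(handle)`, where `path` points to ExpIter.txt inside your savepath directory.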

A box plot is shown below.
