hcwang and qxdu edited on Aug 4, 2023, 1 version

Introduction

The Filter() class implements the most traditional approach to sequence optimization. After training a generator and a predictor, users sample a large number of new sequences from the generator and then use the predictor to select the sequences predicted to perform best. The accuracy of this method has been fully verified. The following figure compares the sequences generated by the diffusion model and the WGAN model with natural sequences:

We then used MEME to analyze which features the predictor emphasizes relative to natural sequences. The predictor's preferences are consistent with the biological laws of natural Escherichia coli. A good predictor can therefore guide us toward sequences with better performance.

A designer can pair any generative model with any prediction model mentioned above. A designer can then run pipelines such as:

generator/predictor training -> sampling -> optimizing -> feedback or output
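
The sampling and optimizing steps of this pipeline can be illustrated with a minimal sketch of the generate-then-filter idea. This is not GPro's implementation: `toy_generator` and `toy_predictor` are hypothetical stand-ins (random DNA sequences and a GC-fraction score) for a trained generator and predictor, and `generate_then_filter` is an illustrative name.

```python
import random

ALPHABET = "ACGT"

def toy_generator(n, length, rng):
    """Hypothetical stand-in for a trained generator: random DNA sequences."""
    return ["".join(rng.choice(ALPHABET) for _ in range(length)) for _ in range(n)]

def toy_predictor(seqs):
    """Hypothetical stand-in for a trained predictor: score is the GC fraction."""
    return [(s.count("G") + s.count("C")) / len(s) for s in seqs]

def generate_then_filter(n_samples, top_k, length=50, seed=0):
    rng = random.Random(seed)
    # Sampling: draw a large batch of candidates from the generator.
    candidates = toy_generator(n_samples, length, rng)
    # Optimizing: score every candidate with the predictor.
    scores = toy_predictor(candidates)
    # Filtering: keep the top_k candidates by predicted score.
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]

selected = generate_then_filter(n_samples=1000, top_k=5)
for seq, score in selected:
    print(f"{score:.2f}\t{seq}")
```

The quality of the selection depends entirely on the predictor, which is why a well-trained predictor matters more here than the choice of generator.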

Input Parameters

Initialization params

params	description	default value
generator	generator model class	None
predictor	predictor model class	None
generator_modelpath	trained model path of generator	None
predictor_modelpath	trained model path of predictor	None
natural_datapath	natural sequences datapath	None
sample_number	default sampling scale at each epoch	None
savepath	final results saving directory	None

Running params

params	description	default value
MaxEpoch	number of sampling rounds; sample_number sequences are sampled in each round	10
MaxPoolsize	number of sequences kept in the final selection	2000
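
One way to read these two parameters is that sample_number × MaxEpoch candidates are scored in total, while at most MaxPoolsize of them are kept. The sketch below illustrates that reading with a bounded min-heap. It is an assumption about the semantics, not GPro's code: `run_epochs` is a hypothetical function, the sequences are random, and the score is a random placeholder for a predictor.

```python
import heapq
import random

def run_epochs(sample_number, max_epoch=10, max_poolsize=2000, seed=0):
    """Score sample_number candidates per epoch and keep the best max_poolsize."""
    rng = random.Random(seed)
    pool = []   # min-heap of (score, sequence); pool[0] is the weakest kept entry
    total = 0
    for _ in range(max_epoch):
        for _ in range(sample_number):
            seq = "".join(rng.choice("ACGT") for _ in range(50))
            score = rng.random()  # placeholder for a predictor's output
            total += 1
            if len(pool) < max_poolsize:
                heapq.heappush(pool, (score, seq))
            elif score > pool[0][0]:
                # Replace the weakest kept entry with the better candidate.
                heapq.heapreplace(pool, (score, seq))
    # Return the number of scored candidates and the pool, best first.
    return total, sorted(pool, reverse=True)

total, pool = run_epochs(sample_number=300, max_epoch=10, max_poolsize=2000)
print(total, len(pool))
```

Under this reading, a pool size larger than sample_number × MaxEpoch simply returns every sampled sequence.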

Demo

Before running the optimizer, you should have a trained generator and a trained predictor.

A simple demo looks like this:

import os

from gpro.optimizer.model_driven.filter import Filter

# (1) define the generator
from gpro.generator.diffusion.diffusion import Diffusion_language
default_root = "your working directory"
generator = Diffusion_language(length=50)
generator_modelpath = os.path.join(str(default_root), 'checkpoints/diffusion/')

# (2) define the predictor
from gpro.predictor.cnn_k15.cnnk15 import CNN_K15_language
predictor = CNN_K15_language(length=50)
predictor_modelpath = os.path.join(str(default_root), 'checkpoints/cnn_k15/checkpoint.pth')

# (3) select the highly-expressed sequence
natural_datapath = default_root + '/data/diffusion_prediction/seq.txt'

tmp = Filter(generator=generator, predictor=predictor,
             generator_modelpath=generator_modelpath, predictor_modelpath=predictor_modelpath,
             natural_datapath=natural_datapath, savepath="./optimization/Filter")

tmp.run()

This program returns the top sample_number sequences generated by the diffusion model, selected by the CNN K15 predictor according to its predicted expression values.

Results

The resulting files consist of compared_with_natural.pdf, ExpIter.txt, and ExpIter.csv.

files	description
compared_with_natural.pdf	Box plot comparing model-generated results with natural results
ExpIter.txt	FASTA file containing the final selected sequences
ExpIter.csv	Final selected sequences together with their predictions
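
Since ExpIter.txt is described as a FASTA file, it can be loaded with a small parser like the sketch below. This assumes the standard FASTA layout (a `>` header line followed by one or more sequence lines); the exact header format GPro writes is not specified here, and `read_fasta` is a hypothetical helper, demonstrated on an in-memory example rather than a real output file.

```python
import io

def read_fasta(handle):
    """Parse FASTA text from a file-like object into (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Made-up example content, not real GPro output.
example = io.StringIO(">seq_0\nACGTACGT\n>seq_1\nTTGACA\nTATAAT\n")
records = read_fasta(example)
for name, seq in records:
    print(name, len(seq), seq)
```

To read a real result, pass an open file handle for ExpIter.txt in the directory given as savepath instead of the in-memory example.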

A box plot is shown below.

4.3.1 Filter - WangLabTHU/GPro GitHub Wiki
