4.3.5 Feedback - WangLabTHU/GPro GitHub Wiki

hcwang and qxdu edited on Aug 4, 2023, 1 version

Introduction

The Feedback strategy was first proposed in Feedback GAN [1]. It repeatedly replaces sequences in the training set with newly generated sequences that the predictor scores as highly expressed, so that the generator is gradually retrained toward higher predicted expression. It remains an important algorithm in adaptive machine learning, and its effectiveness was demonstrated in the original study [1]. The following figure shows its workflow:

We provide a simplified implementation of this algorithm that works with both the WGAN and Diffusion generators.
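To make the idea concrete, the loop can be sketched in plain Python. This is a toy illustration, not GPro's implementation: `toy_predictor` (GC fraction), `toy_generate` (sample and mutate), `replace_k`, and the choice to replace the oldest training sequences first are all assumptions made for the example.

```python
import random

ALPHABET = "ACGT"

def toy_predictor(seq):
    """Hypothetical stand-in for a trained predictor: GC fraction as the score."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def toy_generate(training_set, n, rng, mutation_rate=0.1):
    """Hypothetical stand-in for a retrained generator: copy a training sequence and mutate it."""
    samples = []
    for _ in range(n):
        parent = rng.choice(training_set)
        child = "".join(
            rng.choice(ALPHABET) if rng.random() < mutation_rate else base
            for base in parent
        )
        samples.append(child)
    return samples

def feedback_loop(natural, n_iter=20, sample_number=200, replace_k=50, seed=0):
    rng = random.Random(seed)
    training_set = list(natural)
    history = []  # mean predicted score of the training set after each step
    for _ in range(n_iter):
        # 1. Sample candidates from the current generator.
        candidates = toy_generate(training_set, sample_number, rng)
        # 2. Keep the candidates with the highest predicted score.
        k = min(replace_k, len(training_set))
        best = sorted(candidates, key=toy_predictor, reverse=True)[:k]
        # 3. Replace the oldest training sequences with the selected ones (assumption).
        training_set = training_set[k:] + best
        history.append(sum(map(toy_predictor, training_set)) / len(training_set))
    return training_set, history

rng = random.Random(1)
natural = ["".join(rng.choice(ALPHABET) for _ in range(50)) for _ in range(100)]
final_set, history = feedback_loop(natural)
```

In this toy setting the mean predicted score in `history` drifts upward as high-scoring samples displace the original sequences, which is the behaviour the feedback strategy relies on.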

Input Parameters

Initialization params

| params | description | default value |
| --- | --- | --- |
| generator | generator model class | None |
| predictor | predictor model class | None |
| predictor_modelpath | trained model path of predictor | None |
| natural_datapath | natural sequences datapath | None |
| sample_number | default sampling scale at each epoch | 1000 |
| savepath | final results saving directory | None |

Running params

| params | description | default value |
| --- | --- | --- |
| MaxEpoch | sampling of sample_number sequences is repeated MaxEpoch times | 50 |
| MaxPoolsize | number of sequences kept in the final selection | 1000 |
| MaxIter | the feedback steps are repeated MaxIter times | 20 |
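As a rough guide to cost, the sketch below works out how many sequences the default settings would sample. It assumes one reading of the tables above: each feedback step samples sample_number sequences MaxEpoch times, and MaxPoolsize sequences are kept from that pool. The helper `sampling_budget` is hypothetical and not part of GPro.

```python
def sampling_budget(sample_number=1000, MaxEpoch=50, MaxPoolsize=1000, MaxIter=20):
    """Estimate sampling volume under the assumed reading of the running params."""
    per_step = sample_number * MaxEpoch      # candidates sampled in one feedback step
    total = per_step * MaxIter               # candidates sampled over the whole run
    kept_fraction = MaxPoolsize / per_step   # share of one step's pool that is kept
    return {"per_step": per_step, "total": total, "kept_fraction": kept_fraction}

budget = sampling_budget()
print(budget)
```

Under these assumptions the defaults correspond to 50,000 candidates per feedback step and 1,000,000 over the full run, so lowering MaxEpoch or MaxIter is the most direct way to shorten a trial run.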

Demo

Before running the optimizer, you need a trained generator and a trained predictor.

A simple demo looks like this:

import os

from gpro.optimizer.model_driven.feedback import Feedback

# (1) define the generator
from gpro.generator.diffusion.diffusion import Diffusion_language
default_root = "your working directory"
generator = Diffusion_language(length=50)

# (2) define the predictor and the path to its trained checkpoint
from gpro.predictor.cnn_k15.cnnk15 import CNN_K15_language
predictor = CNN_K15_language(length=50)
predictor_modelpath = os.path.join(default_root, 'checkpoints/cnn_k15/checkpoint.pth')

# (3) select the highly-expressed sequences
natural_datapath = os.path.join(default_root, 'data/diffusion_prediction/seq.txt')

tmp = Feedback(generator=generator, predictor=predictor,
               predictor_modelpath=predictor_modelpath, sample_number=1000,
               natural_datapath=natural_datapath, savepath="./optimization/Feedback")

tmp.run()

This demo returns the top sample_number sequences generated by the diffusion model, ranked by the expression values that the CNN K15 predictor assigns to them.

Results

The output consists of compared_with_natural.pdf, ExpIter.txt, ExpIter.csv, /checkpoints, /plot, traj, and linechart.png.

| files | description |
| --- | --- |
| compared_with_natural.pdf | box plot comparing model-generated results with natural results |
| ExpIter.txt | FASTA file of the final result sequences |
| ExpIter.csv | sequences and predictions for the final result sequences |
| checkpoints | folder containing the checkpoints of the retraining steps |
| plot | histograms of natural and model-driven results for comparison |
| traj | sampling results at each retraining step |
| linechart.png | line chart of mean predicted expression |
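Since ExpIter.txt is described as a FASTA file, it can be loaded with a few lines of standard-library Python. The sketch below assumes conventional FASTA layout (a header line starting with ">" followed by one or more sequence lines); the header contents GPro writes are not specified here, so the `seq_0` and `seq_1` names in the example input are made up.

```python
import io

def read_fasta(handle):
    """Parse FASTA text from an open file-like object into (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may wrap across several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Illustrative input only; real files come from the savepath of a Feedback run.
example = io.StringIO(">seq_0\nACGT\nTTGA\n>seq_1\nGGCC\n")
records = read_fasta(example)
```

To read real output, pass an open file handle for ExpIter.txt to `read_fasta` instead of the in-memory example.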

A box plot is shown below:

The transformation process of the histogram during the feedback-training process is as follows:

A linechart is shown below:

Citations

[1] Gupta, A., Zou, J. Feedback GAN for DNA optimizes protein functions. *Nat Mach Intell* **1**, 105–111 (2019). 