4.1.3 cGAN and VAE
The conditional generation model aims to learn the joint distribution of data and labels in order to generate data conditioned on a given label. Here, we reproduce the recent article by Zhang et al. published in Nature Communications [1] and implement a demo for the design of constitutive and inducible promoters. A schematic diagram of the conditional generation process has been provided.
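As a conceptual illustration of conditioning (this is a hypothetical minimal sketch in PyTorch, not the Deepseed implementation; the class name `ToyConditionalGenerator` and the dimensions `noise_dim`/`cond_dim` are invented for demonstration):

```python
import torch
import torch.nn as nn

class ToyConditionalGenerator(nn.Module):
    """Hypothetical minimal conditional generator: the label/condition vector
    is concatenated with the noise vector, so the output distribution is
    conditioned on the label."""
    def __init__(self, noise_dim=64, cond_dim=8, seq_len=165, vocab_size=4):
        super().__init__()
        self.seq_len, self.vocab_size = seq_len, vocab_size
        self.net = nn.Sequential(
            nn.Linear(noise_dim + cond_dim, 256),
            nn.ReLU(),
            nn.Linear(256, seq_len * vocab_size),
        )

    def forward(self, noise, cond):
        logits = self.net(torch.cat([noise, cond], dim=-1))
        # one position-wise distribution over A/C/G/T per site
        return logits.view(-1, self.seq_len, self.vocab_size).softmax(dim=-1)

# usage: two samples conditioned on an 8-dimensional label vector
g = ToyConditionalGenerator()
fake = g(torch.randn(2, 64), torch.randn(2, 8))  # shape (2, 165, 4)
```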

The data format for this model differs from our previous two models; please refer to the format of ecoli_mpra_3_laco.csv. This section only provides a reproduction of the published work. If you wish to run the model on a new dataset, please refer to the paper [1] for details.
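To check that your own data matches the expected layout, you can quickly inspect the reference file (a minimal sketch assuming pandas is installed; the column names are whatever ecoli_mpra_3_laco.csv actually contains):

```python
import pandas as pd

# peek at the reference dataset to confirm your file follows the same layout
df = pd.read_csv("./datasets/ecoli_mpra_3_laco.csv")
print(df.columns.tolist())
print(df.head())
```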
When initializing, the input parameters of the model are as follows:
params | description | default value
---|---|---
data_name | the saving tag of the data | ecoli_mpra_3_laco
model_name | the saving tag of the model | deepseed_ecoli_mpra_3_laco
seqL | length of the input sequence | 165
dataset | path of the training dataset | None
savepath | path for saving results | None
n_iters | total number of training iterations | 10000
save_iters | save the model checkpoint every save_iters iterations | 1000
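For example, an initialization spelling out every parameter from the table above might look like this (the parameter names come from the table; the paths are illustrative):

```python
from gpro.generator.others.cgan.cgan import Deepseed

model = Deepseed(
    data_name="ecoli_mpra_3_laco",            # saving tag of the data
    model_name="deepseed_ecoli_mpra_3_laco",  # saving tag of the model
    seqL=165,                                  # length of the input sequence
    dataset="./datasets/ecoli_mpra_3_laco.csv",
    savepath="./checkpoints",
    n_iters=10000,                             # total training iterations
    save_iters=1000,                           # checkpoint frequency
)
```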
When generating, the model parameters are as follows:
params | description | default value
---|---|---
input_file | file like input_promoters.txt that provides the mask format | None
sample_model_path | path of the trained model | None
sample_output | whether to write the sampled sequences to a file | True
seed | random seed for sampling | 0
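Putting the sampling parameters together, a call might look like the sketch below (it assumes model is a trained Deepseed instance, as in the demo that follows; the checkpoint path depends on your model_name and n_iters):

```python
model.generate(
    input_file="./datasets/input_promoters.txt",  # provides the mask format
    sample_model_path="./checkpoints/check/deepseed_ecoli_mpra_3_laco/net_G_9999.pth",
    sample_output=True,  # write the sampled sequences to disk
    seed=0,              # random seed for reproducible sampling
)
```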
A demo for model training/sampling is shown below. You can run the following program under the demo/demo5 folder:
```python
from gpro.generator.others.cgan.cgan import Deepseed

# training
model = Deepseed(n_iters=10000, save_iters=10000,
                 dataset="./datasets/ecoli_mpra_3_laco.csv",
                 savepath="./checkpoints")
model.train()

# sampling
model.generate(input_file='./datasets/input_promoters.txt',
               sample_model_path='./checkpoints/check/deepseed_ecoli_mpra_3_laco/net_G_9999.pth')
```
After the training step, you will find a cache and a check folder under savepath; when you then perform sampling, you will also get a samples file containing the generated sequences.
```
/checkpoints/cache/model_name
├── figure
│   └── 4-mer frequency
├── gen_iter
│   └── sampling at every save_iter
├── inducible
│   └── csv format of samples
└── training_log
    └── training log file
```
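If you want to post-process the generated sequences, the CSVs under the inducible folder can be loaded directly (a sketch assuming pandas; the glob pattern and exact file names depend on your model_name and run):

```python
import glob
import pandas as pd

# the inducible folder holds the sampled promoters in CSV format
for path in glob.glob("./checkpoints/cache/deepseed_ecoli_mpra_3_laco/inducible/*.csv"):
    samples = pd.read_csv(path)
    print(path, samples.shape)
```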
The remaining model files and other settings are the same as before.
A VAE is an autoencoder whose encoding distribution is regularized during training to ensure that its latent space has good properties, allowing us to generate new data. A schematic diagram of the VAE architecture has been provided.
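To make the regularization concrete, here is a conceptual PyTorch sketch (not the SimpleVAE code; the function vae_loss is invented for illustration). The training loss adds a KL term that pulls the encoding distribution toward a standard normal prior, which is what keeps the latent space smooth enough to sample from:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    """Reconstruction term + KL(q(z|x) || N(0, I)); the KL term regularizes
    the encoding distribution so nearby latent points decode to similar data."""
    recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```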
It should be noted that our VAE structure here is completely based on [2] (see also [3]), and we have not yet conducted a comprehensive evaluation of the generated sequence quality, the optimal training schedule, etc. Therefore, we cannot provide an accurate parameter table here; however, the format of the parameters is consistent with the WGAN model.
You can import the SimpleVAE class from gpro.generator.others.vae.vae. A simple demo for VAE training is shown below:
```python
from gpro.generator.others.vae.vae import SimpleVAE

dataset_path = './datasets/sequence_data.txt'
checkpoint_path = './checkpoints'

# training
model = SimpleVAE(length=50)
model.train(dataset=dataset_path, savepath=checkpoint_path)

# sampling: same interface as WGAN and Diffusion
# (placeholder values; point sample_model_path at a checkpoint produced by train())
sample_model_path = './checkpoints/checkpoint.pth'  # placeholder path
sample_number = 1000                                # placeholder count
seed = 0
model.generate(sample_model_path, sample_number, seed)
```
[1] Zhang, P., Wang, H., Xu, H. et al. Deep flanking sequence engineering for efficient promoter design using DeepSEED. Nat Commun 14, 6309 (2023). https://doi.org/10.1038/s41467-023-41899-y
[2] Brookes, D., Park, H. & Listgarten, J. Conditioning by adaptive sampling for robust design. In International Conference on Machine Learning (ICML), PMLR, 773-782 (2019).
[3] Linder, J., Bogard, N., Rosenberg, A. B. et al. A generative neural network for maximizing fitness and diversity of synthetic DNA and protein sequences. Cell Syst 11, 49-62.e16 (2020).