4.1.1 WGAN - WangLabTHU/GPro GitHub Wiki


Wasserstein GAN Model Architecture

Promoter design remains one of the most important considerations in metabolic engineering and synthetic biology. Theoretically, there are $4^{50}$ possible sequences for a 50-nt promoter, of which naturally occurring promoters make up only a small subset. To explore this vast sequence space, Wang et al. used a WGAN model [1] for de novo promoter design in Escherichia coli. The model, guided by sequence features learned from natural promoters, captures interactions between nucleotides at different positions and designs novel synthetic promoters in silico. A schematic diagram of the whole workflow is provided.
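To put that search space in perspective, here is a quick back-of-the-envelope calculation in plain Python (no GPro dependencies):

# each of the 50 positions is one of 4 nucleotides (A, C, G, T)
space_size = 4 ** 50
print(f"{space_size:.3e}")  # ~1.268e+30 possible sequences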

Here, to help users understand how the model operates on biological sequences, we provide a more detailed operational pipeline. Note that two networks are trained jointly: a generator that produces sequences and a discriminator (critic) that scores them. Note also that WGAN results are not perfectly stable across runs.

Caution: we highly recommend not training the WGAN model for more than 12 epochs!

Input Parameters

We suggest defining all parameters during the initialization phase. There are two types of parameters: those that can only be defined during initialization (Fixed), and those that can be defined either at initialization or at the training/sampling stage (Flexible). In either case, a parameter can only be defined once.

Fixed params

| params | description | default value |
| --- | --- | --- |
| batch_size | training batch size | 32 |
| netG_lr | learning rate of the Generator network | $10^{-4}$ |
| netD_lr | learning rate of the Discriminator network | $10^{-4}$ |
| num_epochs | training epochs | 12 |
| print_epoch | sample the model output every print_epoch epochs | 1 |
| save_epoch | save the model state every save_epoch epochs | 1 |
| Lambda | weight of the gradient penalty term | 10 |
| length | sequence length of the training dataset | 50 |
| model_name | controls the saving path under "./checkpoints" | wgan |
| seed | random seed, only defined in generate() | 0 |
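For intuition about what Lambda controls: WGAN training adds a gradient penalty that keeps the discriminator (critic) approximately 1-Lipschitz. Below is a minimal PyTorch sketch of that penalty term; netD, real, and fake are hypothetical placeholders, not GPro internals.

import torch

def gradient_penalty(netD, real, fake, Lambda=10.0):
    # real/fake: one-hot sequence batches, assumed shape (batch, length, 4)
    alpha = torch.rand(real.size(0), 1, 1, device=real.device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    scores = netD(interp)
    grads = torch.autograd.grad(outputs=scores, inputs=interp,
                                grad_outputs=torch.ones_like(scores),
                                create_graph=True)[0]
    # penalize deviation of the per-sample gradient norm from 1
    norms = grads.view(grads.size(0), -1).norm(2, dim=1)
    return Lambda * ((norms - 1) ** 2).mean()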

Flexible params

| params | description | default value | flexible stage |
| --- | --- | --- | --- |
| dataset | path of the training dataset | None | train() |
| savepath | path for saving results | None | train() |
| sample_model_path | path of the trained model | None | generate() |
| sample_number | number of sequences to sample | None | generate() |
| sample_output | path for saving samples | None | generate() |
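Since flexible parameters may also be supplied at initialization, an equivalent pattern to the demo below should be possible (a sketch, assuming WGAN_language accepts these as keyword arguments, as the table implies):

from gpro.generator.wgan.wgan import WGAN_language

# flexible params fixed once, at initialization instead of at train()
model = WGAN_language(length=50,
                      dataset="data/sequence_data.txt",
                      savepath="checkpoints/wgan/")
model.train()  # dataset/savepath already defined above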

Demo

A demo for model training/sampling is described below:

import os
from gpro.generator.wgan.wgan import WGAN_language

# model training
default_root = "your working directory"
dataset_path = os.path.join(str(default_root), 'data/sequence_data.txt')
checkpoint_path = os.path.join(str(default_root), 'checkpoints/wgan/')
model = WGAN_language(length=50)
model.train(dataset=dataset_path, savepath=checkpoint_path)

# model sampling
sample_model_path = os.path.join(str(default_root), 'checkpoints/wgan/checkpoints/net_G_12.pth')
sample_number = 1000
model.generate(sample_model_path, sample_number)
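If you want to smoke-test this pipeline before using real promoter data, a toy training file can be generated as follows. This is a sketch that assumes the dataset is plain text with one 50-nt sequence per line; check your own data against the format GPro expects.

import os, random

os.makedirs("data", exist_ok=True)
random.seed(0)
with open("data/sequence_data.txt", "w") as f:
    for _ in range(1000):
        # one random 50-nt toy sequence per line
        f.write("".join(random.choice("ACGT") for _ in range(50)) + "\n")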

After the training step, you will find a checkpoints directory and a training_log directory under "checkpoint_path/model_name"; when you further perform sampling, you will also get a samples file containing your sampled sequences.

/checkpoints/wgan/model_name
    ├── checkpoints
    │   ├── net_D_i.pth
    │   └── net_G_i.pth
    ├── samples
    └── training_log

The detailed contents of these files are as follows:

checkpoints: contains net_G_xxx.pth/net_D_xxx.pth, the saved parameters of the generator/discriminator.
training_log: contains gen_iter_xx.txt, a FASTA file with the model's sampled output at epoch xx.
samples: a FASTA file containing the final result of model sampling, which can be further used for biological experiments or sequence optimization.
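As a downstream example, the sampled sequences can be read back for filtering or analysis. A minimal sketch, assuming the samples file uses standard FASTA formatting (header lines starting with ">"); adjust the path to wherever your run actually wrote its samples:

def read_fasta(path):
    # collect sequences keyed by their FASTA headers
    records, header, seq = {}, None, []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    records[header] = "".join(seq)
                header, seq = line[1:], []
            elif line:
                seq.append(line)
    if header is not None:
        records[header] = "".join(seq)
    return records

samples = read_fasta("checkpoints/wgan/samples")  # hypothetical path
print(len(samples), "sequences loaded")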

Citations

[1] Ye Wang et al., Synthetic promoter design in Escherichia coli based on a deep generative network, Nucleic Acids Research, Volume 48, Issue 12, July 2020, Pages 6403–6412, https://doi.org/10.1093/nar/gkaa325