3. Quick start - WangLabTHU/GPro GitHub Wiki

hcwang and qxdu edited on Aug 4, 2023, 1 version

Simple Usage

For beginners, we provide design.py, a one-stop script that handles model training, sequence generation, and filtering. We recommend this workflow for sequences shorter than 150 bp; the models involved (WGAN and CNN) have been published [1].

All parameters are passed through argparse. Running the script without any parameters uses the defaults, which lets you check directly whether the program runs normally.

./quickstart provides a template for data organization: training data should have the same format as ./quickstart/extdata, and the final results should resemble ./quickstart/results.

We highly recommend organizing your working directory in the following format, so that you can copy design.py into your preferred directory and execute it directly.

/sampleFolder
    ├── dataset
    │   ├── seq.txt
    │   ├── exp.txt
    │   └── ...
    └── design.py

The only inputs you need to provide are seq.txt, exp.txt, and the sequence length. The data requirements are described in detail below.

Sequence file

Each line of this file should be a single DNA sequence, and all sequences must be of equal length, as shown below. FASTA files are currently not supported as direct input, but you can use ./quickstart/utils.py to convert a FASTA file to the format below.

ATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATT
TAATTTTTATCTGTCTGTGCGCTATGCCTATATTGGTTAAAGTATTTAGT
AATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATAGGC
CATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAA
GCACCAATGAGCGTACCTGGTGCTTGAGGATTTCCGGTATTTTTAATCAG
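
As a rough illustration of the required format, the sketch below turns FASTA text into one sequence per line and checks that all sequences share a single length. It is not the code in ./quickstart/utils.py; the helper names `fasta_to_lines` and `check_equal_length` are hypothetical, and the short example records are made up for the demonstration.

```python
from typing import Iterable, List


def fasta_to_lines(fasta_lines: Iterable[str]) -> List[str]:
    """Collect one upper-case sequence per FASTA record (hypothetical helper)."""
    sequences: List[str] = []
    current: List[str] = []
    for raw in fasta_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line closes the previous record, if there is one.
            if current:
                sequences.append("".join(current))
                current = []
        else:
            # Sequence data may be wrapped over several lines.
            current.append(line.upper())
    if current:
        sequences.append("".join(current))
    return sequences


def check_equal_length(sequences: List[str]) -> int:
    """Return the shared sequence length, or raise if lengths differ."""
    lengths = {len(s) for s in sequences}
    if len(lengths) != 1:
        raise ValueError(f"expected one sequence length, found {sorted(lengths)}")
    return lengths.pop()


# Made-up FASTA input: two records, the first wrapped over two lines.
example = [">p1", "ATAGCAGCTT", "CTGAACTGGT", ">p2", "TAATTTTTATCTGTCTGTGC"]
seqs = fasta_to_lines(example)
seq_len = check_equal_length(seqs)
```

Writing `"\n".join(seqs)` to a file would then give the one-sequence-per-line layout shown above, and `seq_len` is the value to pass as the sequence length.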

Expression file

Each line of this file should be a floating-point number giving the expression of the sequence on the same line of seq.txt.

10.93
5.2
11.2
10.3
7.67

To improve the accuracy of predictor training, the program automatically takes log2(expression) before training. The file provided here has already been log2-transformed, so exp_mode is set to "direct".

You can also modify exp_mode (see predictor initialization) to specify your normalization method.
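
If you prefer to log2-transform raw measurements yourself and then train in "direct" mode, the preprocessing could look roughly like the sketch below. This is an assumption about one possible workflow, not GPro code: `to_log2`, `check_paired`, and the `pseudocount` parameter are hypothetical, and the raw values are made up.

```python
import math
from typing import List


def to_log2(raw_values: List[float], pseudocount: float = 0.0) -> List[float]:
    """log2-transform raw expression values (hypothetical helper).

    pseudocount is an illustrative guard against zero readings;
    whether to use one depends on your assay.
    """
    transformed = []
    for value in raw_values:
        shifted = value + pseudocount
        if shifted <= 0:
            raise ValueError(f"cannot take log2 of non-positive value {value}")
        transformed.append(math.log2(shifted))
    return transformed


def check_paired(sequences: List[str], expressions: List[float]) -> None:
    """Line i of exp.txt must describe line i of seq.txt."""
    if len(sequences) != len(expressions):
        raise ValueError(
            f"{len(sequences)} sequences but {len(expressions)} expression values"
        )


# Made-up raw measurements, chosen as powers of two for readability.
raw = [1024.0, 32.0, 2.0]
log2_values = to_log2(raw)
check_paired(["ACGT", "TTGA", "CCAT"], log2_values)
```

Values transformed this way should not be transformed a second time, which is the situation the "direct" setting describes.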

Execution

The execution command of the program:

python design.py -seq seq_path -exp exp_path -length seq_len

where seq_path indicates the file that contains the promoter sequences, exp_path indicates the file that contains the corresponding expression values, and seq_len is the length of the promoter sequences. The synthetic promoters designed by GPro will have the same length as the sequences in the seq_path file.

By default, seq_path is ./extdata/seq.txt, exp_path is ./extdata/exp.txt, and seq_len is 50. This default setting replicates the promoter design process from the paper "Synthetic promoter design in Escherichia coli based on a deep generative network" [1]. When finished, the code generates 50 bp constitutive synthetic promoters for E. coli.

With the default settings, simply execute:

python design.py
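
For orientation, a command-line interface with these three flags could be declared with argparse roughly as follows. This is a sketch, not the actual contents of design.py: only the flag names and default values come from the text above, while `build_parser` and the help strings are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the three documented flags (hypothetical)."""
    parser = argparse.ArgumentParser(description="Sketch of a design.py-like CLI")
    parser.add_argument("-seq", default="./extdata/seq.txt",
                        help="file with one promoter sequence per line")
    parser.add_argument("-exp", default="./extdata/exp.txt",
                        help="file with one expression value per line")
    parser.add_argument("-length", type=int, default=50,
                        help="length of every sequence in the -seq file")
    return parser


# No arguments: every option falls back to its default.
defaults = build_parser().parse_args([])

# Explicit arguments, as in the directory layout recommended earlier.
custom = build_parser().parse_args(
    ["-seq", "dataset/seq.txt", "-exp", "dataset/exp.txt", "-length", "50"]
)
```

Because every option has a default, calling the script with no arguments is a valid invocation, which matches the default run described above.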

The following information will appear when the program is running:

Step1: Start Generator Training
...
Generator training finish! Model has been saved in ./checkpoints/wgan


Step2: Start Predictor Training
...
Predictor training finish! Model has been saved in ./checkpoints/cnn_k15

Step3.1: Start Directly Selecting New Sequences
Step3.2: Start Performing Gradient-based Optimization
...
Optimization finish! Result has been saved in ./optimization/

Step4: Start Evaluating
...
Evaluating finish! Result has been saved in ./evaluation/

Results


/sampleFolder
    ├── checkpoints
    │   ├── cnn_k15
    │   └── wgan
    ├── optimization
    │   ├── Filter
    │   └── Gradient
    └── evaluation
        ├── kmer_WGAN.png
        ├── mutagenesis_CNNK15.png
        ├── regression_CNNK15.png
        ├── saliency_CNNK15.png
        ├── seqlogo_CNNK15.png
        ├── seqs.txt
        └── pred.txt

The checkpoints folder contains the checkpoints for the generator and predictor; see the guidance for CNNK15 and WGAN for a more detailed explanation.

The optimization folder contains the optimization trajectory for the trained generator and predictor. The Gradient folder holds the sequences that may maximize the activation of the predictor, obtained through a gradient-based algorithm (with a sample size equal to that of the natural dataset). The Filter folder holds samples randomly generated by the generator and then selected by the predictor (only 2000 samples by default).
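
The generate-then-select idea behind the Filter folder can be illustrated with a toy example. The sketch below assumes that selection means keeping the highest-scoring candidates, which the text does not state explicitly; `toy_score` is a stand-in for a trained predictor, random strings stand in for generator samples, and `filter_top_k` is a hypothetical name. None of this is GPro code.

```python
import random
from typing import Callable, List, Tuple


def toy_score(seq: str) -> float:
    """Toy stand-in for a trained predictor: fraction of A/T bases."""
    return sum(base in "AT" for base in seq) / len(seq)


def filter_top_k(
    candidates: List[str], score: Callable[[str], float], k: int
) -> List[Tuple[str, float]]:
    """Score every candidate and keep the k highest-scoring ones."""
    scored = [(seq, score(seq)) for seq in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Random 50 bp strings stand in for samples drawn from a generator.
rng = random.Random(0)
candidates = ["".join(rng.choice("ACGT") for _ in range(50)) for _ in range(200)]
selected = filter_top_k(candidates, toy_score, k=20)
```

In a real run the candidates would come from the trained generator and the scores from the trained predictor.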

An example profile for gradient-based optimization is provided in ./optimization/gradient/Expiter.csv, as shown below:

,seqs,pred
0,TGTACAATAAAACGTTTCATGGTTTCGGAGAATCAACCATACTATACGCA,8.179143905639648
1,AGCTAAATTCTGGTCAGGAACTGTCGTCAACATTTGGTAAGTTTTGAAAT,8.97793197631836
2,AGGTTAGCATGCAATCTATTAATGAAGTGTAAAGTCAGTATAATTATGTC,7.922459602355957
3,TTTCAGTAATTTGAAGGGTAGTAAATCTGACTCACGTTAATCATCAAGTT,6.438961029052734
4,ATCGCCTGATTCAGGATTAGTAATGCTTGACCTTTCAGTATACGTAAACC,6.132134437561035
5,AAGTCATTCAACTGGTTCAATTCAGCCTTGCAGATATTAGGATTTAAAGC,7.5647101402282715
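
A file in this layout can be inspected with the standard csv module. The sketch below ranks rows by predicted expression; it embeds the first three rows of the example above instead of reading from disk, and `rank_by_prediction` is a hypothetical helper name.

```python
import csv
import io
from typing import List, TextIO, Tuple

# First three rows of the example above, embedded so the sketch is self-contained.
SAMPLE = """,seqs,pred
0,TGTACAATAAAACGTTTCATGGTTTCGGAGAATCAACCATACTATACGCA,8.179143905639648
1,AGCTAAATTCTGGTCAGGAACTGTCGTCAACATTTGGTAAGTTTTGAAAT,8.97793197631836
2,AGGTTAGCATGCAATCTATTAATGAAGTGTAAAGTCAGTATAATTATGTC,7.922459602355957
"""


def rank_by_prediction(handle: TextIO) -> List[Tuple[str, float]]:
    """Return (sequence, prediction) pairs, highest prediction first."""
    reader = csv.DictReader(handle)
    rows = [(row["seqs"], float(row["pred"])) for row in reader]
    return sorted(rows, key=lambda item: item[1], reverse=True)


ranked = rank_by_prediction(io.StringIO(SAMPLE))
best_seq, best_pred = ranked[0]
```

For a file on disk, pass `open(path, newline="")` in place of the `io.StringIO` object.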

An example of the predicted expression levels of optimized versus natural sequences is provided in ./optimization/gradient/compared_with_natural.pdf.

The evaluation folder contains the outputs for all criteria in the Evaluator part, for example the sequence logo produced for the predictor (seqlogo_CNNK15.png).

With the analysis above, we have completed the entire process of machine-learning-guided sequence design. Please review the generated files and the subsequent documentation carefully to adjust and redesign the process to suit your task.

Estimated runtime: on an NVIDIA 3060, the entire process takes about half an hour.

Citations

[1] Ye Wang and others, Synthetic promoter design in Escherichia coli based on a deep generative network, Nucleic Acids Research, Volume 48, Issue 12, 09 July 2020, Pages 6403–6412, https://doi.org/10.1093/nar/gkaa325

[2] Hoogeboom E, Nielsen D, Jaini P, et al. Argmax flows and multinomial diffusion: Learning categorical distributions. Advances in Neural Information Processing Systems, 2021, 34: 12454-12465.

[3] Vaishnav, E.D., de Boer, C.G., Molinet, J. et al. The evolution, evolvability and engineering of gene regulatory DNA. Nature 603, 455–463 (2022). https://doi.org/10.1038/s41586-022-04506-6