4. Functions - WangLabTHU/GPro GitHub Wiki
hcwang and qxdu edited on Aug 4, 2023, 1 version
The GPro package allows biologists to discover novel functional promoters in unexplored sequence space, and to perform a wide variety of other quality-control analyses.
The GPro package mainly provides new-sequence discovery algorithms based on both Wasserstein Generative Adversarial Networks (WGAN) and Multinomial Diffusion Networks (Diffusion), which have complementary strengths. The package also provides promoter strength prediction algorithms using existing models from various papers, supporting the assessment of both prokaryotic and eukaryotic promoters and covering a range of currently popular architectures, such as DenseNet, DenseNet + LSTM (DenseLSTM), CNN (CNN_K15, CNN_Wangye), and Multihead Attention + BiLSTM (AttnBiLSTM). These models support the exploration of promoter space. In addition, the GPro suite provides Optimizers for searching for sequences with optimized expression levels. Finally, you can control sequence quality and visually analyze model performance with the different Evaluators.
The GPro package comprises a collection of tools that work together, as shown below. To see what has changed recently, you can peruse the release notes. During predictor training, 70% of the data is used as the training set by default and 30% as the validation set.
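The default 70/30 split can be reproduced in a few lines of plain Python. This is a generic sketch of the idea, not the package's own splitting code:

```python
import random

def split_dataset(sequences, labels, train_frac=0.7, seed=42):
    """Shuffle paired (sequence, label) data and split into
    training and validation sets at the given fraction."""
    rng = random.Random(seed)
    paired = list(zip(sequences, labels))
    rng.shuffle(paired)
    cut = int(len(paired) * train_frac)
    return paired[:cut], paired[cut:]

seqs = ["ACGT" for _ in range(10)]
vals = list(range(10))
train, valid = split_dataset(seqs, vals)
# with 10 records and train_frac=0.7: 7 training pairs, 3 validation pairs
```

Shuffling before the split avoids ordering bias when the input file is sorted by expression level.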
Beyond running the full pipeline from start to finish, each module described above is individually packaged and can be imported on its own.
Credits: all of the evaluators are adapted from the code of BioAutoMATED [10], an AutoML pipeline for biological sequences. We have fine-tuned this code to make it better suited to our tasks.
4.1 Generator
Generator | Description | Import Method | Class Name | Citations |
---|---|---|---|---|
WGAN | Wasserstein Generative Adversarial Networks | gpro.generator.wgan.wgan | WGAN_language | [1] |
Diffusion | Multinomial Diffusion | gpro.generator.diffusion.diffusion | Diffusion_language | [2] |
cGAN | conditional Generative Adversarial Networks | gpro.generator.others.cgan.cgan | DeepSeed | [6] |
VAE | Variational Auto-Encoder | gpro.generator.others.vae.vae | SimpleVAE | [7] |
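At bottom, each generator above parameterizes a categorical distribution over A/C/G/T at every position, from which new candidate promoters are sampled. The following is a framework-free toy sketch of that sampling step, not the package's API; the probability matrix and `sample_sequences` helper are illustrative inventions:

```python
import random

BASES = "ACGT"

def sample_sequences(prob_matrix, n, seed=0):
    """Sample n sequences from per-position nucleotide probabilities.
    prob_matrix[i] maps each base to its probability at position i,
    i.e. the categorical distribution a trained generator outputs."""
    rng = random.Random(seed)
    seqs = []
    for _ in range(n):
        seq = "".join(
            rng.choices(BASES, weights=[pos[b] for b in BASES])[0]
            for pos in prob_matrix
        )
        seqs.append(seq)
    return seqs

# a toy 4-position "model" heavily biased toward a TATA motif
tata = [{"A": .05, "C": .05, "G": .05, "T": .85},
        {"A": .85, "C": .05, "G": .05, "T": .05},
        {"A": .05, "C": .05, "G": .05, "T": .85},
        {"A": .85, "C": .05, "G": .05, "T": .05}]
samples = sample_sequences(tata, 5)
```

In practice the real classes (e.g. `WGAN_language`, `Diffusion_language`) learn these distributions from training data rather than taking them as input.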
4.2 Predictor
Predictor | Description | Import Method | Class Name | Citations |
---|---|---|---|---|
CNN_K15 | CNN network for k1.5 virus promoter | gpro.predictor.cnn_k15.cnn_k15 | CNN_K15_language | |
CNN_Wangye | CNN network proposed by Wangye | gpro.predictor.cnn_wangye.cnn_wangye | WangYeModel_language | [1] |
DenseNet | Predictor based on DenseNet | gpro.predictor.densenet.densenet | DenseNet_language | [3] |
DenseLSTM | Predictor based on DenseNet and LSTM | gpro.predictor.denselstm.denselstm | DenseLSTM_language | [3] |
AttnBiLSTM | Predictor based on Multihead Attention Layer and Bi-directional LSTM | gpro.predictor.attnbilstm.attnbilstm | AttnBilstm_language | [3] |
GRUClassifier | Binary Classifier based on GRU structure | gpro.predictor.others.GRUClassifier | GRUClassifier_language | [8] |
DeepSTARR2 | Regressive DeepSTARR2 predictor and corresponding Binary Classifier | gpro.predictor.deepstarr2.deepstarr2, gpro.predictor.deepstarr2.deepstarr2_binary | DeepSTARR2_language, DeepSTARR2_binary_language | [11], [12] |
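All of these predictors consume fixed-length DNA sequences, conventionally one-hot encoded as a (length, 4) matrix before being fed to the network. A minimal encoding helper, illustrative only and not the actual function in `gpro.utils`:

```python
import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot matrix,
    the standard input format for CNN/LSTM promoter predictors."""
    mat = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        mat[i, BASE_INDEX[base]] = 1.0
    return mat

x = one_hot("ACGT")
# "ACGT" maps to the 4x4 identity matrix: one row per position,
# one column per base in A, C, G, T order
```

Regressors (e.g. DeepSTARR2) map this matrix to a continuous strength value, while the binary variants (GRUClassifier, DeepSTARR2_binary) map it to a class probability.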
4.3 Optimizer
Optimizer | Description | Import Method | Class Name | Citations |
---|---|---|---|---|
Filter | Using the predictive model to filter a large quantity of newly generated sequences | gpro.optimizer.model_driven.filter | Filter | |
Genetic | Using a genetic algorithm to search for novel sequences that maximize the predictor | gpro.optimizer.heuristic.genetic | GeneticAlgorithm | |
Annealing | Using simulated annealing to search for novel sequences that maximize the predictor | gpro.optimizer.heuristic.annealing | AnnealingAlgorithm | |
GDS | Using gradient descent to search the latent space for points that maximize the predictor | gpro.optimizer.model_driven.gradient | GradientAlgorithm | |
Feedback | Using a feedback strategy to directly search the training set for sequences that maximize the predictor | gpro.optimizer.model_driven.feedback | Feedback | [8] |
Drift | Using random drift to simulate natural selection | gpro.optimizer.evolution.drift | Drift | [9] |
SSWM | Using Strong-Selection Weak-Mutation (SSWM) to simulate natural selection | gpro.optimizer.evolution.sswm | SSWM | [9] |
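The heuristic optimizers share a common pattern: propose a mutated sequence, score it with the predictor, and accept or reject the move. The toy sketch below illustrates this with simulated annealing; the cooling schedule, step count, and `toy_predictor` are all invented for illustration and do not reflect `AnnealingAlgorithm`'s actual parameters:

```python
import math
import random

BASES = "ACGT"

def toy_predictor(seq):
    """Stand-in for a trained strength predictor: GC content."""
    return sum(b in "GC" for b in seq) / len(seq)

def anneal(seq, predictor, steps=200, t0=1.0, seed=0):
    """Simulated annealing over single-base mutations: always accept
    improvements, accept worse moves with probability exp(delta / T)."""
    rng = random.Random(seed)
    best = cur = seq
    for step in range(steps):
        t = t0 * (1 - step / steps) + 1e-6      # linear cooling schedule
        i = rng.randrange(len(cur))
        cand = cur[:i] + rng.choice(BASES) + cur[i + 1:]
        delta = predictor(cand) - predictor(cur)
        if delta >= 0 or rng.random() < math.exp(delta / t):
            cur = cand
            if predictor(cur) > predictor(best):
                best = cur
    return best

result = anneal("ATATATATAT", toy_predictor)
```

Swapping the acceptance rule for crossover and population selection gives the genetic variant; replacing mutation proposals with gradient steps in a generator's latent space gives the GDS approach.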
4.4 Evaluator
Evaluator | Description | Import Method | Function Name | Citations |
---|---|---|---|---|
Kmer frequency | Estimates the similarity of generated sequences to natural sequences; a criterion for the generator | gpro.evaluator.kmer | plot_kmer_with_model, plot_kmer | [1] |
saturation mutagenesis | Mutates every site in the sequence set, using the predictor to weight the importance of each site | gpro.evaluator.mutagenesis | plot_mutagenesis | |
coefficient | Estimates the similarity of predicted expression levels to natural expression levels; a criterion for the predictor | gpro.evaluator.regression | plot_regression_performance | [5] |
saliency map | Evaluates the importance of each position for the predictor | gpro.evaluator.saliency | plot_saliency_map | |
seqlogo | Plots the sequence logo consistent with the predictor | gpro.evaluator.seqlogo | plot_seqlogos | |
blastn plot | Visualizes the blastn report | gpro.evaluator.blast_plot | blastn_evaluation | |
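The k-mer evaluator compares the k-mer frequency spectra of generated versus natural sequences. The underlying statistic can be computed as follows; this is a simplified sketch, and the choice of k and the pooling over sequences are assumptions rather than the evaluator's exact settings:

```python
from collections import Counter

def kmer_freq(sequences, k=4):
    """Relative k-mer frequencies pooled over a set of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

natural = ["ACGTACGTAC", "TTACGTACGA"]
generated = ["ACGTACGTAA", "CTACGTACGT"]
f_nat, f_gen = kmer_freq(natural), kmer_freq(generated)

# k-mers present in both sets with similar frequencies suggest the
# generator has captured the natural sequence statistics
shared = set(f_nat) & set(f_gen)
```

Plotting one spectrum against the other (as `plot_kmer` does) turns this comparison into a visual diagnostic: points near the diagonal indicate a well-matched generator.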
4.5 Utils
In addition, we provide some common functions in gpro.utils, such as reading FASTA files, converting data formats, checking formats, and shuffling and partitioning training sets. See Utils for detailed information.
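As an illustration of the kind of helper found there, here is a minimal FASTA reader. It is a generic sketch, not the actual signature of the function in `gpro.utils`:

```python
import os
import tempfile

def read_fasta(path):
    """Parse a FASTA file into a list of (header, sequence) pairs,
    joining sequence lines that are wrapped across multiple rows."""
    records, header, chunks = [], None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# round-trip a tiny two-record file
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as fh:
    fh.write(">promoter_1\nACGT\nACGT\n>promoter_2\nTTTT\n")
    demo = fh.name
records = read_fasta(demo)
os.unlink(demo)
# records == [("promoter_1", "ACGTACGT"), ("promoter_2", "TTTT")]
```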
Citations
[1] Synthetic promoter design in E. coli based on deep generative models
[2] Argmax Flows and Multinomial Diffusion: Learning Categorical Distributions
[3] Deep flanking sequence engineering for efficient promoter design
[4] Vaishnav E D, de Boer C G, Molinet J, et al. The evolution, evolvability and engineering of gene regulatory DNA[J]. Nature, 2022, 603(7901): 455-463.
[5] Model-driven generation of artificial yeast promoters
[6] Zhang, P., Wang, H., Xu, H. et al. Deep flanking sequence engineering for efficient promoter design using DeepSEED. Nat Commun 14, 6309 (2023). https://doi.org/10.1038/s41467-023-41899-y
[7] Brookes D, Park H, Listgarten J. Conditioning by adaptive sampling for robust design[C]//International Conference on Machine Learning. PMLR, 2019: 773-782.
[8] Gupta A, Zou J. Feedback GAN for DNA optimizes protein functions[J]. Nature Machine Intelligence, 2019, 1(2): 105-111.
[9] Vaishnav, E.D., de Boer, C.G., Molinet, J. et al. The evolution, evolvability and engineering of gene regulatory DNA. Nature 603, 455–463 (2022). https://doi.org/10.1038/s41586-022-04506-6
[10] Valeri J A, Soenksen L R, Collins K M, et al. BioAutoMATED: An end-to-end automated machine learning tool for explanation and design of biological sequences[J]. Cell Systems, 2023, 14(6): 525-542. e9.
[11] de Almeida B P, Reiter F, Pagani M, et al. DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers[J]. Nature Genetics, 2022, 54(5): 613-624.
[12] de Almeida B P, Schaub C, Pagani M, et al. Targeted design of synthetic enhancers for selected tissues in the Drosophila embryo[J]. Nature, 2023: 1-2.