4. Functions - WangLabTHU/GPro GitHub Wiki

hcwang and qxdu edited on Aug 4, 2023, 1 version

The Gpro package allows biologists to discover novel functional promoters in unexplored sequence space and to perform a wide variety of quality-control analyses.

The Gpro package mainly provides sequence discovery algorithms based on Wasserstein Generative Adversarial Networks (WGAN) and Multinomial Diffusion Networks (Diffusion), which have complementary strengths. It also provides promoter strength prediction algorithms built on existing models from various papers. These support the assessment of both prokaryotic and eukaryotic promoters and cover a range of currently popular models, such as DenseNet, DenseNet + LSTM (DenseLSTM), CNN (CNN_K15, CNN_Wangye), and Multihead Attention + BiLSTM (AttnBiLSTM). These models support the exploration of promoter space. In addition, the Gpro suite provides Optimizers that search for sequences with optimized expression levels. Finally, you can control sequence quality and visually analyze model performance with the different Evaluators.

The Gpro package comprises a collection of tools that work together, as shown below. To see what has changed recently, consult the release notes. When a predictor is trained, 70% of the data is used as the training set by default, and the remaining 30% is used as the validation set.
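The default 70/30 partition can be illustrated with a minimal sketch. The helper `split_train_valid`, its parameters, and the toy data are assumptions made for illustration; they are not Gpro functions, and Gpro's own splitting logic may differ (for example in how it shuffles).

```python
import random

def split_train_valid(records, train_fraction=0.7, seed=0):
    """Shuffle records and split them into training and validation lists.

    Hypothetical helper for illustration; not part of the Gpro API.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Toy (sequence, expression) pairs standing in for a real dataset.
pairs = [("ACGT" * 5, float(i)) for i in range(10)]
train, valid = split_train_valid(pairs)
print(len(train), len(valid))
```

With 10 records this yields 7 training and 3 validation examples.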

Besides running the whole workflow from start to finish, each module mentioned above is packaged separately and can be imported on its own.
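As a rough sketch of importing modules one at a time, the snippet below looks up a component by name and imports only that module. The module paths and class names are copied from the tables in sections 4.1 to 4.3; the `COMPONENTS` registry and the `load_component` helper are hypothetical and not part of Gpro. The helper returns `None` when a module cannot be imported, for example when Gpro is not installed.

```python
import importlib

# Module paths and class names as listed in the tables of sections 4.1 to 4.3.
COMPONENTS = {
    "WGAN": ("gpro.generator.wgan.wgan", "WGAN_language"),
    "Diffusion": ("gpro.generator.diffusion.diffusion", "Diffusion_language"),
    "DenseNet": ("gpro.predictor.densenet.densenet", "DenseNet_language"),
    "Filter": ("gpro.optimizer.model_driven.filter", "Filter"),
}

def load_component(name, registry=COMPONENTS):
    """Return the class registered under `name`, or None if it cannot be imported.

    Hypothetical helper for illustration; not part of the Gpro API.
    """
    module_path, class_name = registry[name]
    try:
        module = importlib.import_module(module_path)
    except ImportError:
        return None
    return getattr(module, class_name, None)

for name in COMPONENTS:
    cls = load_component(name)
    status = "available" if cls is not None else "not importable here"
    print(f"{name}: {status}")
```

In a script with Gpro installed, the equivalent direct form would be an ordinary import such as `from gpro.generator.wgan.wgan import WGAN_language`.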

Credits: all the evaluators are adapted from the code of BioAutoMATED [10], an AutoML pipeline for biological sequences. We have fine-tuned this code to make it more suitable for our tasks.

4.1 Generator

| Generator | Description | Import Method | Class Name | Citations |
| --- | --- | --- | --- | --- |
| WGAN | Wasserstein Generative Adversarial Networks | gpro.generator.wgan.wgan | WGAN_language | [1] |
| Diffusion | Multinomial Diffusion | gpro.generator.diffusion.diffusion | Diffusion_language | [2] |
| cGAN | conditional Generative Adversarial Networks | gpro.generator.others.cgan.cgan | DeepSeed | [6] |
| VAE | Variational Auto-Encoder | gpro.generator.others.vae.vae | SimpleVAE | [7] |

4.2 Predictor

| Predictor | Description | Import Method | Class Name | Citations |
| --- | --- | --- | --- | --- |
| CNN_K15 | CNN network for k1.5 virus promoter | gpro.predictor.cnn_k15.cnn_k15 | CNN_K15_language | |
| CNN_Wangye | CNN network proposed by Wangye | gpro.predictor.cnn_wangye.cnn_wangye | WangYeModel_language | [1] |
| DenseNet | Predictor based on DenseNet | gpro.predictor.densenet.densenet | DenseNet_language | [3] |
| DenseLSTM | Predictor based on DenseNet and LSTM | gpro.predictor.denselstm.denselstm | DenseLSTM_language | [3] |
| AttnBiLSTM | Predictor based on Multihead Attention Layer and Bi-directional LSTM | gpro.predictor.attnbilstm.attnbilstm | AttnBilstm_language | [3] |
| GRUClassifier | Binary Classifier based on GRU structure | gpro.predictor.others.GRUClassifier | GRUClassifier_language | [8] |
| DeepSTARR2 | Regressive DeepSTARR2 predictor and corresponding Binary Classifier | gpro.predictor.deepstarr2.deepstarr2, gpro.predictor.deepstarr2.deepstarr2_binary | DeepSTARR2_language, DeepSTARR2_binary_language | [11], [12] |

4.3 Optimizer

| Optimizer | Description | Import Method | Class Name | Citations |
| --- | --- | --- | --- | --- |
| Filter | Uses a predictive model to filter a large number of newly generated sequences | gpro.optimizer.model_driven.filter | Filter | |
| Genetic | Uses a genetic algorithm to search for novel sequences that maximize the predictor output | gpro.optimizer.heuristic.genetic | GeneticAlgorithm | |
| Annealing | Uses a simulated annealing algorithm to search for novel sequences that maximize the predictor output | gpro.optimizer.heuristic.annealing | AnnealingAlgorithm | |
| GDS | Uses a gradient descent algorithm to search the hidden space for the point that maximizes the predictor output | gpro.optimizer.model_driven.gradient | GradientAlgorithm | |
| Feedback | Uses a feedback strategy to directly search for a training set that maximizes the predictor output | gpro.optimizer.model_driven.feedback | Feedback | [8] |
| Drift | Uses random drift to simulate natural selection | gpro.optimizer.evolution.drift | Drift | [9] |
| SSWM | Uses Strong-Selection Weak-Mutation to simulate natural selection | gpro.optimizer.evolution.sswm | SSWM | [9] |
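To make the heuristic search idea concrete, here is a minimal sketch of a genetic algorithm that searches for sequences maximizing a scoring function. It is not Gpro's `GeneticAlgorithm` class: the function names, the parameters, and the toy GC-content score that stands in for a trained predictor are all assumptions made for illustration.

```python
import random

ALPHABET = "ACGT"

def toy_predictor(seq):
    """Stand-in for a trained predictor: fraction of G/C bases (an assumption)."""
    return sum(base in "GC" for base in seq) / len(seq)

def mutate(seq, rate, rng):
    """Replace each base with a randomly drawn base with probability `rate`."""
    return "".join(rng.choice(ALPHABET) if rng.random() < rate else b for b in seq)

def crossover(a, b, rng):
    """Single-point crossover between two equal-length sequences."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]

def genetic_search(population, score, generations=30, keep=10, rate=0.05, seed=0):
    """Return the final population sorted from highest to lowest score."""
    rng = random.Random(seed)
    population = list(population)
    size = len(population)
    for _ in range(generations):
        # Selection: the top-scoring sequences survive unchanged (elitism).
        parents = sorted(population, key=score, reverse=True)[:keep]
        children = []
        while len(parents) + len(children) < size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rate, rng))
        population = parents + children
    return sorted(population, key=score, reverse=True)

# Random starting population of 40 sequences, each 50 bases long.
rng = random.Random(1)
start = ["".join(rng.choice(ALPHABET) for _ in range(50)) for _ in range(40)]
best = genetic_search(start, toy_predictor)[0]
print(best, toy_predictor(best))
```

Because the parents are carried over unchanged, the best score in the population never decreases from one generation to the next. In practice the scoring function would be a trained predictor rather than GC content.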

4.4 Evaluator

| Evaluator | Description | Import Method | Function Name | Citations |
| --- | --- | --- | --- | --- |
| Kmer frequency | Estimates the similarity of generated sequences to natural sequences; a criterion for the generator | gpro.evaluator.kmer | plot_kmer_with_model, plot_kmer | [1] |
| saturation mutagenesis | Mutates every site in the sequence set and uses the predictor to weight the importance of each site | gpro.evaluator.mutagenesis | plot_mutagenesis | |
| coefficient | Estimates the similarity of predicted expression levels to natural expression levels; a criterion for the predictor | gpro.evaluator.regression | plot_regression_performance | [5] |
| saliency map | Evaluates the importance of each position for the predictor | gpro.evaluator.saliency | plot_saliency_map | |
| seqlogo | Plots the sequence logo consistent with the predictor | gpro.evaluator.seqlogo | plot_seqlogos | |
| blastn plot | Visualizes the blastn report | gpro.evaluator.blast_plot | blastn_evaluation | |
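As an illustration of the k-mer frequency criterion, the sketch below computes k-mer frequency vectors for a natural and a generated sequence set and compares them with a Pearson correlation. It is an assumption that correlation is a reasonable summary here; the helper names and toy sequences are hypothetical, and this is not the implementation behind `plot_kmer`.

```python
from collections import Counter
from itertools import product
from math import sqrt

def kmer_frequencies(seqs, k):
    """Relative frequency of every possible k-mer over a set of sequences.

    Assumes the sequences contain only the letters A, C, G and T.
    """
    counts = Counter(s[i:i + k] for s in seqs for i in range(len(s) - k + 1))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no k-mers found; are the sequences shorter than k?")
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[km] / total for km in kmers]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Toy sequence sets standing in for natural and generated promoters.
natural = ["ACGTACGTAC", "TTGACAATTA"]
generated = ["ACGTACGTTC", "TTGACATTTA"]

freq_natural = kmer_frequencies(natural, k=2)
freq_generated = kmer_frequencies(generated, k=2)
print(pearson(freq_natural, freq_generated))
```

A correlation close to 1 suggests that the generated set reproduces the k-mer composition of the natural set.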

4.5 Utils

In addition, gpro.utils provides some common functions, such as reading FASTA files, converting data formats, checking formats, and shuffling and partitioning training sets. See Utils for detailed information.
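The kinds of helpers described above might look roughly like the following sketch, which reads FASTA text, one-hot encodes a sequence, and checks that all sequences share one length. The function names `read_fasta`, `one_hot`, and `has_uniform_length` are hypothetical and are not the names used in gpro.utils; see the Utils page for the actual functions.

```python
import io

def read_fasta(handle):
    """Parse FASTA text from a file-like object into (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def one_hot(seq, alphabet="ACGT"):
    """Encode a sequence as rows of 0/1 values; unknown bases become all zeros."""
    return [[1 if base == letter else 0 for letter in alphabet] for base in seq]

def has_uniform_length(seqs):
    """Format check: True when all sequences have the same length."""
    return len({len(s) for s in seqs}) <= 1

# In-memory FASTA text so the example runs without any files on disk.
fasta_text = ">seq1\nACGT\nAC\n>seq2\nTTGACA\n"
records = read_fasta(io.StringIO(fasta_text))
print(records)
print(one_hot("ACN"))
print(has_uniform_length([seq for _, seq in records]))
```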

Citations

[1] Synthetic promoter design in E. coli based on deep generative models
[2] Argmax Flows and Multinomial Diffusion: Learning Categorical Distributions
[3] Deep flanking sequence engineering for efficient promoter design
[4] Vaishnav E D, de Boer C G, Molinet J, et al. The evolution, evolvability and engineering of gene regulatory DNA[J]. Nature, 2022, 603(7901): 455-463.
[5] Model-driven generation of artificial yeast promoters
[6] Zhang, P., Wang, H., Xu, H. et al. Deep flanking sequence engineering for efficient promoter design using DeepSEED. Nat Commun 14, 6309 (2023). https://doi.org/10.1038/s41467-023-41899-y
[7] Brookes D, Park H, Listgarten J. Conditioning by adaptive sampling for robust design[C]//International conference on machine learning. PMLR, 2019: 773-782.
[8] Gupta A, Zou J. Feedback GAN for DNA optimizes protein functions[J]. Nature Machine Intelligence, 2019, 1(2): 105-111.
[9] Vaishnav, E.D., de Boer, C.G., Molinet, J. et al. The evolution, evolvability and engineering of gene regulatory DNA. Nature 603, 455–463 (2022). https://doi.org/10.1038/s41586-022-04506-6
[10] Valeri J A, Soenksen L R, Collins K M, et al. BioAutoMATED: An end-to-end automated machine learning tool for explanation and design of biological sequences[J]. Cell Systems, 2023, 14(6): 525-542. e9.
[11] de Almeida B P, Reiter F, Pagani M, et al. DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers[J]. Nature Genetics, 2022, 54(5): 613-624.
[12] de Almeida B P, Schaub C, Pagani M, et al. Targeted design of synthetic enhancers for selected tissues in the Drosophila embryo[J]. Nature, 2023: 1-2.