7. Datasets - WangLabTHU/GPro GitHub Wiki
hcwang and qxdu edited on Aug 4, 2023, 1 version
Datasets
We and other researchers have demonstrated that generative models can expand the accessible sequence space and design novel promoters with desired functions in E. coli, yeast, and mammalian cells. This page describes the datasets in the folder https://github.com/WangLabTHU/GPro/tree/main/data . With these preprocessed datasets, you can replicate influential published work or test your own model. We encourage you to share your own models or datasets with us.
Caution: see the readme in the demo file for further information.
Name | Species | Length (bp) | Description | Citations |
---|---|---|---|---|
ecoli_165_cgan_wanglab | Escherichia coli | 165 | flanking sequences for conditional generation | [1] |
ecoli_165_cross_species_wanglab | Escherichia coli | 165 | MPRA, cross species element design | [2], [3] |
ecoli_50_wgan_diffusion_wanglab | Escherichia coli | 50 | MPRA, synthetic promoter with generative AI | [4] |
protein_rand_feedback_jameszou | antimicrobial sequences | random | training datasets for feedback GAN | [5] |
yeast_1000_expression_gan_aleksej | Saccharomyces cerevisiae | 1000 | spans 4 regulatory regions | [6] |
yeast_110_evolution_aviv | Saccharomyces cerevisiae | 110 | core 80 bp, complex media | [7], [8] |
drosophila_epidermis_1001_activity_axlexander | Drosophila | 1001 | including accessibility and activities | [9], [10] |
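Before training on one of these datasets, it can help to confirm that the sequences you loaded match the length listed in the table. The following is a minimal sketch, not part of GPro: the function names and the one-sequence-per-line file layout are assumptions for illustration, and the actual files in the data folder may be organized differently (check the readme first). The lengths are copied from the table and treated as nominal values.

```python
from typing import Dict, Iterable, List, Optional

# Nominal lengths (bp) copied from the table above.
# None marks a dataset whose length is listed as "random".
EXPECTED_LENGTH_BP: Dict[str, Optional[int]] = {
    "ecoli_165_cgan_wanglab": 165,
    "ecoli_165_cross_species_wanglab": 165,
    "ecoli_50_wgan_diffusion_wanglab": 50,
    "protein_rand_feedback_jameszou": None,
    "yeast_1000_expression_gan_aleksej": 1000,
    "yeast_110_evolution_aviv": 110,
    "drosophila_epidermis_1001_activity_axlexander": 1001,
}


def read_sequences(path: str) -> List[str]:
    """Hypothetical loader: assumes one sequence per line, skips blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def find_length_mismatches(dataset_name: str, sequences: Iterable[str]) -> List[int]:
    """Return indices of sequences whose length differs from the table value."""
    expected = EXPECTED_LENGTH_BP[dataset_name]
    if expected is None:
        # No fixed length to check against.
        return []
    return [i for i, seq in enumerate(sequences) if len(seq.strip()) != expected]
```

For example, `find_length_mismatches("ecoli_50_wgan_diffusion_wanglab", read_sequences(path))` returns an empty list when every sequence in the file at `path` (a placeholder for wherever you saved the data) is 50 bp long.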
Citations
[1] Zhang P, Wang H, Xu H, et al. Deep flanking sequence engineering for efficient promoter design using DeepSEED[J]. Nature Communications, 2023, 14: 6309. https://doi.org/10.1038/s41467-023-41899-y
[2] Johns N I, Gomes A L C, Yim S S, et al. Metagenomic mining of regulatory elements enables programmable species-selective gene expression[J]. Nature Methods, 2018, 15(5): 323-329.
[3] Systematic representation and optimization enable the inverse design of cross-species regulatory sequences in bacteria. Submitted.
[4] Wang Y, Wang H, Wei L, et al. Synthetic promoter design in Escherichia coli based on a deep generative network[J]. Nucleic Acids Research, 2020, 48(12): 6403-6412.
[5] Gupta A, Zou J. Feedback GAN for DNA optimizes protein functions[J]. Nature Machine Intelligence, 2019, 1(2): 105-111.
[6] Zrimec J, Fu X, Muhammad A S, et al. Controlling gene expression with deep generative design of regulatory DNA[J]. Nature Communications, 2022, 13(1): 5099.
[7] Gupta A, Zou J. Feedback GAN for DNA optimizes protein functions[J]. Nature Machine Intelligence, 2019, 1(2): 105-111.
[8] de Boer C G, Vaishnav E D, Sadeh R, et al. Deciphering eukaryotic gene-regulatory logic with 100 million random promoters[J]. Nature Biotechnology, 2020, 38(1): 56-65.
[9] de Almeida B P, Reiter F, Pagani M, et al. DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers[J]. Nature Genetics, 2022, 54(5): 613-624.
[10] de Almeida B P, Schaub C, Pagani M, et al. Targeted design of synthetic enhancers for selected tissues in the Drosophila embryo[J]. Nature, 2023: 1-2.