7. Datasets - WangLabTHU/GPro GitHub Wiki

hcwang and qxdu edited on Aug 4, 2023, 1 version

Both we and other researchers have demonstrated that generative models can expand the accessible sequence space and design novel promoters with desired functions in E. coli, yeast, and mammalian cells. Detailed information on the datasets is provided in the folder https://github.com/WangLabTHU/GPro/tree/main/data . With these preprocessed datasets, you can replicate current influential work or test your own model. We encourage you to contribute your own models or datasets.

Caution: see readme in demo file for further information.

| Name | Species | Length (bp) | Description | Citations |
| --- | --- | --- | --- | --- |
| ecoli_165_cgan_wanglab | Escherichia coli | 165 | flanking sequences for conditional generation | [1] |
| ecoli_165_cross_species_wanglab | Escherichia coli | 165 | MPRA, cross species element design | [2], [3] |
| ecoli_50_wgan_diffusion_wanglab | Escherichia coli | 50 | MPRA, synthetic promoter with generative AI | [4] |
| protein_rand_feedback_jameszou | antimicrobial sequences | random | training datasets for feedback GAN | [5] |
| yeast_1000_expression_gan_aleksej | Saccharomyces cerevisiae | 1000 | cross 4 regulatory regions | [6] |
| yeast_110_evolution_aviv | Saccharomyces cerevisiae | 110 | core 80 bps, complex media | [7], [8] |
| drosophila_epidermis_1001_activity_axlexander | Drosophila | 1001 | including accessibility and activities | [9], [10] |
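
Before training on one of these datasets, it can help to confirm that the sequences you loaded match the length listed in the table above. The following is a minimal sketch only: the file layout is not documented on this page, so `read_sequences` is a hypothetical reader that assumes one sequence per line (FASTA-style `>` header lines are skipped), and `check_lengths` and `EXPECTED_LENGTH` are illustrative names, not part of GPro.

```python
from collections import Counter

# Expected lengths (bp) copied from the table above.
# protein_rand_feedback_jameszou is omitted because its length is not fixed.
EXPECTED_LENGTH = {
    "ecoli_165_cgan_wanglab": 165,
    "ecoli_165_cross_species_wanglab": 165,
    "ecoli_50_wgan_diffusion_wanglab": 50,
    "yeast_1000_expression_gan_aleksej": 1000,
    "yeast_110_evolution_aviv": 110,
    "drosophila_epidermis_1001_activity_axlexander": 1001,
}


def read_sequences(path):
    """Hypothetical reader: assumes one sequence per line.

    Blank lines and FASTA-style header lines (starting with '>') are skipped.
    Adjust this to the actual file format of the dataset you use.
    """
    with open(path) as handle:
        return [
            line.strip()
            for line in handle
            if line.strip() and not line.startswith(">")
        ]


def check_lengths(sequences, expected):
    """Count sequences whose length differs from `expected`, keyed by length."""
    return Counter(len(seq) for seq in sequences if len(seq) != expected)


# Toy data for illustration: one 50 bp sequence and one 4 bp sequence.
toy = ["ACGT" * 12 + "AC", "ACGT"]
mismatches = check_lengths(toy, EXPECTED_LENGTH["ecoli_50_wgan_diffusion_wanglab"])
print(mismatches)  # an empty Counter would mean every sequence has the expected length
```

An empty result indicates that all sequences match the listed length; any other result shows which unexpected lengths occur and how often, which is usually a sign that the reader does not match the file format.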

Citations

[1] Zhang P, Wang H, Xu H, et al. Deep flanking sequence engineering for efficient promoter design using DeepSEED[J]. Nature Communications, 2023, 14(1): 6309. https://doi.org/10.1038/s41467-023-41899-y
[2] Johns N I, Gomes A L C, Yim S S, et al. Metagenomic mining of regulatory elements enables programmable species-selective gene expression[J]. Nature Methods, 2018, 15(5): 323-329.
[3] Systematic representation and optimization enable the inverse design of cross-species regulatory sequences in bacteria. Submitted.
[4] Wang Y, Wang H, Wei L, et al. Synthetic promoter design in Escherichia coli based on a deep generative network[J]. Nucleic Acids Research, 2020, 48(12): 6403-6412.
[5] Gupta A, Zou J. Feedback GAN for DNA optimizes protein functions[J]. Nature Machine Intelligence, 2019, 1(2): 105-111.
[6] Zrimec J, Fu X, Muhammad A S, et al. Controlling gene expression with deep generative design of regulatory DNA[J]. Nature Communications, 2022, 13(1): 5099.
[7] Vaishnav E D, de Boer C G, Molinet J, et al. The evolution, evolvability and engineering of gene regulatory DNA[J]. Nature, 2022, 603: 455-463.
[8] de Boer C G, Vaishnav E D, Sadeh R, et al. Deciphering eukaryotic gene-regulatory logic with 100 million random promoters[J]. Nature Biotechnology, 2020, 38(1): 56-65.
[9] de Almeida B P, Reiter F, Pagani M, et al. DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers[J]. Nature Genetics, 2022, 54(5): 613-624.
[10] de Almeida B P, Schaub C, Pagani M, et al. Targeted design of synthetic enhancers for selected tissues in the Drosophila embryo[J]. Nature, 2023: 1-2.