4.4.1 Kmer - WangLabTHU/GPro GitHub Wiki
hcwang and qxdu edited on Aug 4, 2023, 1 version
kmer.py estimates the k-mer frequencies of a generator. Specifically, given a well-trained model, it compares the k-mer frequency distribution of natural sequences with that of generated sequences.
We consider a generative model to be better when it can "imitate" the distribution of the original sequences at the 4-mer to 9-mer scale. For example, if the motif "AATCGG" occurs in natural sequences at a frequency of 0.1%, it should also occur in generated sequences at a similar frequency. Generally speaking, 6-mer characteristics are the most representative in organisms.
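The idea of a k-mer frequency can be sketched in a few lines of Python. This is only an illustration of the concept, not the implementation inside kmer.py; the function name `kmer_frequencies` and its arguments are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seqs, k=6, alphabet="ACGT"):
    """Hypothetical helper: relative frequency of every k-mer over all sliding windows."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows that contain characters outside the alphabet (e.g. N).
            if all(base in alphabet for base in kmer):
                counts[kmer] += 1
    total = sum(counts.values())
    # Enumerate every possible k-mer so that two frequency tables share the same keys.
    all_kmers = ("".join(p) for p in product(alphabet, repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


# Toy input: "AATCGG" fills 3 of the 12 sliding windows.
freqs = kmer_frequencies(["AATCGGAATCGG", "TTAATCGGTT"], k=6)
print(freqs["AATCGG"])  # 0.25
```

Computing such a table once for the natural sequences and once for the generated sequences gives the two distributions that are compared.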
As shown in the following figure, we present how well the Diffusion model restores the 6-mer frequencies. Each point represents a motif of length 6. The value in the upper left corner is the covariance coefficient (
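A statistic of this kind can be sketched as follows. The exact statistic reported in the figure is not fully specified here, so this sketch assumes a Pearson correlation between the natural and generated frequencies of each k-mer. The function name `pearson` and the toy frequencies are made up for illustration and are not results from any model.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up frequencies for three 6-mers (illustrative values only).
natural = {"AATCGG": 0.0010, "TTGACA": 0.0004, "TATAAT": 0.0007}
generated = {"AATCGG": 0.0011, "TTGACA": 0.0005, "TATAAT": 0.0006}

# Each k-mer is one point: x = natural frequency, y = generated frequency.
kmers = sorted(natural)
r = pearson([natural[m] for m in kmers], [generated[m] for m in kmers])
print(round(r, 3))
```

A value close to 1 would indicate that the generated sequences reproduce the natural k-mer frequencies closely.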
Caution: when using Diffusion, generator_modelpath should always be the folder with model_name, but when using WGAN, generator_modelpath should be a checkpoint file such as "xxx.pth".
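The caution above can be turned into a small sanity check before calling the plotting functions. This is a minimal sketch, not part of GPro: the helper `check_generator_modelpath` and its `model_type` labels are hypothetical.

```python
import os
import tempfile


def check_generator_modelpath(model_type, path):
    """Hypothetical check: a folder for Diffusion, a .pth file path for WGAN."""
    if model_type == "diffusion":
        # Diffusion expects a folder, so the path must be an existing directory.
        return os.path.isdir(path)
    if model_type == "wgan":
        # WGAN expects a checkpoint file such as "xxx.pth".
        return path.endswith(".pth")
    raise ValueError(f"unknown model type: {model_type}")


# Demonstration with a temporary folder standing in for a checkpoint directory.
with tempfile.TemporaryDirectory() as folder:
    diffusion_ok = check_generator_modelpath("diffusion", folder)
    wgan_ok = check_generator_modelpath("wgan", os.path.join(folder, "net_G_12.pth"))
    wgan_bad = check_generator_modelpath("wgan", folder)

print(diffusion_ok, wgan_ok, wgan_bad)
```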
params | description | default value | function |
---|---|---|---|
generator | the trained generator class | None | plot_kmer_with_model |
generator_modelpath | the pretrained model checkpoint | None | plot_kmer_with_model |
generator_training_datapath | path to the natural dataset; the training set of the generator is the best choice | None | both |
report_path | saving folder | None | both |
file_tag | saving name | None | both |
K | value of k in the k-mer frequency | 6 | both |
num_seqs_to_test | number of sequences sampled for the frequency comparison | 10000 | both |
generator_sampling_datapath | samples generated by pretrained model | None | plot_kmer |
plot_kmer_with_model only needs the trained model as direct input.
plot_kmer needs the path to the original training data and the path to the newly sampled data. We recommend that these datasets be in FASTA format.
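For readers unfamiliar with the FASTA layout, the sketch below shows how such a file is typically read: header lines start with ">" and the following lines hold the sequence. This is an illustrative parser only. The name `read_fasta` is hypothetical and it is not how GPro loads data internally.

```python
import io


def read_fasta(handle):
    """Hypothetical minimal FASTA reader: returns the sequences, ignoring the headers."""
    seqs, chunks = [], []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if chunks:
                seqs.append("".join(chunks))
                chunks = []
        else:
            # Sequence lines of one record may span several lines.
            chunks.append(line)
    if chunks:
        seqs.append("".join(chunks))
    return seqs


# In-memory stand-in for a FASTA file with two records.
demo = io.StringIO(">seq1\nAATC\nGG\n>seq2\nTTAATCGGTT\n")
sequences = read_fasta(demo)
print(sequences)  # ['AATCGG', 'TTAATCGGTT']
```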
import os

from gpro.generator.gans.wgans import WGAN_language
from gpro.evaluator.kmer import plot_kmer_with_model, plot_kmer

# Paths to the natural training data and the trained WGAN checkpoint
project_path = "your project path"
generator_training_datapath = os.path.join(project_path,'data/sequence_data.txt')
generator_modelpath = project_path + '/checkpoints/wgans/checkpoints/net_G_12.pth'
generator = WGAN_language()

# plot with model sampling
plot_kmer_with_model(generator, generator_modelpath, generator_training_datapath,
report_path="./results/", file_tag="WGAN")

# plot directly from previously generated samples
generator_sampling_datapath = project_path + '/checkpoints/wgans/samples/sample.txt'
plot_kmer(generator_training_datapath, generator_sampling_datapath, report_path="./results/", file_tag="WGAN")
The final result will be saved in the ./results directory.