4.4.1 Kmer - WangLabTHU/GPro GitHub Wiki

hcwang and qxdu edited on Aug 4, 2023, 1 version

Introduction

kmer.py can be used to estimate the k-mer frequencies of a generator. Specifically, given a well-trained model, it compares the distribution of k-mer frequencies in natural sequences with that in generated sequences.

We tend to think that a better generative model can "imitate" the distribution of the original sequences on the 4-9 mer scale. For example, when the motif "AATCGG" occurs in natural sequences at a frequency of 0.1%, it should also occur in generated sequences at a similar frequency. Generally speaking, the characteristics of 6-mers in organisms are the most representative.
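To make the notion of k-mer frequency concrete, here is a minimal sketch in plain Python. It is not the code used inside kmer.py; the function name `kmer_frequencies` and the choice to count overlapping windows are assumptions made for illustration.

```python
from collections import Counter


def kmer_frequencies(sequences, k=6):
    """Return {kmer: relative frequency} over all overlapping windows of length k.

    Illustrative helper only; it is not part of the GPro package.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k across the sequence, one base at a time.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        # No sequence was long enough to contain a k-mer.
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


natural = ["AATCGGTA", "CCAATCGG"]
freqs = kmer_frequencies(natural, k=6)
print(freqs["AATCGG"])
```

Running the same function on natural and generated sequences gives two frequency tables that can be compared k-mer by k-mer.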

As shown in the following figure, we present how well the Diffusion model reproduces 6-mer frequencies. Each point represents one 6-mer motif. The value in the upper left corner is the correlation statistic ($R^2$) between the frequencies in natural and generated sequences. Here it indicates that the Diffusion model generates sequences that are highly similar to natural sequences.
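This page does not state how the $R^2$ value in the figure is computed. One common choice, assumed here purely for illustration, is the squared Pearson correlation between the two frequency tables. The sketch below takes two dictionaries mapping k-mers to frequencies; the name `r_squared` and the rule that a missing k-mer counts as frequency 0 are assumptions, not GPro behavior.

```python
def r_squared(natural_freqs, generated_freqs):
    """Squared Pearson correlation over the union of k-mers seen in either table.

    Illustrative helper only; it is not part of the GPro package.
    """
    kmers = sorted(set(natural_freqs) | set(generated_freqs))
    # A k-mer missing from one table is treated as frequency 0 there.
    x = [natural_freqs.get(kmer, 0.0) for kmer in kmers]
    y = [generated_freqs.get(kmer, 0.0) for kmer in kmers]
    n = len(kmers)
    if n < 2:
        raise ValueError("need at least two k-mers")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("frequencies are constant; correlation is undefined")
    return cov * cov / (var_x * var_y)


natural = {"AATCGG": 0.5, "ATCGGT": 0.3, "TCGGTA": 0.2}
generated = {"AATCGG": 0.2, "ATCGGT": 0.3, "TCGGTA": 0.5}
print(r_squared(natural, natural))    # identical tables
print(r_squared(natural, generated))  # reversed frequencies
```

Because the value is squared, it does not distinguish positive from negative correlation, so it is worth looking at the scatter plot as well as the number.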

Caution: when using Diffusion, generator_modelpath should always be a folder with model_name, but when using WGAN, generator_modelpath should be a checkpoint file such as "xxx.pth".
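A quick sanity check on the path before plotting can catch a mismatch early. The helper below is hypothetical and not part of GPro. It only inspects the file extension, assuming that a WGAN checkpoint ends in ".pth" and that a Diffusion model folder does not.

```python
import os


def modelpath_matches_generator(generator_kind, generator_modelpath):
    """Hypothetical check of the path shape described in the caution above."""
    # Drop a trailing slash so "folder/" and "folder" are treated alike.
    _, ext = os.path.splitext(generator_modelpath.rstrip("/"))
    if generator_kind == "wgan":
        # WGAN: expect a checkpoint file such as "net_G_12.pth".
        return ext == ".pth"
    if generator_kind == "diffusion":
        # Diffusion: expect a folder, so no ".pth" extension.
        return ext != ".pth"
    raise ValueError(f"unknown generator kind: {generator_kind!r}")


print(modelpath_matches_generator("wgan", "checkpoints/wgans/checkpoints/net_G_12.pth"))
print(modelpath_matches_generator("diffusion", "checkpoints/diffusion/"))
```

The check does not touch the file system, so it says nothing about whether the path actually exists.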

Parameters

| params | description | default value | function |
| --- | --- | --- | --- |
| generator | the trained generator class | None | plot_kmer_with_model |
| generator_modelpath | the pretrained model checkpoint | None | plot_kmer_with_model |
| generator_training_datapath | path of natural dataset, training set for generator will be the best | None | both |
| report_path | saving folder | None | both |
| file_tag | saving name | None | both |
| K | vector k in k-mer frequency | 6 | both |
| num_seqs_to_test | sampling scales for frequency comparison | 10000 | both |
| generator_sampling_datapath | samples generated by pretrained model | None | plot_kmer |

Demo

The plot_kmer_with_model function simply takes the trained model as input.

The plot_kmer function needs the path of the original training data and the path of the newly sampled data. We recommend that these datasets be in fasta format.

import os

from gpro.generator.gans.wgans import WGAN_language
from gpro.evaluator.kmer import plot_kmer_with_model, plot_kmer

# Paths to the training data and the trained WGAN checkpoint
project_path = "your project path"
generator_training_datapath = os.path.join(project_path, 'data/sequence_data.txt')
generator_modelpath = os.path.join(project_path, 'checkpoints/wgans/checkpoints/net_G_12.pth')
generator = WGAN_language()

# plot with model sampling
plot_kmer_with_model(generator, generator_modelpath, generator_training_datapath,
                     report_path="./results/", file_tag="WGAN")

# kmer_direct: compare the training data with an existing sample file
generator_sampling_datapath = os.path.join(project_path, 'checkpoints/wgans/samples/sample.txt')
plot_kmer(generator_training_datapath, generator_sampling_datapath, report_path="./results/", file_tag="WGAN")

The final result will be saved in the ./results directory.
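Since fasta format is recommended for the inputs of plot_kmer, the sketch below shows one way to turn a list of sequences into FASTA text and read it back. Both helpers are illustrative and not part of GPro. The record names ("seq_0", "seq_1", ...) are arbitrary, and this page does not confirm that plot_kmer expects exactly this layout.

```python
def to_fasta(sequences, prefix="seq"):
    """Format sequences as FASTA text with generated record names."""
    return "".join(f">{prefix}_{i}\n{seq}\n" for i, seq in enumerate(sequences))


def read_fasta(text):
    """Parse FASTA text into a list of sequences, joining wrapped lines."""
    sequences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header closes the previous record, if any.
            if current:
                sequences.append("".join(current))
            current = []
        else:
            current.append(line)
    if current:
        sequences.append("".join(current))
    return sequences


text = to_fasta(["ACGT", "TTGA"])
print(text)
print(read_fasta(text))
```

Writing `text` to a file with the usual `open(path, "w")` call gives a dataset that can be passed as a path argument.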
