4.5 Utils - WangLabTHU/GPro GitHub Wiki

hcwang and qxdu edited on Aug 4, 2023, 1 version

Well-encapsulated helper functions make routine tasks much easier. Here we provide a series of utility functions that you can call directly, even in environments that are not set up for machine learning.

Caution: Here "fasta" refers to the ".fa" format, and "sequence" refers to a plain-text format with one sequence string per line.
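
To make the distinction concrete, here is a minimal sketch of the two layouts in plain Python. It does not use the GPro API: the helper names `to_fasta_text`, `to_sequence_text` and `read_either`, and the `>seq_0` style of header, are assumptions made only for illustration.

```python
def to_fasta_text(seqs, prefix="seq"):
    """Render sequences in fasta style: a '>' header line, then the sequence."""
    lines = []
    for i, s in enumerate(seqs):
        lines.append(f">{prefix}_{i}")  # header naming is an assumption
        lines.append(s)
    return "\n".join(lines) + "\n"


def to_sequence_text(seqs):
    """Render sequences in 'sequence' style: one string per line, no headers."""
    return "\n".join(seqs) + "\n"


def read_either(text):
    """Recover sequences from either layout by skipping '>' header lines.

    Assumes each fasta record keeps its sequence on a single line.
    """
    return [ln.strip() for ln in text.splitlines()
            if ln.strip() and not ln.startswith(">")]


seqs = ["ACGTAC", "TTGACA"]
fasta_text = to_fasta_text(seqs)
sequence_text = to_sequence_text(seqs)
```

Both texts carry the same sequences; only the header lines differ, which is why a single reader can accept either layout.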

base

utils.base includes some functions suitable for basic dataset processing. You can call them through:

from gpro.utils.base import xxx

Here are the functions:

| Function | params | Description |
| --- | --- | --- |
| open_fa | file: input path | Opens a file in fasta or plain sequence format |
| write_fa | file: output path, data: data to be stored | Sequences will be saved in fasta format |
| write_seq | file: output path, data: data to be stored | Sequences will be saved in sequence format |
| write_exp | file: output path, data: data to be stored | Expression values will be saved in sequence format (one value per line) |
| write_profile | file: output path, seqs: seqs to be stored, pred: predictions to be stored | Both sequences and predictions will be saved in a .csv file |
| freq_table_generation | samples: input sequences | Generates the PPM (position probability matrix) of the input dataset |
| plot_weblogos | file, seqs, plot_mode | See https://github.com/WangLabTHU/GPro/wiki/4.4.5-WebLogo for detailed instructions |
| data_check | datapath: input path | Checks whether the input sequence text can be used for training a Generator or Predictor |

For example, you can check the availability of your training data:

import os
from gpro.utils.base import data_check

default_root = "your working directory"  # replace with your own path
dataset = os.path.join(default_root, 'data/seq.txt')
data_check(dataset)
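
The PPM produced by freq_table_generation can be pictured with the sketch below. It is an independent illustration in plain Python, not GPro's implementation: the name `position_probability_matrix`, the A/C/G/T row order and the rows-by-positions orientation are all assumptions.

```python
def position_probability_matrix(seqs, alphabet="ACGT"):
    """Return a len(alphabet) x L matrix of per-position base frequencies.

    Assumes all sequences have the same length and use only `alphabet`.
    """
    if not seqs:
        raise ValueError("need at least one sequence")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")

    index = {base: row for row, base in enumerate(alphabet)}
    counts = [[0] * length for _ in alphabet]
    for s in seqs:
        for pos, base in enumerate(s.upper()):
            if base not in index:
                raise ValueError(f"unexpected character {base!r}")
            counts[index[base]][pos] += 1

    n = len(seqs)
    # Divide counts by the number of sequences so each column sums to 1.
    return [[c / n for c in row] for row in counts]


ppm = position_probability_matrix(["ACGT", "ACGA", "TCGT", "ACTT"])
```

Each column of `ppm` sums to 1, and this is the kind of matrix a sequence logo is drawn from.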

utils_predictor

utils.utils_predictor includes some functions for predictor training and testing. You can call them through:

from gpro.utils.utils_predictor import xxx

Here are the functions:

| Function | params | Description |
| --- | --- | --- |
| csv2fasta | csv_path: input path, data_path: saving path, data_name: saving prefix | Only useful for cGAN; see the code for demo5 at https://github.com/WangLabTHU/GPro/wiki/5.-Demos#demo5-conditional-gan-for-promoter-flanking-sequences-design for further instructions |
| EarlyStopping | None, no external interface | Class used to stop predictor training when the coefficients on the validation set have not improved for patience steps |
| seq2onehot | seq: input data, length: input seq length | Converts input sequences into one-hot encoded data; also used in generation and optimization |
| open_exp | file: input path, operator: default "log2", preprocessing on expression levels | Reads expression data |
| dataset_shuffle | seqpath: input sequence path, exppath: input expression path, savetag: default "False", whether to save results | Inputs are shuffled so that the predictor does not learn only local features during training. Results will have a "_shuffle" suffix. |
| dataset_split | seqpath: input sequence path, exppath: input expression path, ratio: default "0.8", the ratio for the training set (the remaining part is kept for testing), savetag: default "False", whether to save results | Inputs are split into training and testing datasets. Results will have a "_train"/"_test" suffix. |
| open_fa | file: input path | Opens a file in fasta or plain sequence format |
| write_exp | file: output path, data: data to be stored | Expression values will be saved in sequence format (one value per line) |
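
The patience rule behind EarlyStopping can be sketched as follows. This is a generic illustration under stated assumptions, not the GPro class: `EarlyStoppingSketch`, its `step` method and the "higher is better" validation score (for example a correlation coefficient) are all hypothetical.

```python
class EarlyStoppingSketch:
    """Stop once the validation score has failed to improve `patience` times in a row."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = None        # best validation score seen so far
        self.bad_steps = 0      # consecutive checks without improvement
        self.should_stop = False

    def step(self, score):
        if self.best is None or score > self.best:
            self.best = score
            self.bad_steps = 0
        else:
            self.bad_steps += 1
            if self.bad_steps >= self.patience:
                self.should_stop = True
        return self.should_stop


stopper = EarlyStoppingSketch(patience=2)
history = [0.50, 0.62, 0.61, 0.60, 0.70]  # made-up validation scores
stopped_at = None
for epoch, score in enumerate(history):
    if stopper.step(score):
        stopped_at = epoch
        break
```

With a patience of 2, training halts after two consecutive checks without improvement, so the later score of 0.70 is never reached. A larger patience trades longer training for robustness against noisy validation scores.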

For example, you can shuffle or split your training data:

import os
from gpro.utils.utils_predictor import dataset_shuffle, dataset_split

default_root = "your working directory"  # replace with your own path
seqpath = os.path.join(default_root, 'data/seq.txt')
exppath = os.path.join(default_root, 'data/exp.txt')
dataset_shuffle(seqpath, exppath, savetag=True)  # saves results with a "_shuffle" suffix
dataset_split(seqpath, exppath, savetag=True)    # saves "_train"/"_test" files, ratio 0.8 by default
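
The key point when shuffling and splitting is that each sequence must stay paired with its expression value. The sketch below shows that idea in plain Python on in-memory lists. It is not GPro's implementation: the helper names `shuffle_paired` and `split_paired` and the fixed `seed` are assumptions, and only the 0.8 ratio comes from the table above.

```python
import random


def shuffle_paired(seqs, exps, seed=0):
    """Shuffle sequences and expression values with one shared permutation."""
    if len(seqs) != len(exps):
        raise ValueError("sequences and expression values must align")
    order = list(range(len(seqs)))
    random.Random(seed).shuffle(order)  # seed is an assumption, for reproducibility
    return [seqs[i] for i in order], [exps[i] for i in order]


def split_paired(seqs, exps, ratio=0.8):
    """Put the first `ratio` share in the training set and the rest in the test set."""
    cut = int(len(seqs) * ratio)
    return (seqs[:cut], exps[:cut]), (seqs[cut:], exps[cut:])


# Toy data: the number in each name matches its expression value.
seqs = [f"SEQ{i}" for i in range(10)]
exps = [float(i) for i in range(10)]

s_seqs, s_exps = shuffle_paired(seqs, exps)
(train_seqs, train_exps), (test_seqs, test_exps) = split_paired(s_seqs, s_exps)
```

Shuffling before splitting keeps the test set from being just the tail of an ordered input file.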

utils_evaluator

utils.utils_evaluator includes some functions for evaluation and further experiment. You can call them through:

from gpro.utils.utils_evaluator import xxx

Here are the functions:

| Function | params | Description |
| --- | --- | --- |
| read_fa | file_name: input path | Opens a file in fasta or plain sequence format |
| seq2onehot | seq: input data, length: input seq length | Converts input sequences into one-hot encoded data; also used in generation and optimization |
| filter_for_experiment | seqpath: input sequence path, gc_strength: default "[0.2,0.8]", poly_strength: default "5" | Filters out sequences that might be difficult to implement in biological experiments. gc_strength gives the minimum and maximum allowed G/C ratio, since very high or very low GC content tends to cause instability; poly_strength gives the maximum allowed length of poly A/T, since this structure is prone to introducing mutations during the annealing process. Filtered sequences and reports will be saved. Results will have a "_filter" suffix. |
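
For readers unfamiliar with one-hot encoding, the conversion performed by seq2onehot can be pictured with this sketch. It is not the GPro function: the name `one_hot`, the A/C/G/T channel order, the positions-by-channels shape, and the all-zero rows for padding and unknown characters are assumptions.

```python
def one_hot(seq, length, alphabet="ACGT"):
    """Encode `seq` as a length x len(alphabet) matrix of 0/1 values.

    Sequences longer than `length` are truncated; shorter ones are padded
    with all-zero rows. Characters outside `alphabet` also map to zeros.
    """
    index = {base: col for col, base in enumerate(alphabet)}
    matrix = [[0] * len(alphabet) for _ in range(length)]
    for pos, base in enumerate(seq.upper()[:length]):
        col = index.get(base)
        if col is not None:
            matrix[pos][col] = 1
    return matrix


encoded = one_hot("ACGTN", length=6)
```

Fixing `length` up front gives every encoded sequence the same shape, which is what a neural network input layer expects.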

For example, here is a filter step that lets us pick the sequences that are more likely to succeed in experiments:

import os
from gpro.utils.utils_evaluator import filter_for_experiment

default_root = "your working directory"  # replace with your own path
seqpath = os.path.join(default_root, 'data/optimized_seqs.txt')
filter_for_experiment(seqpath)  # defaults: gc_strength=[0.2, 0.8], poly_strength=5
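
The two criteria described for filter_for_experiment can be sketched in plain Python as below. This is an illustration of the rules as stated in the table, not the library's code. The helper names (`gc_ratio`, `longest_poly_at`, `passes_filter`) are hypothetical, and two readings are assumed: that "poly A/T" means an uninterrupted run of A or of T, and that both bounds are inclusive.

```python
import re


def gc_ratio(seq):
    """Fraction of G and C bases in the sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_poly_at(seq):
    """Length of the longest uninterrupted run of A or of T."""
    runs = re.findall(r"A+|T+", seq.upper())
    return max((len(r) for r in runs), default=0)


def passes_filter(seq, gc_strength=(0.2, 0.8), poly_strength=5):
    """Keep sequences whose GC ratio is in range and whose A/T runs are short."""
    low, high = gc_strength
    return (low <= gc_ratio(seq) <= high
            and longest_poly_at(seq) <= poly_strength)


candidates = [
    "ACGTACGTAC",  # balanced GC, no long runs
    "AAAAAAAACG",  # poly-A run of length 8
    "GCGCGCGCGC",  # GC ratio 1.0
    "ATATATATAT",  # GC ratio 0.0
]
kept = [s for s in candidates if passes_filter(s)]
```

Only the first candidate passes: the others fail on poly-A run length, excessive GC content, and insufficient GC content respectively.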