4.5 Utils - WangLabTHU/GPro GitHub Wiki
hcwang and qxdu edited on Aug 4, 2023, 1 version
Well-encapsulated functions make routine tasks much easier. Here we provide a series of utility functions that you can call directly, even in environments that are not set up for machine learning.
Caution: Here we use "fasta" to describe ".fa" format, and "sequence" to describe the format of one string per line.
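To make the two formats concrete, here is a minimal plain-Python sketch that reads either one. It is not GPro's `open_fa` implementation: `read_sequences` is a hypothetical helper, and it assumes that fasta headers start with `>` and that a sequence-format file holds exactly one sequence per line.

```python
import os
import tempfile


def read_sequences(path):
    """Hypothetical helper: read sequences from a fasta or one-per-line file."""
    seqs, current, saw_header = [], [], False
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new fasta record starts: flush the previous one, if any.
                saw_header = True
                if current:
                    seqs.append("".join(current))
                    current = []
            elif saw_header:
                # Fasta records may wrap across several lines.
                current.append(line)
            else:
                # Sequence format: every line is one complete record.
                seqs.append(line)
    if current:
        seqs.append("".join(current))
    return seqs


# Write the same two sequences in both formats and read them back.
with tempfile.TemporaryDirectory() as tmp:
    fasta_path = os.path.join(tmp, "demo.fa")
    plain_path = os.path.join(tmp, "demo.txt")
    with open(fasta_path, "w") as f:
        f.write(">s1\nACGT\n>s2\nTT\nGA\n")
    with open(plain_path, "w") as f:
        f.write("ACGT\nTTGA\n")
    fasta_seqs = read_sequences(fasta_path)
    plain_seqs = read_sequences(plain_path)

print(fasta_seqs, plain_seqs)
```

Both calls return the same list of strings, which is the form the rest of this page refers to as "sequences".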
base
utils.base includes functions for basic dataset processing. You can import them with:
from gpro.utils.base import xxx
Here are the functions:
Function | Params | Description |
---|---|---|
open_fa | file : input path | opens a file in fasta or sequence format |
write_fa | file : output path, data : data to be stored | saves sequences in fasta format |
write_seq | file : output path, data : data to be stored | saves sequences in sequence format |
write_exp | file : output path, data : data to be stored | saves expression values in sequence format |
write_profile | file : output path, seqs : sequences to be stored, pred : predictions to be stored | saves both sequences and predictions in a .csv file |
freq_table_generation | samples : input sequences | generates the PPM (position probability matrix) of the input dataset |
plot_weblogos | file , seqs , plot_mode | see https://github.com/WangLabTHU/GPro/wiki/4.4.5-WebLogo for detailed instructions |
data_check | datapath : input path | checks whether the input sequence text can be used for training a Generator or Predictor |
For example, you can check whether your training data is usable:
import os
from gpro.utils.base import data_check
default_root = "your working directory"
dataset = os.path.join(default_root, 'data/seq.txt')
data_check(dataset)
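The table above mentions that freq_table_generation produces a ppm matrix. As an illustration of what such a matrix contains, the sketch below computes per-position base frequencies in plain Python. It is not GPro's code: the function name, the A/C/G/T column order, and the positions-by-bases orientation are all assumptions, and the library's output may be laid out differently.

```python
def position_probability_matrix(samples, alphabet="ACGT"):
    """Hypothetical helper: one row per position, one column per base."""
    length = len(samples[0])
    if any(len(s) != length for s in samples):
        raise ValueError("all sequences must have the same length")
    index = {base: i for i, base in enumerate(alphabet)}
    counts = [[0] * len(alphabet) for _ in range(length)]
    for seq in samples:
        for pos, base in enumerate(seq.upper()):
            # A base outside the alphabet raises KeyError in this sketch.
            counts[pos][index[base]] += 1
    total = len(samples)
    # Dividing counts by the number of sequences makes each row sum to 1.
    return [[c / total for c in row] for row in counts]


ppm = position_probability_matrix(["ACGT", "ACGA", "TCGT", "ACCT"])
for row in ppm:
    print(row)
```

Each row sums to 1, so a row such as `[0.75, 0.0, 0.0, 0.25]` reads as "A in 75% of the sequences and T in 25% at this position".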
utils_predictor
utils.utils_predictor includes functions for predictor training and testing. You can import them with:
from gpro.utils.utils_predictor import xxx
Here are the functions:
Function | Params | Description |
---|---|---|
csv2fasta | csv_path : input path, data_path : saving path, data_name : saving prefix | only useful for cGAN; see the code for demo5 at https://github.com/WangLabTHU/GPro/wiki/5.-Demos#demo5-conditional-gan-for-promoter-flanking-sequences-design for further instructions |
EarlyStopping | None, no external interface | class that stops predictor training when the coefficients on the validation set have not increased for patience steps |
seq2onehot | seq : input data, length : input seq length | converts input sequences into one-hot encoded data; also used in generation and optimization |
open_exp | file : input path, operator : default "log2", preprocessing applied to expression levels | reads expression data |
dataset_shuffle | seqpath : input sequence path, exppath : input expression path, savetag : default "False", whether to save the results | shuffles the inputs so that the predictor does not learn only local features during training. Results will have a "_shuffle" suffix. |
dataset_split | seqpath : input sequence path, exppath : input expression path, ratio : default "0.8", the fraction used for the training set (the remainder is kept for testing), savetag : default "False", whether to save the results | splits the inputs into training and testing datasets. Results will have a "_train"/"_test" suffix. |
open_fa | file : input path | opens a file in fasta or sequence format |
write_exp | file : output path, data : data to be stored | saves expression values in sequence format |
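To illustrate what one-hot encoding of a sequence looks like (the job of seq2onehot in the table above), here is a minimal plain-Python sketch. It does not reproduce GPro's implementation: the function name `one_hot`, the A/C/G/T channel order, and the choice to encode unknown bases and padding as all-zero rows are assumptions made for this example.

```python
def one_hot(seq, length, alphabet="ACGT"):
    """Hypothetical helper: encode seq as `length` rows of len(alphabet) bits."""
    index = {base: i for i, base in enumerate(alphabet)}
    encoded = [[0] * len(alphabet) for _ in range(length)]
    # Sequences longer than `length` are truncated; shorter ones are zero-padded.
    for pos, base in enumerate(seq.upper()[:length]):
        if base in index:
            encoded[pos][index[base]] = 1
    return encoded


encoded = one_hot("ACGTN", length=6)
for row in encoded:
    print(row)
```

The first four rows each contain a single 1, while the row for `N` and the padding row stay all-zero under the assumptions above.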
For example, you can shuffle or split your training data:
import os
from gpro.utils.utils_predictor import dataset_shuffle, dataset_split
default_root = "your working directory"
seqpath = os.path.join(default_root, 'data/seq.txt')
exppath = os.path.join(default_root, 'data/exp.txt')
dataset_shuffle(seqpath, exppath, savetag=True)
dataset_split(seqpath, exppath, savetag=True)
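The key point when shuffling or splitting is that each sequence must stay paired with its expression value. The sketch below shows one way to do that on in-memory lists. It is an illustration only, not the library's code: `shuffle_and_split` and its `seed` parameter are hypothetical, and whether GPro shuffles before splitting is not stated here.

```python
import random


def shuffle_and_split(seqs, exps, ratio=0.8, seed=0):
    """Hypothetical helper: shuffle (sequence, expression) pairs, then split."""
    if len(seqs) != len(exps):
        raise ValueError("seqs and exps must have the same length")
    # Zipping first keeps every sequence attached to its own expression value.
    pairs = list(zip(seqs, exps))
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * ratio)
    return pairs[:cut], pairs[cut:]


seqs = ["ACGT", "TTGA", "GGCC", "ATAT", "CGCG",
        "TACG", "GATC", "CCAA", "TGCA", "AGTC"]
exps = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
train, test = shuffle_and_split(seqs, exps, ratio=0.8)
print(len(train), len(test))
```

With ten pairs and a ratio of 0.8, eight pairs go to the training set and two to the test set.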
utils_evaluator
utils.utils_evaluator includes functions for evaluation and follow-up experiments. You can import them with:
from gpro.utils.utils_evaluator import xxx
Here are the functions:
Function | Params | Description |
---|---|---|
read_fa | file_name : input path | opens a file in fasta or sequence format |
seq2onehot | seq : input data, length : input seq length | converts input sequences into one-hot encoded data; also used in generation and optimization |
filter_for_experiment | seqpath : input sequence path, gc_strength : default "[0.2,0.8]", poly_strength : default "5" | filters out sequences that might be difficult to implement in biological experiments. gc_strength gives the minimum and maximum G/C ratio, since very high or very low GC content tends to cause instability; poly_strength gives the maximum allowed length of poly A/T, since this structure is prone to introducing mutations during the annealing process. Filtered sequences and reports will be saved. Results will have a "_filter" suffix. |
For example, the following filter step picks the sequences that are more likely to succeed in experiments:
import os
from gpro.utils.utils_evaluator import filter_for_experiment
default_root = "your working directory"
seqpath = os.path.join(default_root, 'data/optimized_seqs.txt')
filter_for_experiment(seqpath)
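The two criteria described for filter_for_experiment can be sketched in plain Python as follows. This is only an interpretation of the description above, not the library's code: `passes_filter` is a hypothetical helper, the GC bounds are treated as inclusive, and "poly A/T" is read as a run of consecutive A's or of consecutive T's longer than poly_strength.

```python
def passes_filter(seq, gc_strength=(0.2, 0.8), poly_strength=5):
    """Hypothetical helper: apply a GC-ratio check and a homopolymer check."""
    seq = seq.upper()
    if not seq:
        return False
    # Fraction of G and C bases in the sequence.
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    if not gc_strength[0] <= gc <= gc_strength[1]:
        return False
    # Reject any run of A's or T's longer than poly_strength.
    for base in "AT":
        if base * (poly_strength + 1) in seq:
            return False
    return True


candidates = ["ACGTACGTAC", "AAAAAATTCG", "ATATATATAT", "GCGCGCGCGC", "AAAAACGCGT"]
kept = [s for s in candidates if passes_filter(s)]
print(kept)
```

Here the second candidate is rejected for its run of six A's, and the third and fourth for GC ratios of 0 and 1. The last one is kept because a run of exactly five A's is still within the limit under this reading.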