4.5 Utils - WangLabTHU/GPro GitHub Wiki

hcwang and qxdu edited on Aug 4, 2023, 1 version

Well-encapsulated helper functions make common tasks much easier. GPro provides a set of utility functions that you can call directly, even in environments that are not set up for machine learning.

Caution: On this page, "fasta" refers to the ".fa" format, and "sequence" refers to a plain-text format with one sequence string per line.
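To make the two formats concrete, the sketch below parses either one using only the standard library. `read_sequences` is a hypothetical helper written for illustration; it is not GPro's `open_fa`, whose exact behavior may differ (for example in how it handles multi-line records).

```python
def read_sequences(lines):
    """Return sequences from fasta-style or one-sequence-per-line text.

    Hypothetical helper for illustration only; not part of GPro.
    """
    seqs, current, saw_header = [], [], False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # fasta header: flush the record collected so far
            if current:
                seqs.append("".join(current))
                current = []
            saw_header = True
        elif saw_header:
            # fasta body: a record may span several lines
            current.append(line)
        else:
            # "sequence" format: each line is one complete sequence
            seqs.append(line)
    if current:
        seqs.append("".join(current))
    return seqs


fasta_text = [">seq_0", "ACGT", "TTGA", ">seq_1", "GGCC"]
sequence_text = ["ACGTTTGA", "GGCC"]

print(read_sequences(fasta_text))
print(read_sequences(sequence_text))
```

Both calls yield the same list of sequences, which is the sense in which the two formats are interchangeable as inputs.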

base

utils.base includes some functions suitable for basic dataset processing. You can call them through:

from gpro.utils.base import xxx

Here are the functions:

Function	params	Description
`open_fa`	`file`: input path	opens a file in fasta or sequence format
`write_fa`	`file`: output path, `data`: data to be stored	sequences will be saved in fasta format
`write_seq`	`file`: output path, `data`: data to be stored	sequences will be saved in sequence format
`write_exp`	`file`: output path, `data`: data to be stored	expression values will be saved in sequence format
`write_profile`	`file`: output path, `seqs`: seqs to be stored, `pred`: predictions to be stored	both sequences and predictions will be saved in a `.csv` file
`freq_table_generation`	`samples`: input sequences	the PPM (position probability matrix) of the input sequences will be generated
`plot_weblogos`	`file`, `seqs`, `plot_mode`	see https://github.com/WangLabTHU/GPro/wiki/4.4.5-WebLogo for detailed instructions
`data_check`	`datapath`: input path	checks whether the input sequence text can be used for training a Generator or Predictor

For example, you can check the availability of your training data:

import os
from gpro.utils.base import data_check

# Replace with the directory that contains your data folder
default_root = "your working directory"
dataset = os.path.join(default_root, 'data/seq.txt')
data_check(dataset)
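The table describes `freq_table_generation` as producing a PPM. As a rough illustration of what such a matrix contains, the sketch below computes per-position base frequencies for equal-length sequences. The function name `position_probability_matrix`, the `ACGT` column order, and the rows-are-positions layout are all assumptions made for this example; GPro's own output layout may differ.

```python
def position_probability_matrix(seqs, alphabet="ACGT"):
    """Per-position base frequencies for equal-length sequences.

    Hypothetical illustration; not GPro's freq_table_generation.
    Raises KeyError for characters outside `alphabet`.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    index = {base: i for i, base in enumerate(alphabet)}
    # counts[pos][col] = how often alphabet[col] occurs at position pos
    counts = [[0] * len(alphabet) for _ in range(length)]
    for s in seqs:
        for pos, base in enumerate(s.upper()):
            counts[pos][index[base]] += 1
    n = len(seqs)
    # Divide counts by the number of sequences so each row sums to 1
    return [[c / n for c in row] for row in counts]


ppm = position_probability_matrix(["ACGT", "ACGA", "TCGT", "ACCT"])
for row in ppm:
    print(row)
```

Each row sums to 1; a matrix of this kind is the usual input for a sequence logo.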

utils_predictor

utils.utils_predictor includes some functions for predictor training and testing. You can call them through:

from gpro.utils.utils_predictor import xxx

Here are the functions:

Function	params	Description
`csv2fasta`	`csv_path`: input path, `data_path`: saving path, `data_name`: saving prefix	only useful for `cGAN`; see the code for demo5 (https://github.com/WangLabTHU/GPro/wiki/5.-Demos#demo5-conditional-gan-for-promoter-flanking-sequences-design) for further instructions
`EarlyStopping`	None, no external interface	class; stops predictor training when the coefficient on the validation set has not improved for `patience` steps
`seq2onehot`	`seq`: input data, `length`: input seq length	converts input sequences into one-hot encoded data; also used in generation and optimization
`open_exp`	`file`: input path, `operator`: default "log2", preprocessing applied to expression levels	reads expression data
`dataset_shuffle`	`seqpath`: input sequence path, `exppath`: input expression path, `savetag`: default "False", whether to save results	inputs will be shuffled to keep the predictor from learning only local features during training. Results will have a "_shuffle" suffix.
`dataset_split`	`seqpath`: input sequence path, `exppath`: input expression path, `ratio`: default "0.8", the fraction used for the training set (the remainder is kept for testing), `savetag`: default "False", whether to save results	inputs will be split into training and testing datasets. Results will have a "_train"/"_test" suffix.
`open_fa`	`file`: input path	opens a file in fasta or sequence format
`write_exp`	`file`: output path, `data`: data to be stored	expression values will be saved in sequence format
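Since `EarlyStopping` has no external interface, the following sketch may help show the patience rule its description refers to. `PatienceStopper` is a hypothetical class written for this page, assuming that a higher validation coefficient is better; it is not GPro's implementation.

```python
class PatienceStopper:
    """Signal a stop once the score has not improved for `patience` steps.

    Hypothetical illustration of the patience rule; not GPro's EarlyStopping.
    Assumes a higher score (e.g. a correlation coefficient) is better.
    """

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("-inf")
        self.stale_steps = 0

    def step(self, score):
        """Record one validation score; return True when training should stop."""
        if score > self.best:
            self.best = score
            self.stale_steps = 0
        else:
            self.stale_steps += 1
        return self.stale_steps >= self.patience


stopper = PatienceStopper(patience=2)
validation_scores = [0.40, 0.55, 0.54, 0.53, 0.60]
stopped_at = None
for epoch, score in enumerate(validation_scores):
    if stopper.step(score):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)
```

With `patience=2`, training stops after the second consecutive epoch without improvement, so the later score of 0.60 is never reached. A larger `patience` trades longer training for tolerance of noisy validation scores.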

For example, you can shuffle or split your training data:

import os
from gpro.utils.utils_predictor import dataset_shuffle, dataset_split

# Replace with the directory that contains your data folder
default_root = "your working directory"
seqpath = os.path.join(default_root, 'data/seq.txt')
exppath = os.path.join(default_root, 'data/exp.txt')
# savetag=True writes the results to disk with the suffixes listed above
dataset_shuffle(seqpath, exppath, savetag=True)
dataset_split(seqpath, exppath, savetag=True)
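The key point of both operations is that each sequence must stay paired with its expression value. The sketch below shows one way to do that on in-memory lists. `shuffle_pairs` and `split_pairs` are hypothetical names, and the `seed` parameter is an addition for reproducibility; none of this is taken from GPro's code.

```python
import random


def shuffle_pairs(seqs, exps, seed=0):
    """Shuffle sequences and expression values with one shared permutation."""
    if len(seqs) != len(exps):
        raise ValueError("seqs and exps must have the same length")
    order = list(range(len(seqs)))
    random.Random(seed).shuffle(order)
    return [seqs[i] for i in order], [exps[i] for i in order]


def split_pairs(seqs, exps, ratio=0.8):
    """Put the first `ratio` fraction in the training set, the rest in the test set."""
    cut = int(len(seqs) * ratio)
    return (seqs[:cut], exps[:cut]), (seqs[cut:], exps[cut:])


seqs = ["AAAA", "CCCC", "GGGG", "TTTT", "ACGT"]
exps = [1.0, 2.0, 3.0, 4.0, 5.0]

shuffled_seqs, shuffled_exps = shuffle_pairs(seqs, exps, seed=42)
train, test = split_pairs(shuffled_seqs, shuffled_exps, ratio=0.8)

print(len(train[0]), len(test[0]))
```

Shuffling an index list rather than the two lists separately is what keeps the pairs aligned.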

utils_evaluator

utils.utils_evaluator includes some functions for evaluation and further experiment. You can call them through:

from gpro.utils.utils_evaluator import xxx

Here are the functions:

Function	params	Description
`read_fa`	`file_name`: input path	opens a file in fasta or sequence format
`seq2onehot`	`seq`: input data, `length`: input seq length	converts input sequences into one-hot encoded data; also used in generation and optimization
`filter_for_experiment`	`seqpath`: input sequence path, `gc_strength`: default "[0.2,0.8]", `poly_strength`: default "5"	This function filters out sequences that might be difficult to work with in biological experiments. `gc_strength` gives the minimum and maximum G/C ratio, since very high or very low GC content tends to cause instability; `poly_strength` is the maximum allowed length of a poly A/T run, since such runs are prone to introducing mutations during the annealing process. Filtered sequences and reports will be saved. Results will have a "_filter" suffix.
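For readers unfamiliar with one-hot encoding, here is a minimal sketch of what a function like `seq2onehot` produces. The name `one_hot`, the `ACGT` column order, and the choice to leave all-zero rows for unknown bases and for padding up to `length` are assumptions made for illustration; GPro's actual encoding may differ.

```python
def one_hot(seq, length, alphabet="ACGT"):
    """Encode `seq` as a length x len(alphabet) matrix of 0/1 values.

    Hypothetical illustration; not GPro's seq2onehot. Bases outside
    `alphabet` and positions beyond the end of `seq` stay all-zero.
    """
    index = {base: i for i, base in enumerate(alphabet)}
    matrix = [[0] * len(alphabet) for _ in range(length)]
    for pos, base in enumerate(seq.upper()[:length]):
        col = index.get(base)
        if col is not None:
            matrix[pos][col] = 1
    return matrix


encoded = one_hot("ACGN", length=5)
for row in encoded:
    print(row)
```

Fixing `length` gives every sequence the same matrix shape, which is what a model with a fixed input size needs.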

For example, here is a filtering step that picks out the sequences more likely to succeed in experiments:

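import os
from gpro.utils.utils_evaluator import filter_for_experiment

# Replace with the directory that contains your data folder
default_root = "your working directory"
seqpath = os.path.join(default_root, 'data/optimized_seqs.txt')
# Uses the default gc_strength and poly_strength thresholds
filter_for_experiment(seqpath)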
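The two criteria can be sketched in a few lines to show what the thresholds mean. The helper names below are hypothetical, and the sketch assumes that "poly A/T" means a run of a single repeated base (all A or all T) and that both GC bounds are inclusive; `filter_for_experiment` itself may define these details differently.

```python
from itertools import groupby


def gc_fraction(seq):
    """Fraction of G and C bases in a non-empty sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_at_run(seq):
    """Length of the longest run of consecutive A's or consecutive T's."""
    runs = [len(list(group)) for base, group in groupby(seq.upper()) if base in "AT"]
    return max(runs, default=0)


def passes_filter(seq, gc_range=(0.2, 0.8), max_poly=5):
    """Hypothetical check mirroring the gc_strength / poly_strength defaults."""
    low, high = gc_range
    return low <= gc_fraction(seq) <= high and longest_at_run(seq) <= max_poly


candidates = ["ACGTACGTAC", "AAAAAAAGCGC", "GCGCGCGCGC", "ATATATATAT"]
kept = [s for s in candidates if passes_filter(s)]
print(kept)
```

Of the four candidates, only the first passes: the second has a run of seven A's, the third is 100% GC, and the fourth has no GC at all.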