fresh_generators.py - cmikke97/Automatic-Malware-Signature-Generation GitHub Wiki
- import math - provides access to the mathematical functions defined by the C standard - math documentation
- from multiprocessing import cpu_count - used to get the number of CPUs in the system - multiprocessing documentation
- import numpy as np - the fundamental package for scientific computing with Python - numpy documentation
- from torch.utils import data - needed for the DataLoader, which is at the heart of PyTorch's data loading utilities - torch.utils.data documentation
- from .fresh_dataset import Dataset
train_valid_test_split(*tensors, proportions, n_samples_tot, n_families) (function) - Split a set of already instantiated tensors (memmaps) into train, validation and test subsplits according to the given proportions, keeping the families balanced.
- *tensors (arg) - Set of tensors to split
- proportions (arg) - Proportions (floats) to follow when splitting the tensors
- n_samples_tot (arg) - Total number of samples in the original tensors
- n_families (arg) - Number of families in the original tensors
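As a rough illustration of what a family-balanced split involves, the sketch below shuffles the sample indices of each family separately and cuts them according to the proportions. It is not the module's implementation: the helper name `balanced_split_indices`, the `families` label array and the `seed` parameter are all hypothetical, and the toy arrays stand in for the real memmaps.

```python
import numpy as np


def balanced_split_indices(families, proportions, seed=0):
    """Return (train, valid, test) index arrays, split family by family.

    Hypothetical helper: `families` holds one integer family label per
    sample and `proportions` holds three relative weights.
    """
    families = np.asarray(families)
    weights = np.asarray(proportions, dtype=float)
    weights = weights / weights.sum()  # normalise so the weights sum to 1
    rng = np.random.default_rng(seed)

    parts = ([], [], [])
    for family in np.unique(families):
        # Shuffle the indices of this family only, then cut them in three.
        idx = rng.permutation(np.flatnonzero(families == family))
        n_train = int(weights[0] * len(idx))
        n_valid = int(weights[1] * len(idx))
        chunks = np.split(idx, [n_train, n_train + n_valid])
        for part, chunk in zip(parts, chunks):
            part.append(chunk)

    # Merge the per-family chunks into one sorted index array per subsplit.
    return tuple(np.sort(np.concatenate(part)) for part in parts)


# Toy data: 3 families with 20 samples each.
labels = np.repeat(np.arange(3), 20)
features = np.arange(60 * 2).reshape(60, 2)

train_idx, valid_idx, test_idx = balanced_split_indices(labels, [0.5, 0.25, 0.25])
# The same index arrays are applied to every tensor so rows stay aligned.
train_x, train_y = features[train_idx], labels[train_idx]
```

Splitting index arrays rather than the data itself means any number of aligned tensors can be sliced with the same three arrays, which matches the `*tensors` calling style described above.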
GeneratorFactory (class) - Generator factory class.
- __init__(self, ds_root, splits, batch_size, num_workers, return_shas, shuffle) (member function) - Initialize the generator factory, optionally splitting the fresh dataset into training, validation and test subsplits.
  - ds_root (arg) - Path of the directory containing the fresh dataset (.dat files)
  - splits (arg) - List of 3 ints giving the relative proportions of the train, validation and test subsets (default: None)
  - batch_size (arg) - How many samples per batch to load (default: None)
  - num_workers (arg) - How many subprocesses the DataLoader uses for data loading (default: max_workers)
  - return_shas (arg) - Whether to return the sha256 of the data points (default: False)
  - shuffle (arg) - Set to True to have the data reshuffled at every epoch (default: False)
- __call__(self) (member function) - Generator-factory call method.
get_generator(ds_root, splits, batch_size, num_workers, return_shas, shuffle) (function) - Get a generator based on the provided arguments.
- ds_root (arg) - Path of the directory containing the fresh dataset (.dat files)
- splits (arg) - List of 3 ints giving the relative proportions of the train, validation and test subsets (default: None)
- batch_size (arg) - How many samples per batch to load (default: 8192)
- num_workers (arg) - How many subprocesses the DataLoader uses for data loading; if None, it is set to the current system CPU count (default: None)
- return_shas (arg) - Whether to return the sha256 of the data points (default: False)
- shuffle (arg) - Set to True to have the data reshuffled at every epoch (default: None)
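The following sketch shows how arguments like these might map onto a PyTorch DataLoader, including the documented fallback of num_workers to the CPU count when it is None. It is only an assumption about the general pattern: `ToyDataset` and `make_generator` are hypothetical names, the random tensors replace the real .dat-backed dataset, and `ds_root`, `splits` and `return_shas` are left out.

```python
from multiprocessing import cpu_count

import torch
from torch.utils import data


class ToyDataset(data.Dataset):
    """Hypothetical stand-in for the fresh dataset: random features, 0/1 labels."""

    def __init__(self, n_samples=32, n_features=4):
        self.x = torch.randn(n_samples, n_features)
        self.y = torch.arange(n_samples) % 2

    def __len__(self):
        return len(self.x)

    def __getitem__(self, index):
        return self.x[index], self.y[index]


def make_generator(dataset, batch_size=8192, num_workers=None, shuffle=None):
    # Mirror the documented behaviour: None means "use the system CPU count".
    if num_workers is None:
        num_workers = cpu_count()
    return data.DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=bool(shuffle),  # treat None as "do not shuffle"
        num_workers=num_workers,
    )


# num_workers=0 loads in the main process, which keeps this demo lightweight.
loader = make_generator(ToyDataset(), batch_size=8, num_workers=0)
batch_shapes = [tuple(x.shape) for x, _ in loader]
```

Iterating over the returned loader yields (features, labels) batches; with 32 samples and a batch size of 8 that is four batches per epoch.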