fresh_generators.py - cmikke97/Automatic-Malware-Signature-Generation GitHub Wiki

Imported Modules


  • import numpy as np - the fundamental package for scientific computing with Python - numpy documentation
  • from torch.utils import data - needed for the DataLoader, the core of PyTorch's data loading utilities - torch.utils.data documentation

  • from .fresh_dataset import Dataset

Classes and functions

train_valid_test_split(*tensors, proportions, n_samples_tot, n_families) (function) - Split a set of already instantiated tensors (memmaps) into train, validation and test subsets according to the given proportions, keeping the families balanced across the subsets.

  • *tensors (arg) - Set of tensors to split
  • proportions (arg) - Proportions (floats) to follow while splitting the tensors
  • n_samples_tot (arg) - Total number of samples in the original tensors
  • n_families (arg) - Number of families in the original tensors
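
As an illustration of what a family-balanced split can look like, here is a minimal sketch. It is not the repository's implementation: the name `balanced_split_sketch` is hypothetical, and it assumes that samples are stored family by family, with the same number of consecutive samples per family, and that `proportions` holds three relative weights.

```python
import numpy as np


def balanced_split_sketch(*arrays, proportions, n_samples_tot, n_families):
    """Split each array into (train, valid, test), taking the same share from every family.

    Assumption (not confirmed by the source): samples are laid out family by
    family, with n_samples_tot // n_families consecutive samples per family.
    """
    per_family = n_samples_tot // n_families
    total = float(sum(proportions))
    n_train = int(per_family * proportions[0] / total)
    n_valid = int(per_family * proportions[1] / total)

    train_idx, valid_idx, test_idx = [], [], []
    for fam in range(n_families):
        start = fam * per_family
        idx = np.arange(start, start + per_family)
        train_idx.extend(idx[:n_train])
        valid_idx.extend(idx[n_train:n_train + n_valid])
        test_idx.extend(idx[n_train + n_valid:])  # remainder goes to test

    # Convert to integer index arrays so that empty splits also index cleanly.
    train_idx = np.asarray(train_idx, dtype=np.intp)
    valid_idx = np.asarray(valid_idx, dtype=np.intp)
    test_idx = np.asarray(test_idx, dtype=np.intp)

    return [(a[train_idx], a[valid_idx], a[test_idx]) for a in arrays]


# Toy data: 2 families with 10 samples each.
X = np.arange(20).reshape(20, 1)
y = np.repeat(np.arange(2), 10)
(X_tr, X_va, X_te), (y_tr, y_va, y_te) = balanced_split_sketch(
    X, y, proportions=[6, 2, 2], n_samples_tot=20, n_families=2
)
```

Because the indices are computed once and applied to every array, features and labels stay aligned across the three subsets.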

GeneratorFactory (class) - Generator factory class.

  • __init__(self, ds_root, splits, batch_size, num_workers, return_shas, shuffle) (member function) - Initialize the generator factory, optionally splitting the fresh dataset into training, validation and test subsets.
    • ds_root (arg) - Path of the directory containing the fresh dataset (.dat files)
    • splits (arg) - List of 3 ints giving the relative proportions of the train, validation and test subsets (default: None)
    • batch_size (arg) - How many samples per batch to load (default: None)
    • num_workers (arg) - How many subprocesses the DataLoader uses for data loading (default: max_workers)
    • return_shas (arg) - Whether to return the sha256 of the data points or not (default: False)
    • shuffle (arg) - Set to True to have the data reshuffled at every epoch (default: False)
  • __call__(self) (member function) - Generator-factory call method.
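
The factory pattern described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual class: `ToyDataset` and `FactorySketch` are hypothetical names, the in-memory toy dataset stands in for the fresh `Dataset`, and the optional splitting step is left out.

```python
import torch
from torch.utils import data


class ToyDataset(data.Dataset):
    """Hypothetical stand-in for the fresh Dataset: 8 samples with 4 features each."""

    def __init__(self, return_shas=False):
        self.X = torch.arange(32, dtype=torch.float32).reshape(8, 4)
        self.y = torch.arange(8) % 2
        self.shas = [f"sha_{i}" for i in range(8)]  # placeholder identifiers
        self.return_shas = return_shas

    def __len__(self):
        return len(self.X)

    def __getitem__(self, index):
        if self.return_shas:
            return self.shas[index], self.X[index], self.y[index]
        return self.X[index], self.y[index]


class FactorySketch:
    """Builds a DataLoader once in __init__ and hands it back when called."""

    def __init__(self, dataset, batch_size=None, num_workers=0, shuffle=False):
        self.generator = data.DataLoader(
            dataset,
            batch_size=batch_size,
            num_workers=num_workers,
            shuffle=shuffle,
        )

    def __call__(self):
        return self.generator


# num_workers=0 keeps loading in the main process for this small example.
loader = FactorySketch(ToyDataset(), batch_size=4)()
batches = list(loader)
```

Calling the factory instance returns the prepared loader, so the construction details stay in one place and callers only iterate over batches.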

get_generator(ds_root, splits, batch_size, num_workers, return_shas, shuffle) (function) - Get generator based on the provided arguments.

  • ds_root (arg) - Path of the directory containing the fresh dataset (.dat files)
  • splits (arg) - List of 3 ints giving the relative proportions of the train, validation and test subsets (default: None)
  • batch_size (arg) - How many samples per batch to load (default: 8192)
  • num_workers (arg) - How many subprocesses the DataLoader uses for data loading; if None, it is set to the current system CPU count (default: None)
  • return_shas (arg) - Whether to return the sha256 of the data points or not (default: False)
  • shuffle (arg) - Set to True to have the data reshuffled at every epoch (default: None)
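
One plausible reading of the `num_workers` default is sketched below. The helper `resolve_num_workers` is hypothetical, and using `multiprocessing.cpu_count()` is an assumption about how the CPU count is obtained.

```python
import multiprocessing


def resolve_num_workers(num_workers=None):
    """Return num_workers unchanged, or the system CPU count when it is None."""
    if num_workers is None:
        # Assumed fallback: one worker per CPU reported by the system.
        return multiprocessing.cpu_count()
    return num_workers


explicit = resolve_num_workers(3)  # an explicit value is kept as given
fallback = resolve_num_workers()   # None falls back to the CPU count
```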
