generators_alt2.py - cmikke97/Automatic-Malware-Signature-Generation GitHub Wiki

Imported Modules



  • from .dataset_alt import Dataset


Classes and functions

get_batch(tensors, batch_size, i, return_malicious, return_counts, return_tags) (function) - Get a batch of data from the dataset (an illustrative sketch follows the argument list).

  • tensors (arg) - Dataset tensors -> S (shas, optional), X (features) and y (labels)
  • batch_size (arg) - How many samples to load
  • i (arg) - Current batch index
  • return_malicious (arg) - Whether to return the malicious label for the data points or not (default: False)
  • return_counts (arg) - Whether to return the counts for the data points or not (default: False)
  • return_tags (arg) - Whether to return the tags for the data points or not (default: False)
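
The following is a minimal, hypothetical sketch of what a get_batch-style helper does: slice batch number i out of pre-loaded arrays and select the requested labels. The (features, labels) layout and the label column order (malware flag, count, tags) are assumptions made for illustration, not the repository's actual format.

```python
import numpy as np
import torch


def get_batch_sketch(tensors, batch_size, i,
                     return_malicious=False, return_counts=False, return_tags=False):
    """Illustrative only: slice batch number i out of (X, y)-style arrays.

    tensors is assumed to be (features, labels); the label matrix is assumed to
    store [malware, count, tag_0 .. tag_n] columns. Neither assumption is taken
    from the repository's code.
    """
    start, stop = i * batch_size, (i + 1) * batch_size
    features = torch.from_numpy(np.asarray(tensors[0][start:stop]))
    labels = torch.from_numpy(np.asarray(tensors[1][start:stop]))

    selected = {}
    if return_malicious:
        selected['malware'] = labels[:, 0]   # assumed: column 0 holds the malicious flag
    if return_counts:
        selected['count'] = labels[:, 1]     # assumed: column 1 holds the detection count
    if return_tags:
        selected['tags'] = labels[:, 2:]     # assumed: remaining columns hold the tags
    return features, selected
```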

get_batch_unpack(args) (function) - Pass-through function for unpacking get_batch arguments (see the sketch below).

  • args (arg) - Arguments dictionary
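
Pass-through wrappers like this are typically needed because multiprocessing pool map functions deliver exactly one argument per task. A minimal sketch, reusing get_batch_sketch from the block above; the dictionary keys mirror the documented get_batch parameters:

```python
from multiprocessing import Pool


def get_batch_unpack_sketch(args):
    """Illustrative pass-through: unpack an arguments dictionary into the batch function."""
    return get_batch_sketch(**args)


# Hypothetical usage with a worker pool (one arguments dictionary per batch index):
# arg_dicts = [{'tensors': (X, y), 'batch_size': 1024, 'i': i} for i in range(n_batches)]
# with Pool(processes=4) as pool:
#     for features, labels in pool.imap(get_batch_unpack_sketch, arg_dicts):
#         ...
```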

FastTensorDataLoader (class) - A DataLoader-like object for a set of tensors that can be much faster than TensorDataset + DataLoader, because the PyTorch DataLoader grabs individual indices of the dataset and calls cat to assemble each batch (slow). Inspired by the 'shuffle in-place' version from https://discuss.pytorch.org/t/dataloader-much-slower-than-manual-batching/27014/6; it was modified from the version available at that link to work with the pre-processed dataset (numpy memmap) and with multiple workers (multiprocessing). A usage sketch follows the member list.

  • __init__(self, *tensors, batch_size, shuffle, num_workers, use_malicious_labels, use_count_labels, use_tag_labels) (member function) - Initialize FastTensorDataLoader class.
    • tensors (arg) - Dataset tensors. All must have the same length along dimension 0
    • batch_size (arg) - Size of the batches to load (default: 1024)
    • shuffle (arg) - If True, shuffle the data whenever an iterator is created out of this object (default: False)
    • num_workers (arg) - How many workers (threads) to use for data loading (default: None -> 1 worker)
    • use_malicious_labels (arg) - Whether to return the malicious label for the data points or not (default: False)
    • use_count_labels (arg) - Whether to return the counts for the data points or not (default: False)
    • use_tag_labels (arg) - Whether to return the tags for the data points or not (default: False)
  • __del__(self) (member function) - FastTensorDataLoader destructor.
  • __iter__(self) (member function) - Returns the FastTensorDataLoader (dataset iterator) itself.
  • __next__(self) (member function) - Get next batch of data.
  • __len__(self) (member function) - Get FastTensorDataLoader length (number of batches).
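
A hypothetical usage sketch, assuming the constructor signature documented above; the import path, the array shapes and the exact contents of each yielded batch are assumptions:

```python
import numpy as np

# Hypothetical import path -- adjust to wherever generators_alt2.py lives in the repository.
from generators_alt2 import FastTensorDataLoader

# Toy in-memory stand-ins for the pre-processed (memmap) arrays; shapes are assumptions.
X = np.random.rand(10_000, 2381).astype(np.float32)   # feature vectors (assumed width)
y = np.random.rand(10_000, 13).astype(np.float32)     # label matrix (assumed width)

loader = FastTensorDataLoader(X, y,
                              batch_size=1024,
                              shuffle=True,
                              num_workers=4,
                              use_malicious_labels=True,
                              use_count_labels=False,
                              use_tag_labels=True)

print(len(loader))        # number of batches
for batch in loader:      # the exact tuple layout depends on the use_*_labels flags
    pass
```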

GeneratorFactory (class) - Generator factory class (a usage sketch follows the member list).

  • __init__(self, ds_root, batch_size, mode, num_workers, n_samples, use_malicious_labels, use_count_labels, use_tag_labels, return_shas, shuffle) (member function) - Initialize generator factory.
    • ds_root (arg) - Path of the directory where to find the pre-processed dataset (containing .dat files)
    • batch_size (arg) - How many samples per batch to load (default: None -> 1024)
    • mode (arg) - Mode of use of the dataset object (it may be 'train', 'validation' or 'test') (default: 'train')
    • num_workers (arg) - How many workers (threads) to use for data loading (default: max_workers)
    • n_samples (arg) - Number of samples to consider (used just to access the right pre-processed files) (default: None -> all)
    • use_malicious_labels (arg) - Whether to return the malicious label for the data points or not (default: False)
    • use_count_labels (arg) - Whether to return the counts for the data points or not (default: False)
    • use_tag_labels (arg) - Whether to return the tags for the data points or not (default: False)
    • return_shas (arg) - Whether to return the sha256 of the data points or not (default: False)
    • shuffle (arg) - Set to True to have the data reshuffled at every epoch (default: None -> if mode is 'train' then shuffle is set to True, otherwise it is set to False)
  • __call__(self) (member function) - Generator-factory call method.
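
A hypothetical usage sketch, assuming the __init__ signature documented above and that calling the factory returns the iterable generator; the import path and argument values are illustrative only:

```python
# Hypothetical import path -- adjust to the actual package layout.
from generators_alt2 import GeneratorFactory

factory = GeneratorFactory(ds_root='/path/to/pre-processed/dataset',  # directory with the .dat files
                           batch_size=1024,
                           mode='validation',
                           num_workers=2,
                           n_samples=None,              # None -> all pre-processed samples
                           use_malicious_labels=True,
                           use_count_labels=False,
                           use_tag_labels=True,
                           return_shas=False,
                           shuffle=None)                # None -> False, since mode != 'train'

# Assumption: calling the factory returns the configured, iterable generator.
generator = factory()
for batch in generator:
    pass
```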

get_generator(ds_root, batch_size, mode, num_workers, n_samples, use_malicious_labels, use_count_labels, use_tag_labels, return_shas, shuffle) (function) - Get a generator based on the provided arguments (a usage sketch follows the argument list).

  • ds_root (arg) - Path of the directory where to find the pre-processed dataset (containing .dat files)
  • batch_size (arg) - How many samples per batch to load (default: 8192)
  • mode (arg) - Mode of use of the dataset object (may be 'train', 'validation' or 'test') (default: 'train')
  • num_workers (arg) - How many workers (threads) to use for data loading (if None -> set to current system cpu count) (default: None)
  • n_samples (arg) - Number of samples to consider (used just to access the right pre-processed files) (default: None -> all)
  • use_malicious_labels (arg) - Whether to return the malicious label for the data points or not (default: True)
  • use_count_labels (arg) - Whether to return the counts for the data points or not (default: True)
  • use_tag_labels (arg) - Whether to return the tags for the data points or not (default: True)
  • return_shas (arg) - Whether to return the sha256 of the data points or not (default: False)
  • shuffle (arg) - Set to True to have the data reshuffled at every epoch (default: None -> if mode is 'train' then shuffle is set to True, otherwise it is set to False)
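
A hypothetical training-loop sketch built on the documented signature; the import path, the iterability of the returned object and the batch contents are assumptions:

```python
# Hypothetical import path -- adjust to the actual package layout.
from generators_alt2 import get_generator

generator = get_generator(ds_root='/path/to/pre-processed/dataset',
                          batch_size=8192,
                          mode='train',
                          num_workers=None,     # None -> current system CPU count
                          n_samples=None,       # None -> all pre-processed samples
                          use_malicious_labels=True,
                          use_count_labels=True,
                          use_tag_labels=True,
                          return_shas=False,
                          shuffle=None)         # None -> True, because mode is 'train'

for epoch in range(10):
    for batch in generator:   # assumed: each batch bundles features and the requested labels
        # forward pass, loss computation and optimizer step would go here
        pass
```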

