sorel_generators.py - cmikke97/Automatic-Malware-Signature-Generation GitHub Wiki

Imported Modules


  • from torch.utils import data - needed for the DataLoader class, which is at the heart of PyTorch's data loading utilities - torch.utils.data documentation

  • from .sorel_dataset import Dataset

Classes and functions

GeneratorFactory (class) - Generator factory class.

  • __init__(self, ds_root, batch_size, mode, num_workers, n_samples, use_malicious_labels, use_count_labels, use_tag_labels, return_shas, features_lmdb, remove_missing_features, shuffle) (member function) - Initialize generator factory.
    • ds_root (arg) - Dataset root directory (where to find meta.db file)
    • batch_size (arg) - How many samples per batch to load (default: None)
    • mode (arg) - Mode of use of the dataset object (may be 'train', 'validation' or 'test') (default: 'train')
    • num_workers (arg) - How many subprocesses the DataLoader uses for data loading (default: max_workers)
    • n_samples (arg) - Number of samples to consider (-1 if you want to consider them all) (default: -1)
    • use_malicious_labels (arg) - Whether to return the malicious label for the data points or not (default: False)
    • use_count_labels (arg) - Whether to return the counts for the data points or not (default: False)
    • use_tag_labels (arg) - Whether to return the tags for the data points or not (default: False)
    • return_shas (arg) - Whether to return the sha256 of the data points or not (default: False)
    • features_lmdb (arg) - Name of the file containing the ember_features for the data (default: 'ember_features')
    • remove_missing_features (arg) - Whether to remove data points with missing features; accepts False, None, 'scan' or a file path. With 'scan', the database is scanned to find and remove the data points with missing features; with a file path, the JSON file at that path is used to identify them (default: 'scan')
    • shuffle (arg) - Set to True to have the data reshuffled at every epoch (default: False)
  • __call__(self) (member function) - Generator-factory call method.
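
The callable-factory pattern described above (configure in __init__, obtain the generator through __call__) can be sketched as follows. This is a minimal stand-in, not the repository's code: ToyGeneratorFactory and its samples argument are hypothetical, and plain lists replace the torch.utils.data DataLoader that the real class presumably wraps, so the sketch runs without PyTorch.

```python
from typing import Iterator, List, Sequence


class ToyGeneratorFactory:
    """Hypothetical stand-in for GeneratorFactory; not the repository's code."""

    def __init__(self, samples: Sequence[int], batch_size: int = 4,
                 n_samples: int = -1) -> None:
        # n_samples == -1 means "consider every sample", as in the argument list above.
        if n_samples != -1:
            samples = samples[:n_samples]
        self.samples = list(samples)
        self.batch_size = batch_size

    def __call__(self) -> Iterator[List[int]]:
        # Calling the configured factory returns the batch generator.
        for start in range(0, len(self.samples), self.batch_size):
            yield self.samples[start:start + self.batch_size]


# Build the factory once, then call it to get something iterable.
batches = list(ToyGeneratorFactory(range(10), batch_size=4)())
```

Keeping configuration in __init__ and creation in __call__ lets the same settings be reused to produce a fresh generator whenever one is needed.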

get_generator(ds_root, batch_size, mode, num_workers, n_samples, use_malicious_labels, use_count_labels, use_tag_labels, return_shas, features_lmdb, remove_missing_features, shuffle) (function) - Initialize generator factory.

  • ds_root (arg) - Dataset root directory (where to find meta.db file)
  • batch_size (arg) - How many samples per batch to load (default: 8192)
  • mode (arg) - Mode of use of the dataset object (may be 'train', 'validation' or 'test') (default: 'train')
  • num_workers (arg) - How many subprocesses the DataLoader uses for data loading (default: None)
  • n_samples (arg) - Number of samples to consider (-1 if you want to consider them all) (default: -1)
  • use_malicious_labels (arg) - Whether to return the malicious label for the data points or not (default: True)
  • use_count_labels (arg) - Whether to return the counts for the data points or not (default: True)
  • use_tag_labels (arg) - Whether to return the tags for the data points or not (default: True)
  • return_shas (arg) - Whether to return the sha256 of the data points or not (default: False)
  • features_lmdb (arg) - Name of the file containing the ember_features for the data
  • remove_missing_features (arg) - Whether to remove data points with missing features; accepts False, None, 'scan' or a file path. With 'scan', the database is scanned to find and remove the data points with missing features; with a file path, the JSON file at that path is used to identify them (default: 'scan')
  • shuffle (arg) - Set to True to have the data reshuffled at every epoch (default: False)
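
The four accepted forms of remove_missing_features (False, None, 'scan', or a file path) could be dispatched roughly as in the sketch below. Everything in it is an assumption made for illustration: resolve_missing and its scan callback are hypothetical names, and the JSON file is assumed to hold a flat list of sha256 keys, which the page does not specify.

```python
import json
import os
import tempfile
from typing import Callable, Optional, Set, Union


def resolve_missing(option: Union[None, bool, str],
                    scan: Callable[[], Set[str]]) -> Optional[Set[str]]:
    """Return the keys to drop, or None when nothing should be removed."""
    if option is None or option is False:
        return None                      # removal disabled
    if option == 'scan':
        return scan()                    # scan the database for missing features
    if isinstance(option, str):
        # Any other string is treated as a path to a JSON file
        # (assumed layout: a list of keys).
        with open(option, 'r') as f:
            return set(json.load(f))
    raise TypeError(f'unsupported remove_missing_features value: {option!r}')


# Usage: write a small JSON file, then resolve each form of the option.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'missing.json')
    with open(path, 'w') as f:
        json.dump(['aa', 'bb'], f)
    from_file = resolve_missing(path, scan=lambda: set())

from_scan = resolve_missing('scan', scan=lambda: {'cc'})
disabled = resolve_missing(False, scan=lambda: {'cc'})
```

Passing the scan as a callback keeps the potentially slow database pass lazy: it only runs when 'scan' is actually requested.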
