sorel_dataset.py - cmikke97/Automatic-Malware-Signature-Generation GitHub Wiki

Imported Modules



Classes and functions

LMDBReader (class) - Class used to read features in lmdb format.

  • __init__(self, path, postproc_func) (member function) - Initialize the LMDBReader.
    • path (arg) - Location of the lmdb database
    • postproc_func (arg) - Post processing function to apply to data points (default: None)
  • __call__(self, key) (member function) - LMDBReader call method.
    • key (arg) - Key (sha256) of the data point to retrieve
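
The callable-reader interface described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the class name `DictBackedReader` is hypothetical, a plain dict stands in for the lmdb database so the example runs without the lmdb package, and the zlib-compressed JSON serialization is an assumption.

```python
import json
import zlib


class DictBackedReader:
    """Callable reader mirroring the documented LMDBReader interface.

    Assumption: a dict of zlib-compressed JSON blobs stands in for the
    lmdb database; the real storage format may differ.
    """

    def __init__(self, store, postproc_func=None):
        self.store = store                  # hypothetical stand-in for the lmdb database
        self.postproc_func = postproc_func  # optional transform applied to each data point

    def __call__(self, key):
        # The key is the sha256 string of the data point to retrieve.
        raw = self.store[key.encode("ascii")]
        x = json.loads(zlib.decompress(raw))  # assumed serialization format
        if self.postproc_func is not None:
            x = self.postproc_func(x)
        return x


sha = "ab" * 32  # placeholder sha256-like key, not a real sample hash
store = {sha.encode("ascii"): zlib.compress(json.dumps([1.0, -2.0, 3.0]).encode("utf-8"))}

reader = DictBackedReader(store, postproc_func=lambda x: [abs(v) for v in x])
print(reader(sha))
```

Passing the post-processing function at construction time keeps the lookup and the transformation in one call, so the caller only ever handles ready-to-use data points.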

features_postproc_func(x) (function) - Features post-processing function.

  • x (arg) - Data point to apply the post processing function to

tags_postproc_func(x) (function) - Tags post-processing function.

  • x (arg) - Data point to apply the post processing function to

Dataset (class) - Sorel20M Dataset class.

  • __init__(self, metadb_path, features_lmdb_path, return_malicious, return_counts, return_tags, return_shas, mode, binarize_tag_labels, n_samples, offset, remove_missing_features, postprocess_function) (member function) - Initialize dataset class.
    • metadb_path (arg) - Path to the metadb file
    • features_lmdb_path (arg) - Path to the features lmdb file
    • return_malicious (arg) - Whether to return the malicious label for the data point or not (default: True)
    • return_counts (arg) - Whether to return the counts for the data point or not (default: True)
    • return_tags (arg) - Whether to return the tags for the data points or not (default: True)
    • return_shas (arg) - Whether to return the sha256 of the data points or not (default: False)
    • mode (arg) - Mode of use of the dataset object (may be 'train', 'validation' or 'test') (default: 'train')
    • binarize_tag_labels (arg) - Whether to binarize the tag values (default: True)
    • n_samples (arg) - Maximum number of data points to consider (-1 to consider all of them) (default: -1)
    • offset (arg) - Offset at which to start retrieving samples (default: 0)
    • remove_missing_features (arg) - Whether to remove data points with missing features. Accepted values are False, None, 'scan' or a file path. With 'scan', the database is scanned to find and remove the data points with missing features; with a file path, that file (in JSON format) is used to identify the data points with missing features (default: True)
    • postprocess_function (arg) - Post processing function to use on each data point
  • __len__(self) (member function) - Get dataset total length.
  • __getitem__(self, index) (member function) - Get item from dataset.
    • index (arg) - Index of the item to get
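
How `n_samples`, `offset` and the `return_*` flags might interact with `__len__` and `__getitem__` can be sketched as below. This is a simplified illustration under stated assumptions: the class name `ToyDataset` is hypothetical, an in-memory list of (sha256, label) pairs stands in for the meta database, any callable can serve as the features reader, and the tuple layout returned by `__getitem__` is a guess rather than the documented behavior.

```python
class ToyDataset:
    """Map-style dataset sketch following the documented __len__/__getitem__ interface.

    Assumptions: `records` is a list of (sha256, is_malware) pairs standing in
    for the meta database rows; `features_reader` maps a sha256 to its features.
    """

    def __init__(self, records, features_reader, return_malicious=True,
                 return_shas=False, n_samples=-1, offset=0):
        # Skip `offset` rows, then keep at most `n_samples` (-1 keeps them all).
        end = None if n_samples == -1 else offset + n_samples
        self.records = records[offset:end]
        self.features_reader = features_reader
        self.return_malicious = return_malicious
        self.return_shas = return_shas

    def __len__(self):
        return len(self.records)

    def __getitem__(self, index):
        sha, is_malware = self.records[index]
        features = self.features_reader(sha)
        labels = {}
        if self.return_malicious:
            labels["malware"] = is_malware
        # Assumed return layout: the sha256 comes first when it is requested.
        if self.return_shas:
            return sha, features, labels
        return features, labels


# Placeholder records and features, not real dataset entries.
records = [("a" * 64, 0), ("b" * 64, 1), ("c" * 64, 1)]
features = {sha: [float(i)] for i, (sha, _) in enumerate(records)}

ds = ToyDataset(records, features.__getitem__, n_samples=2, offset=1)
print(len(ds))
print(ds[0])
```

Applying `offset` and `n_samples` once in the constructor keeps `__getitem__` a plain index lookup, which is the usual shape for map-style datasets consumed by a data loader.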
