sorel_dataset.py - cmikke97/Automatic-Malware-Signature-Generation GitHub Wiki
import configparser - implements a basic configuration language for Python programs - configparser documentation
import json - JSON encoder and decoder - json documentation
import os - provides a portable way of using operating system dependent functionality - os documentation
import sqlite3 - provides a SQL interface compliant with the DB-API 2.0 specification - sqlite3 documentation
import zlib - allows compression and decompression, using the zlib library - zlib documentation

import lmdb - Python binding for the LMDB "Lightning" Database - lmdb documentation
import msgpack - efficient binary serialization format - msgpack documentation
import numpy as np - the fundamental package for scientific computing with Python - numpy documentation
from logzero import logger - robust and effective logging for Python - logzero documentation
from torch.utils import data - used to import data.Dataset - torch.utils.data documentation
from tqdm import tqdm - instantly makes loops show a smart progress meter - tqdm documentation
LMDBReader (class) - Class used to read features in lmdb format.
__init__(self, path, postproc_func) (member function) - Initialize LMDBReader.
path (arg) - Location of the lmdb database
postproc_func (arg) - Post-processing function to apply to data points (default: None)
__call__(self, key) (member function) - LMDBReader call method.
key (arg) - Key (sha256) of the data point to retrieve
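
The reader pattern described above (look up a sha256 key, decode the stored value, optionally post-process it) can be sketched as follows. This is only an illustration under stated assumptions, not the module's code: a plain dict stands in for the LMDB environment, json stands in for msgpack so the example runs with the standard library alone, and the name DictReader is hypothetical. That stored values are zlib-compressed is itself an assumption, inferred from the zlib import listed above.

```python
import json
import zlib


class DictReader:
    """Stand-in for an LMDB-backed reader: look up a key, decompress, decode, post-process."""

    def __init__(self, store, postproc_func=None):
        # Hypothetical storage: a dict mapping bytes keys to zlib-compressed bytes.
        self.store = store
        # Optional callable applied to every decoded data point.
        self.postproc_func = postproc_func

    def __call__(self, key):
        raw = self.store.get(key.encode("ascii"))
        if raw is None:
            return None  # nothing is stored under this key
        x = json.loads(zlib.decompress(raw))
        if self.postproc_func is not None:
            x = self.postproc_func(x)
        return x


# Build a one-entry store and read it back with a doubling post-processor.
store = {b"abc123": zlib.compress(json.dumps([1, 2, 3]).encode("utf-8"))}
reader = DictReader(store, postproc_func=lambda x: [2 * v for v in x])
doubled = reader("abc123")
```

Making the object callable lets a dataset treat the reader like a function of the key, and passing postproc_func at construction time keeps the decoding and the transformation in one place.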
features_postproc_func(x) (function) - Features post-processing function.
x (arg) - Data point to apply the post-processing function to
tags_postproc_func(x) (function) - Tags post-processing function.
x (arg) - Data point to apply the post-processing function to
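
This page does not say which transformations the two post-processing functions apply, so the sketch below only illustrates the contract: each takes one data point and returns the transformed data point. The signed logarithm for features, the tag-count mapping for tags, and the names signed_log and tags_to_vector are all assumptions made for illustration.

```python
import math


def signed_log(values):
    """Hypothetical feature transform: v -> sign(v) * log(1 + |v|)."""
    return [math.copysign(math.log1p(abs(v)), v) for v in values]


def tags_to_vector(tag_counts, tag_names, binarize=True):
    """Hypothetical tag transform: order counts by tag name, optionally as 0/1 flags."""
    counts = [tag_counts.get(name, 0) for name in tag_names]
    if binarize:
        return [int(c > 0) for c in counts]
    return counts


features = signed_log([0.0, 1.0, -1.0])
flags = tags_to_vector({"worm": 3, "adware": 0}, ["adware", "worm"])
counts = tags_to_vector({"worm": 3, "adware": 0}, ["adware", "worm"], binarize=False)
```

Either function could be passed wherever a post-processing callable is expected, since both accept a single data point and return a new value without mutating the input.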
Dataset (class) - Sorel20M Dataset class.
__init__(self, metadb_path, features_lmdb_path, return_malicious, return_counts, return_tags, return_shas, mode, binarize_tag_labels, n_samples, offset, remove_missing_features, postprocess_function) (member function) - Initialize dataset class.
metadb_path (arg) - Path to the metadb file
features_lmdb_path (arg) - Path to the features lmdb file
return_malicious (arg) - Whether to return the malicious label for the data point (default: True)
return_counts (arg) - Whether to return the counts for the data point (default: True)
return_tags (arg) - Whether to return the tags for the data point (default: True)
return_shas (arg) - Whether to return the sha256 of the data point (default: False)
mode (arg) - Mode of use of the dataset object; may be 'train', 'validation' or 'test' (default: 'train')
binarize_tag_labels (arg) - Whether to binarize the tag values (default: True)
n_samples (arg) - Maximum number of data points to consider; use -1 to consider them all (default: -1)
offset (arg) - Offset at which to start retrieving samples (default: 0)
remove_missing_features (arg) - Whether to remove data points with missing features; it can be False, None, 'scan' or a file path. With 'scan', the database is scanned in order to remove the data points with missing features; with a file path, a file in JSON format is used to determine the data points with missing features (default: True)
postprocess_function (arg) - Post-processing function to use on each data point
__len__(self) (member function) - Get dataset total length.
__getitem__(self, index) (member function) - Get item from dataset.
index (arg) - Index of the item to get
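
The interplay of n_samples, offset and the return_* flags with __len__ and __getitem__ can be sketched with a small map-style dataset. This is a simplified illustration, not the class documented above: the table name meta, its columns sha256 and is_malware, the in-memory database, the dict of features and the class name TinyMetaDataset are all hypothetical, and the order of the returned tuple is an assumption. A map-style torch.utils.data.Dataset needs the same two methods, so the shape carries over.

```python
import sqlite3


class TinyMetaDataset:
    """Map-style dataset sketch: rows of a metadata table joined with a feature lookup."""

    def __init__(self, conn, features, return_malicious=True, return_shas=False,
                 n_samples=-1, offset=0):
        self.features = features  # hypothetical: dict mapping sha256 -> feature list
        self.return_malicious = return_malicious
        self.return_shas = return_shas
        # In SQLite a negative LIMIT means "no upper bound", which matches n_samples=-1.
        self.rows = conn.execute(
            "SELECT sha256, is_malware FROM meta ORDER BY sha256 LIMIT ? OFFSET ?",
            (n_samples, offset),
        ).fetchall()

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        sha, is_malware = self.rows[index]
        item = [self.features[sha]]
        if self.return_malicious:
            item.append(is_malware)
        if self.return_shas:
            item.insert(0, sha)  # sha256 goes first when requested
        return tuple(item)


# Hypothetical metadata table with three samples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meta (sha256 TEXT PRIMARY KEY, is_malware INTEGER)")
conn.executemany("INSERT INTO meta VALUES (?, ?)", [("aa", 0), ("bb", 1), ("cc", 1)])
features = {"aa": [0.0, 1.0], "bb": [2.0, 3.0], "cc": [4.0, 5.0]}

ds_all = TinyMetaDataset(conn, features)
ds_window = TinyMetaDataset(conn, features, n_samples=2, offset=1, return_shas=True)
```

Here ds_all covers every row, while ds_window skips the first row and keeps at most two, which is how n_samples and offset together select a contiguous slice of the data.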