sorel_dataset.py - cmikke97/Automatic-Malware-Signature-Generation GitHub Wiki

Imported Modules

import configparser - implements a basic configuration language for Python programs - configparser documentation
import json - json encoder and decoder - json documentation
import os - provides a portable way of using operating system dependent functionality - os documentation
import sqlite3 - provides a SQL interface compliant with the DB-API 2.0 specification - sqlite3 documentation
import zlib - allows compression and decompression, using the zlib library - zlib documentation

import lmdb - Python binding for the LMDB ‘Lightning’ Database - lmdb documentation
import msgpack - efficient binary serialization format - msgpack documentation
import numpy as np - the fundamental package for scientific computing with Python - numpy documentation
from logzero import logger - robust and effective logging for Python - logzero documentation
from torch.utils import data - used to import data.Dataset - torch.utils.data documentation
from tqdm import tqdm - instantly makes loops show a smart progress meter - tqdm documentation

Classes and functions

LMDBReader (class) - Class used to read features in lmdb format.

__init__(self, path, postproc_func) (member function) - Initialize the LMDBReader.
- path (arg) - Location of the lmdb database
- postproc_func (arg) - Post-processing function to apply to data points (default: None)
__call__(self, key) (member function) - Retrieve the data point stored under the given key.
- key (arg) - Key (sha256) of the data point to retrieve
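
The lookup that `__call__` performs can be sketched as follows. This is a minimal illustration, not the module's actual implementation: a plain dict stands in for the LMDB environment and `json` stands in for `msgpack`, so the block runs with the standard library alone. The names `SketchReader` and `store` are hypothetical, and so is the assumption that each value is a zlib-compressed serialized blob keyed by the ASCII-encoded sha256.

```python
import json
import zlib


class SketchReader:
    """Hypothetical stand-in for LMDBReader backed by an in-memory dict."""

    def __init__(self, store, postproc_func=None):
        self.store = store                  # stand-in for the LMDB environment
        self.postproc_func = postproc_func  # optional per-item transform

    def __call__(self, key):
        blob = self.store.get(key.encode("ascii"))
        if blob is None:
            return None  # key not present in the store
        # Assumed layout: zlib-compressed serialized payload
        # (json here; the real module imports msgpack).
        x = json.loads(zlib.decompress(blob))
        if self.postproc_func is not None:
            x = self.postproc_func(x)
        return x


# Build a one-entry store keyed by a made-up, sha256-like string.
store = {b"ab12": zlib.compress(json.dumps([1, 2, 3]).encode("utf-8"))}
reader = SketchReader(store, postproc_func=sum)
result = reader("ab12")   # decompressed, decoded, then summed
missing = reader("ffff")  # unknown key
```

Returning `None` for an unknown key, rather than raising, is one possible design; it lets a caller detect data points with missing features by checking the return value.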

features_postproc_func(x) (function) - Features post-processing function.

x (arg) - Data point to which the post-processing function is applied
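
This page does not say which transform `features_postproc_func` applies. Purely to illustrate what a post-processing function passed to `LMDBReader` could look like, the sketch below applies a signed logarithm, a common way to compress heavy-tailed numeric features. The function name `signed_log` and the choice of transform are assumptions, and plain `math` is used instead of `numpy` to keep the block dependency-free.

```python
import math


def signed_log(values):
    """Hypothetical post-processing: sign(v) * log(1 + |v|) per element."""
    return [math.copysign(math.log1p(abs(v)), v) for v in values]


out = signed_log([0.0, 1.0, -1.0])
```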

tags_postproc_func(x) (function) - Tags post-processing function.

x (arg) - Data point to which the post-processing function is applied

Dataset (class) - Sorel20M Dataset class.

__init__(self, metadb_path, features_lmdb_path, return_malicious, return_counts, return_tags, return_shas, mode, binarize_tag_labels, n_samples, offset, remove_missing_features, postprocess_function) (member function) - Initialize dataset class.
- metadb_path (arg) - Path to the metadb file
- features_lmdb_path (arg) - Path to the features lmdb file
- return_malicious (arg) - Whether to return the malicious label for the data point or not (default: True)
- return_counts (arg) - Whether to return the counts for the data point or not (default: True)
- return_tags (arg) - Whether to return the tags for the data points or not (default: True)
- return_shas (arg) - Whether to return the sha256 of the data points or not (default: False)
- mode (arg) - Mode of use of the dataset object (may be 'train', 'validation' or 'test') (default: 'train')
- binarize_tag_labels (arg) - Whether to binarize the tag values (default: True)
- n_samples (arg) - Maximum number of data points to consider (-1 if you want to consider them all) (default: -1)
- offset (arg) - Offset where to start retrieving samples (default: 0)
- remove_missing_features (arg) - Whether and how to remove data points with missing features; it can be False, None, 'scan', or a file path. With 'scan', the database is scanned to find and remove the data points with missing features; with a file path, the given file (in JSON format) is used to identify them (default: True)
- postprocess_function (arg) - Post-processing function to use on each data point
__len__(self) (member function) - Get dataset total length.
__getitem__(self, index) (member function) - Get item from dataset.
- index (arg) - Index of the item to get
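
How `n_samples`, `offset`, and the `return_*` flags could interact is sketched below. This is not the module's actual implementation: the table name `meta`, its columns, the in-memory SQLite database, and the `SketchDataset` class are all hypothetical, and the class does not subclass `torch.utils.data.Dataset` so that the block runs without PyTorch. It assumes that `n_samples` and `offset` map to SQL `LIMIT` and `OFFSET`, and that `__getitem__` returns a tuple whose contents depend on the flags.

```python
import sqlite3


class SketchDataset:
    """Hypothetical map-style dataset over a metadata table."""

    def __init__(self, conn, features, return_malicious=True,
                 return_shas=False, n_samples=-1, offset=0):
        # In SQLite a negative LIMIT means "no limit", matching n_samples=-1.
        rows = conn.execute(
            "SELECT sha256, is_malware FROM meta ORDER BY sha256 "
            "LIMIT ? OFFSET ?", (n_samples, offset)).fetchall()
        self.keys = [r[0] for r in rows]
        self.labels = {r[0]: r[1] for r in rows}
        self.features = features  # stand-in for an LMDBReader-like callable
        self.return_malicious = return_malicious
        self.return_shas = return_shas

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, index):
        key = self.keys[index]
        item = [self.features(key)]
        if self.return_malicious:
            item.append(self.labels[key])
        if self.return_shas:
            item.insert(0, key)  # sha256 goes first when requested
        return tuple(item)


# Made-up metadata and features for four samples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meta (sha256 TEXT, is_malware INTEGER)")
conn.executemany("INSERT INTO meta VALUES (?, ?)",
                 [("a", 0), ("b", 1), ("c", 1), ("d", 0)])
feature_store = {"a": [0.1], "b": [0.2], "c": [0.3], "d": [0.4]}

ds = SketchDataset(conn, feature_store.get, return_shas=True,
                   n_samples=2, offset=1)
```

With `n_samples=2` and `offset=1`, the sketch skips the first row and keeps the next two, so `len(ds)` reflects the selected subset, not the full table.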