build_fresh_dataset.py - cmikke97/Automatic-Malware-Signature-Generation GitHub Wiki
import configparser - implements a basic configuration language for Python programs - configparser documentation
import json - JSON encoder and decoder - json documentation
import multiprocessing - supports spawning processes using an API similar to the threading module - multiprocessing documentation
import os - provides a portable way of using operating-system-dependent functionality - os documentation
import sys - system-specific parameters and functions - sys documentation
import tempfile - used to create temporary files and directories - tempfile documentation
from multiprocessing.pool import ThreadPool - pool of worker threads to which jobs can be submitted - multiprocessing documentation
import baker - easy, powerful access to Python functions from the command line - baker documentation
import mlflow - open source platform for managing the end-to-end machine learning lifecycle - mlflow documentation
from logzero import logger - robust and effective logging for Python - logzero documentation
from tqdm import tqdm - instantly makes loops show a smart progress meter - tqdm documentation
from emberFeatures.features import PEFeatureExtractor
from emberFeatures.vectorize_features import create_vectorized_features
from utils.malware_bazaar_api import MalwareBazaarAPI
retrieve_malware_sample(args) (function) - Pass-through function that unpacks the arguments used to retrieve malware samples.
args (arg) - Arguments for retrieving malware samples
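
A pass-through function of this kind is usually needed because the `map`-style methods of a pool hand each worker a single argument. The following is a minimal sketch of that pattern, not the script's actual code: `fetch_sample` is a hypothetical stand-in for the real download call, and the tuple layout of `args` and the helper `retrieve_all` are assumptions made for illustration.

```python
from multiprocessing.pool import ThreadPool


def fetch_sample(sha256_hash, dest_dir, unzip=False):
    """Hypothetical stand-in for the real download call; it only builds a path."""
    suffix = "" if unzip else ".zip"
    return f"{dest_dir}/{sha256_hash}{suffix}"


def retrieve_malware_sample(args):
    """Unpack one argument tuple and forward it to the worker function."""
    return fetch_sample(*args)


def retrieve_all(hashes, dest_dir, unzip=False, workers=4):
    """Run the pass-through function over all hashes with a thread pool."""
    # Each job is a single tuple, which is what pool.imap passes to the worker.
    jobs = [(sha256_hash, dest_dir, unzip) for sha256_hash in hashes]
    with ThreadPool(workers) as pool:
        # imap yields results in the same order as the submitted jobs.
        return list(pool.imap(retrieve_malware_sample, jobs))
```

Wrapping the `imap` iterator in `tqdm(...)` would be one way to show progress while the jobs complete.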
download_and_extract(available_data, family, label, dest_dir, metadata_file_path, raw_features_dest_file, amount, unzip) (function) - Download 'amount' malware samples (and related metadata) associated with the provided tag/signature from Malware Bazaar.
available_data (arg) - Dictionary mapping each family to the list of SHA-256 hashes of its available files
family (arg) - Family to retrieve metadata for
label (arg) - Numerical label to use for the current family
dest_dir (arg) - Destination directory in which to save the files
metadata_file_path (arg) - File in which to save the samples' metadata
raw_features_dest_file (arg) - Name of the file in which to save the downloaded files' raw features
amount (arg) - Number of samples to retrieve metadata for (default: 10)
unzip (arg) - Whether to unzip the downloaded file (default: False)
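
To make the roles of `available_data`, `family`, `label` and `amount` concrete, here is a small sketch of how the samples for one family might be selected and recorded. It performs no network access. The helper names `plan_downloads` and `append_metadata`, the record fields, and the one-JSON-object-per-line metadata layout are all assumptions for illustration; the wiki does not describe the real file format.

```python
import json


def plan_downloads(available_data, family, label, amount=10):
    """Pick up to `amount` hashes for `family` and pair each with its label."""
    hashes = available_data.get(family, [])[:amount]
    return [{"sha256": h, "family": family, "label": label} for h in hashes]


def append_metadata(records, metadata_file_path):
    """Append one JSON object per line (assumed layout, not the script's)."""
    with open(metadata_file_path, "a", encoding="utf-8") as metadata_file:
        for record in records:
            metadata_file.write(json.dumps(record) + "\n")
```

Opening the metadata file in append mode lets repeated calls, one per family, accumulate into a single file.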
extract_raw_features(binary_path, raw_features_dest_file, label, feature_version, print_warnings) (function) - Extract EMBER features from a PE file.
binary_path (arg) - Path to the PE file
raw_features_dest_file (arg) - Where to write the raw features
label (arg) - Family label
feature_version (arg) - EMBER feature version (default: 2)
print_warnings (arg) - Whether to print warnings (default: False)
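
The overall flow (read the binary, extract raw features, attach the label, append to the destination file) can be sketched as below. This is an illustration under stated assumptions, not the script's implementation: `StubExtractor` is a hypothetical replacement for `PEFeatureExtractor` so the example runs without the EMBER code, its `raw_features(bytez)` method is modelled on the upstream EMBER extractor, the `extractor_cls` parameter is added only for the sketch, `print_warnings` is omitted, and the one-JSON-object-per-line output is assumed.

```python
import hashlib
import json


class StubExtractor:
    """Hypothetical stand-in for PEFeatureExtractor (no PE parsing here)."""

    def __init__(self, feature_version=2):
        self.feature_version = feature_version

    def raw_features(self, bytez):
        # A real extractor returns many feature groups; this returns two fields.
        return {"sha256": hashlib.sha256(bytez).hexdigest(), "size": len(bytez)}


def extract_raw_features(binary_path, raw_features_dest_file, label,
                         feature_version=2, extractor_cls=StubExtractor):
    """Read a file, extract raw features, tag them with the label, append as JSON."""
    with open(binary_path, "rb") as binary_file:
        bytez = binary_file.read()
    features = extractor_cls(feature_version).raw_features(bytez)
    features["label"] = label
    # Append mode: each call adds one line to the shared raw-features file.
    with open(raw_features_dest_file, "a", encoding="utf-8") as out_file:
        out_file.write(json.dumps(features) + "\n")
    return features
```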
build_fresh_dataset(dataset_dest_dir) (function, baker command) - Build a fresh dataset by retrieving samples from Malware Bazaar for a list of malware families stored in a configuration file.
dataset_dest_dir (arg) - Directory in which to write the newly created dataset
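
Reading a family list from a configuration file with `configparser` might look like the sketch below. The section name `freshDataset`, the option name `families`, the comma-separated format, and the helper `load_families` are assumptions for illustration; the wiki does not state how the configuration file is laid out. Labels are assigned in listing order, which is one plausible way to obtain the numerical `label` passed to the other functions.

```python
import configparser


def load_families(config_text):
    """Parse a comma-separated family list from an INI-style string."""
    config = configparser.ConfigParser()
    config.read_string(config_text)
    # Section and option names are hypothetical; fallback covers a missing entry.
    raw = config.get("freshDataset", "families", fallback="")
    families = [name.strip() for name in raw.split(",") if name.strip()]
    # Assign numerical labels in listing order.
    return {family: label for label, family in enumerate(families)}
```

For a file on disk, `config.read(path)` would replace `config.read_string(config_text)`.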
__main__ (main) - Starts baker so that the script can be run from the command line, with function names and parameters exposed as the command-line interface using optparse-style options