build_fresh_dataset.py - cmikke97/Automatic-Malware-Signature-Generation GitHub Wiki

Imported Modules


  • import baker - easy, powerful access to Python functions from the command line - baker documentation
  • import mlflow - open source platform for managing the end-to-end machine learning lifecycle - mlflow documentation
  • from logzero import logger - robust and effective logging for Python - logzero documentation
  • from tqdm import tqdm - instantly makes loops show a smart progress meter - tqdm documentation

  • from emberFeatures.features import PEFeatureExtractor
  • from emberFeatures.vectorize_features import create_vectorized_features
  • from utils.malware_bazaar_api import MalwareBazaarAPI

Classes and functions

retrieve_malware_sample(args) (function) - Pass-through function that unpacks the arguments used to retrieve malware samples.

  • args (arg) - Arguments for retrieving malware samples
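
A pass-through function of this kind is typically needed when a worker pool's map-style call hands each task to the worker as a single object. The sketch below illustrates the pattern under that assumption; whether the script actually uses a pool is not stated here, and `download_one`, `retrieve_sample`, and `retrieve_all` are hypothetical names, not the script's own functions.

```python
from multiprocessing.pool import ThreadPool


def download_one(sha256, dest_dir):
    # Hypothetical stand-in for the real download routine:
    # it only builds the path where the sample would be saved.
    return f"{dest_dir}/{sha256}"


def retrieve_sample(args):
    # Pass-through: the pool delivers one tuple per task,
    # so unpack it into the positional arguments of the worker.
    return download_one(*args)


def retrieve_all(hashes, dest_dir, workers=4):
    # Bundle the arguments of each task into a single tuple.
    tasks = [(sha256, dest_dir) for sha256 in hashes]
    with ThreadPool(workers) as pool:
        # imap yields results in the same order as the tasks.
        return list(pool.imap(retrieve_sample, tasks))
```

Calling `retrieve_all(["abc", "def"], "samples")` runs `download_one` once per hash and returns the results in input order.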

download_and_extract(available_data, family, label, dest_dir, metadata_file_path, raw_features_dest_file, amount, unzip) (function) - Download 'amount' malware samples (and their related metadata) associated with the provided tag/signature from Malware Bazaar.

  • available_data (arg) - Dictionary containing, for each family, the list of sha256 hashes of all available files
  • family (arg) - Family whose samples and metadata are retrieved
  • label (arg) - Numerical label to use for the current family
  • dest_dir (arg) - Destination directory in which to save the downloaded files
  • metadata_file_path (arg) - File in which to save the samples' metadata
  • raw_features_dest_file (arg) - Name of the file in which to save the downloaded files' raw features
  • amount (arg) - Number of samples to retrieve (default: 10)
  • unzip (arg) - Whether to unzip the downloaded files (default: False)
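
To make the shape of `available_data` concrete, here is a minimal sketch of how up to `amount` hashes might be picked for one family. The helper `select_hashes` and the choice to take the first entries are assumptions made for illustration; the script's actual selection logic may differ.

```python
def select_hashes(available_data, family, amount=10):
    # available_data maps each family name to a list of sha256 hashes.
    # An unknown family yields an empty selection instead of an error.
    hashes = available_data.get(family, [])
    # Take at most 'amount' hashes (assumption: the first ones in the list).
    return list(hashes)[:amount]
```

For example, with `available_data = {"Heodo": ["h1", "h2", "h3"]}`, the call `select_hashes(available_data, "Heodo", amount=2)` selects `["h1", "h2"]`.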

extract_raw_features(binary_path, raw_features_dest_file, label, feature_version, print_warnings) (function) - Extract raw EMBER features from a PE file.

  • binary_path (arg) - Path to the PE file
  • raw_features_dest_file (arg) - File in which to write the raw features
  • label (arg) - Family label
  • feature_version (arg) - EMBER feature version (default: 2)
  • print_warnings (arg) - Whether to print warnings or not (default: False)
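
The flow described above (read the binary, extract raw features, attach the label, write the result) can be sketched as follows. This assumes the extractor exposes a `raw_features(bytes)` method returning a dictionary, as the original EMBER `PEFeatureExtractor` does, and that features are appended as one JSON object per line; the helper `write_raw_features` and the `StubExtractor` class are hypothetical and stand in for the real code.

```python
import json


class StubExtractor:
    # Hypothetical stand-in for PEFeatureExtractor so the sketch
    # runs without the EMBER dependencies installed.
    def raw_features(self, bytez):
        return {"size": len(bytez)}


def write_raw_features(extractor, binary_path, raw_features_dest_file, label):
    # Read the whole file as bytes.
    with open(binary_path, "rb") as f:
        bytez = f.read()
    # Assumption: the extractor returns a JSON-serializable dict.
    raw = extractor.raw_features(bytez)
    # Attach the family label to the feature record.
    raw["label"] = label
    # Append one JSON object per line (assumed output layout).
    with open(raw_features_dest_file, "a") as out:
        out.write(json.dumps(raw) + "\n")
    return raw
```

Passing the extractor in as a parameter keeps the sketch testable; in the real script the extractor would presumably be built from `feature_version`.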

build_fresh_dataset(dataset_dest_dir) (function, baker command) - Build a fresh dataset by retrieving samples from Malware Bazaar for the malware families listed in a configuration file.

  • dataset_dest_dir (arg) - Directory in which to write the newly created dataset

__main__ (main) - Start baker so that the script can be run with its function names and parameters as the command-line interface, using optparse-style options.

