preprocess_dataset.py - cmikke97/Automatic-Malware-Signature-Generation GitHub Wiki

Imported Modules


  • import baker - easy, powerful access to Python functions from the command line - baker documentation
  • import mlflow - open source platform for managing the end-to-end machine learning lifecycle - mlflow documentation
  • import numpy as np - the fundamental package for scientific computing with Python - numpy documentation
  • import torch - tensor library like NumPy, with strong GPU support - pytorch documentation
  • from logzero import logger - robust and effective logging for Python - logzero documentation
  • from tqdm import tqdm - instantly makes loops show a smart progress meter - tqdm documentation

  • from generators.sorel_dataset import Dataset
  • from generators.sorel_generators import get_generator
  • from utils.preproc_utils import check_files
  • from utils.preproc_utils import steps


Classes and functions

preprocess_dataset(ds_path, destination_dir, training_n_samples, validation_n_samples, test_n_samples, batch_size, workers, remove_missing_features, binarize_tag_labels) (function, baker command) - Pre-process Sorel20M dataset.

  • ds_path (arg) - The path to the directory containing the meta.db file
  • destination_dir (arg) - The directory in which to save the pre-processed dataset files
  • training_n_samples (arg) - Maximum number of training data samples to use (0 means use all) (default: 0)
  • validation_n_samples (arg) - Maximum number of validation data samples to use (0 means use all) (default: 0)
  • test_n_samples (arg) - Maximum number of test data samples to use (0 means use all) (default: 0)
  • batch_size (arg) - Number of samples to load per batch (default: 8192)
  • workers (arg) - Number of worker processes the dataloader should use (if None, multiprocessing.cpu_count() is used) (default: None)
  • remove_missing_features (arg) - Whether to remove data points with missing features; accepted values are False, None, 'scan', or a file path. If 'scan', the database is scanned to find and remove the data points with missing features; if a file path, the file (in JSON format) is used to determine which data points have missing features (default: 'scan')
  • binarize_tag_labels (arg) - Whether to binarize the tag values (default: True)
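
The argument conventions listed above can be sketched as a few small helpers. This is only an illustration under assumptions: the helper names (resolve_workers, resolve_n_samples, missing_features_mode) and the returned mode labels are hypothetical and do not come from the script, whose actual handling may differ.

```python
import multiprocessing


def resolve_workers(workers=None):
    """Hypothetical helper: fall back to the CPU count when workers is None."""
    return multiprocessing.cpu_count() if workers is None else int(workers)


def resolve_n_samples(n_samples, available):
    """Hypothetical helper: treat 0 as 'use all available samples'."""
    return available if n_samples == 0 else min(n_samples, available)


def missing_features_mode(remove_missing_features="scan"):
    """Hypothetical helper: classify the option into (mode, path).

    The mode labels 'keep', 'scan' and 'file' are assumptions made
    for this sketch only.
    """
    if remove_missing_features is False or remove_missing_features is None:
        return ("keep", None)   # do not remove anything
    if remove_missing_features == "scan":
        return ("scan", None)   # scan the database for missing features
    # Any other value is treated as the path to a JSON file
    return ("file", str(remove_missing_features))


if __name__ == "__main__":
    print(resolve_workers(None))
    print(resolve_n_samples(0, 1000))
    print(missing_features_mode("missing.json"))
```

Keeping each convention in its own function makes the defaults (0, None, 'scan') easy to check in isolation before any data is loaded.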

__main__ (main) - Starts baker so that the script can be run from the command line, with function names and parameters exposed as the command line interface using optparse-style options
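
To illustrate the idea of exposing a function's parameters as a command line interface, here is a minimal sketch that uses the standard library (argparse and inspect) as a stand-in. It is not baker and does not reproduce baker's behavior; preprocess_dataset_stub, build_parser and run are hypothetical names, and the stub takes only a subset of the real parameters.

```python
import argparse
import inspect


def preprocess_dataset_stub(ds_path, destination_dir, batch_size=8192):
    """Hypothetical stand-in for the real command; it only echoes its arguments."""
    return {
        "ds_path": ds_path,
        "destination_dir": destination_dir,
        "batch_size": batch_size,
    }


def build_parser(func):
    """Derive an argparse parser from a function signature."""
    parser = argparse.ArgumentParser(prog=func.__name__)
    for name, param in inspect.signature(func).parameters.items():
        if param.default is inspect.Parameter.empty:
            # Parameters without a default become positional arguments
            parser.add_argument(name)
        else:
            # Parameters with a default become --options of the same type
            parser.add_argument(
                f"--{name}", type=type(param.default), default=param.default
            )
    return parser


def run(func, argv):
    """Parse argv against the function signature and call the function."""
    args = build_parser(func).parse_args(argv)
    return func(**vars(args))


if __name__ == "__main__":
    print(run(preprocess_dataset_stub, ["/data/sorel", "/data/out", "--batch_size", "1024"]))
```

This sketch only handles defaults whose type can parse a string (such as int); a default of None or a boolean flag would need extra handling.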

