preprocess_dataset.py - cmikke97/Automatic-Malware-Signature-Generation GitHub Wiki
- `import configparser` - implements a basic configuration language for Python programs - configparser documentation
- `import os` - provides a portable way of using operating system dependent functionality - os documentation
- `import sys` - system-specific parameters and functions - sys documentation
- `import baker` - easy, powerful access to Python functions from the command line - baker documentation
- `import mlflow` - open source platform for managing the end-to-end machine learning lifecycle - mlflow documentation
- `import numpy as np` - the fundamental package for scientific computing with Python - numpy documentation
- `import torch` - tensor library like NumPy, with strong GPU support - pytorch documentation
- `from logzero import logger` - robust and effective logging for Python - logzero documentation
- `from tqdm import tqdm` - instantly makes loops show a smart progress meter - tqdm documentation
- `from generators.sorel_dataset import Dataset`
- `from generators.sorel_generators import get_generator`
- `from utils.preproc_utils import check_files`
- `from utils.preproc_utils import steps`
`preprocess_dataset(ds_path, destination_dir, training_n_samples, validation_n_samples, test_n_samples, batch_size, workers, remove_missing_features, binarize_tag_labels)` (function, baker command) - Pre-processes the Sorel20M dataset.
- `ds_path` (arg) - The path to the directory containing the meta.db file
- `destination_dir` (arg) - The directory where to save the pre-processed dataset files
- `training_n_samples` (arg) - Maximum number of training data samples to use (if 0, take all) (default: 0)
- `validation_n_samples` (arg) - Maximum number of validation data samples to use (if 0, take all) (default: 0)
- `test_n_samples` (arg) - Maximum number of test data samples to use (if 0, take all) (default: 0)
- `batch_size` (arg) - How many samples per batch to load (default: 8192)
- `workers` (arg) - How many worker processes the dataloader should use (if None, use multiprocessing.cpu_count()) (default: None)
- `remove_missing_features` (arg) - Whether to remove data points with missing features; it can be False, None, 'scan', or a filepath. If 'scan', the database is scanned to find and remove the data points with missing features; if a filepath, that file (in JSON format) is used to determine which data points have missing features (default: 'scan')
- `binarize_tag_labels` (arg) - Whether to binarize the tag values (default: True)
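As a rough illustration of what `binarize_tag_labels=True` implies, the sketch below turns per-sample tag counts into 0/1 indicator labels with NumPy. The array shape and values are assumptions for illustration, not taken from the script:

```python
import numpy as np

# Hypothetical tag matrix: one row per sample, one column per tag family,
# where each entry counts how many detections reported that tag.
tags = np.array([[0, 3, 1],
                 [2, 0, 0]])

# Binarize: any non-zero tag value becomes 1, zero stays 0.
binarized = (tags > 0).astype(np.int8)
print(binarized.tolist())  # [[0, 1, 1], [1, 0, 0]]
```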
__main__ (main) - Starts baker so that the script can be run from the command line, using function names and parameters as the command line interface with optparse-style options.
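When `remove_missing_features` is a filepath, the idea is that a previously saved JSON file records which samples lack features, so those can be dropped without rescanning the database. A minimal sketch of that filtering step, with the JSON layout and all key names purely hypothetical:

```python
import json

# Hypothetical JSON payload a previous 'scan' could have produced; the real
# file layout and key names are assumptions, not taken from the repository.
scan_result = '{"missing": ["sha_a", "sha_c"]}'
missing = set(json.loads(scan_result)["missing"])

# Drop the samples whose identifiers appear in the missing-features set.
all_keys = ["sha_a", "sha_b", "sha_c", "sha_d"]
kept = [k for k in all_keys if k not in missing]
print(kept)  # ['sha_b', 'sha_d']
```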