preprocess_dataset.py - cmikke97/Automatic-Malware-Signature-Generation GitHub Wiki

Imported Modules

import configparser - implements a basic configuration language for Python programs - configparser documentation
import os - provides a portable way of using operating system dependent functionality - os documentation
import sys - system-specific parameters and functions - sys documentation

import baker - easy, powerful access to Python functions from the command line - baker documentation
import mlflow - open source platform for managing the end-to-end machine learning lifecycle - mlflow documentation
import numpy as np - the fundamental package for scientific computing with Python - numpy documentation
import torch - tensor library like NumPy, with strong GPU support - pytorch documentation
from logzero import logger - robust and effective logging for Python - logzero documentation
from tqdm import tqdm - instantly makes loops show a smart progress meter - tqdm documentation

from generators.sorel_dataset import Dataset
from generators.sorel_generators import get_generator
from utils.preproc_utils import check_files
from utils.preproc_utils import steps

Classes and functions

preprocess_dataset(ds_path, destination_dir, training_n_samples, validation_n_samples, test_n_samples, batch_size, workers, remove_missing_features, binarize_tag_labels) (function, baker command) - Pre-processes the Sorel20M dataset.

ds_path (arg) - The path to the directory containing the meta.db file
destination_dir (arg) - The directory where to save the pre-processed dataset files
training_n_samples (arg) - Maximum number of training data samples to use (0 means use all) (default: 0)
validation_n_samples (arg) - Maximum number of validation data samples to use (0 means use all) (default: 0)
test_n_samples (arg) - Maximum number of test data samples to use (0 means use all) (default: 0)
batch_size (arg) - How many samples per batch to load (default: 8192)
workers (arg) - Number of worker processes the dataloader should use (if None, multiprocessing.cpu_count() is used) (default: None)
remove_missing_features (arg) - Whether to remove data points with missing features; accepted values are False, None, 'scan', or a file path. With 'scan', the database is scanned to find and remove the data points with missing features; with a file path, the given file (in JSON format) is used to identify them (default: 'scan')
binarize_tag_labels (arg) - Whether to binarize the tag values (default: True)
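
The accepted forms of remove_missing_features and the "0 means use all" convention of the *_n_samples arguments can be illustrated with a small sketch. This is not the repository's code: resolve_missing_features and cap_samples are hypothetical helper names, and the mode labels ('off', 'scan', 'file') are assumptions made for the example. How command-line strings such as "False" are converted before reaching the function is not covered here.

```python
from typing import Optional, Tuple, Union


def resolve_missing_features(
    value: Union[str, bool, None]
) -> Tuple[str, Optional[str]]:
    """Classify a remove_missing_features value (hypothetical helper).

    Returns (mode, path), where mode is 'off', 'scan' or 'file'.
    """
    # False and None both disable the removal step.
    if value is None or value is False:
        return "off", None
    # The literal string 'scan' requests a scan of the database.
    if value == "scan":
        return "scan", None
    # Any other string is treated as the path of a JSON file that
    # identifies the data points with missing features.
    if isinstance(value, str):
        return "file", value
    raise ValueError(f"unsupported remove_missing_features value: {value!r}")


def cap_samples(available: int, max_samples: int = 0) -> int:
    """Apply the '0 means use all' convention (hypothetical helper)."""
    if max_samples < 0:
        raise ValueError("max_samples must be non-negative")
    # 0 keeps every available sample; otherwise never exceed what exists.
    return available if max_samples == 0 else min(available, max_samples)
```

For example, cap_samples(100, 0) keeps all 100 samples, while cap_samples(100, 30) keeps 30.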

__main__ (main) - Starts baker so that the script can be run from the command line, with function names and parameters exposed as the command-line interface through optparse-style options

Repository file structure

root/
|
├── src/
|   |
|   ├── FreshDatasetBuilder/
|   |   |
|   |   ├── emberFeatures/
|   |   |   |
|   |   |   ├── __init__.py  - - - - - - - - - - - - - - - (python module init)
|   |   |   ├── features.py  - - - - - - - - - - - - - - - (features python code 📖Wiki)
|   |   |   └── vectorize_features.py  - - - - - - - - - - (vectorize features python code 📖Wiki)
|   |   |
|   |   ├── utils/
|   |   |   |
|   |   |   ├── __init__.py  - - - - - - - - - - - - - - - (python module init)
|   |   |   ├── fresh_dataset_utils.py - - - - - - - - - - (fresh dataset utils python code 📖Wiki)
|   |   |   └── malware_bazaar_api.py  - - - - - - - - - - (malware bazaar API python code 📖Wiki)
|   |   |
|   |   ├── __init__.py  - - - - - - - - - - - - - - - (python module init)
|   |   └── build_fresh_dataset.py - - - - - - - - - - (fresh dataset builder python code 📖Wiki)
|   |
|   ├── Model/
|   |   |
|   |   ├── nets/
|   |   |   |
|   |   |   ├── generators/
|   |   |   |   |
|   |   |   |   ├── __init__.py  - - - - - - - - - - - - - - - (python module init)
|   |   |   |   ├── dataset.py - - - - - - - - - - - - - - - - (dataset (base) code 📖Wiki)
|   |   |   |   ├── dataset_alt.py - - - - - - - - - - - - - - (dataset_alt code 📖Wiki)
|   |   |   |   ├── fresh_dataset.py - - - - - - - - - - - - - (fresh_dataset code 📖Wiki)
|   |   |   |   ├── fresh_generators.py  - - - - - - - - - - - (fresh_generators code 📖Wiki)
|   |   |   |   ├── generators.py  - - - - - - - - - - - - - - (generators (base) code 📖Wiki)
|   |   |   |   ├── generators_alt1.py - - - - - - - - - - - - (generators_alt1 code 📖Wiki)
|   |   |   |   ├── generators_alt2.py - - - - - - - - - - - - (generators_alt2 code 📖Wiki)
|   |   |   |   └── generators_alt3.py - - - - - - - - - - - - (generators_alt3 code 📖Wiki)
|   |   |   |
|   |   |   ├── utils/
|   |   |   |   |
|   |   |   |   ├── __init__.py  - - - - - - - - - - - - - - - (python module init)
|   |   |   |   └── Net.py - - - - - - - - - - - - - - - - - - (Net code 📖Wiki)
|   |   |   |
|   |   |   ├── __init__.py  - - - - - - - - - - - - - - - (python module init)
|   |   |   ├── ALOHA_net.py - - - - - - - - - - - - - - - (ALOHA_net code 📖Wiki)
|   |   |   ├── Contrastive_Model_net.py - - - - - - - - - (Contrastive_Model_net code 📖Wiki)
|   |   |   ├── Family_Classifier_net.py - - - - - - - - - (Family_Classifier_net code 📖Wiki)
|   |   |   ├── MTJE_net.py  - - - - - - - - - - - - - - - (MTJE_net code 📖Wiki)
|   |   |   ├── MTJE_net_cosine.py - - - - - - - - - - - - (MTJE_net_cosine code 📖Wiki)
|   |   |   └── MTJE_net_pairwise_distance.py  - - - - - - (MTJE_net_pairwise_distance code 📖Wiki)
|   |   |
|   |   ├── utils/
|   |   |   |
|   |   |   ├── __init__.py  - - - - - - - - - - - - - - - (python module init)
|   |   |   ├── contrastive_utils.py - - - - - - - - - - - (contrastive_utils code 📖Wiki)
|   |   |   ├── opt_utils.py - - - - - - - - - - - - - - - (opt_utils code 📖Wiki)
|   |   |   ├── plot_utils.py  - - - - - - - - - - - - - - (plot_utils code 📖Wiki)
|   |   |   └── ranking_metrics.py - - - - - - - - - - - - (ranking_metrics code 📖Wiki)
|   |   |
|   |   ├── __init__.py  - - - - - - - - - - - - - - - (python module init)
|   |   ├── evaluate.py  - - - - - - - - - - - - - - - (evaluate code 📖Wiki)
|   |   ├── evaluate_contrastive.py  - - - - - - - - - (evaluate_contrastive code 📖Wiki)
|   |   ├── evaluate_family_classifier.py  - - - - - - (evaluate_family_classifier code 📖Wiki)
|   |   ├── evaluate_fresh.py  - - - - - - - - - - - - (evaluate_fresh code 📖Wiki)
|   |   ├── gen3_speed_evaluation.py - - - - - - - - - (gen3_speed_evaluation code 📖Wiki)
|   |   ├── plot.py  - - - - - - - - - - - - - - - - - (plot code 📖Wiki)
|   |   ├── plot_contrastive.py  - - - - - - - - - - - (plot_contrastive code 📖Wiki)
|   |   ├── plot_family_classifier.py  - - - - - - - - (plot_family_classifier code 📖Wiki)
|   |   ├── plot_fresh.py  - - - - - - - - - - - - - - (plot_fresh code 📖Wiki)
|   |   ├── train.py - - - - - - - - - - - - - - - - - (train code 📖Wiki)
|   |   ├── train_contrastive.py - - - - - - - - - - - (train_contrastive code 📖Wiki)
|   |   └── train_family_classifier.py - - - - - - - - (train_family_classifier code 📖Wiki)
|   |
|   ├── Sorel20mDataset/
|   |   |
|   |   ├── generators/
|   |   |   |
|   |   |   ├── __init__.py  - - - - - - - - - - - - - - - (python module init)
|   |   |   ├── sorel_dataset.py - - - - - - - - - - - - - (sorel_dataset code 📖Wiki)
|   |   |   └── sorel_generators.py  - - - - - - - - - - - (sorel_generators code 📖Wiki)
|   |   |
|   |   ├── utils/
|   |   |   |
|   |   |   ├── __init__.py  - - - - - - - - - - - - - - - (python module init)
|   |   |   ├── download_utils.py  - - - - - - - - - - - - (download_utils code 📖Wiki)
|   |   |   └── preproc_utils.py - - - - - - - - - - - - - (preproc_utils code 📖Wiki)
|   |   |
|   |   ├── __init__.py  - - - - - - - - - - - - - - - (python module init)
|   |   ├── preprocess_dataset.py  - - - - - - - - - - (preprocess_dataset code 📖Wiki)
|   |   ├── preprocess_ds_multi.py - - - - - - - - - - (preprocess_ds_multi code 📖Wiki)
|   |   └── sorel20mDownloader.py  - - - - - - - - - - (sorel20mDownloader code 📖Wiki)
|   |
|   ├── utils/
|   |   |
|   |   ├── __init__.py  - - - - - - - - - - - - - - - (python module init)
|   |   └── workflow_utils.py  - - - - - - - - - - - - (workflow_utils code 📖Wiki)
|   |
|   ├── __init__.py  - - - - - - - - - - - - - - - (python module init)
|   ├── config.ini - - - - - - - - - - - - - - - - (configuration file 📖Wiki)
|   └── main.py  - - - - - - - - - - - - - - - - - (main code 📖Wiki)
|
├── MLproject  - - - - - - - - - - - - - - - - (MLproject file)
├── README.md  - - - - - - - - - - - - - - - - (README)
└── conda.yaml - - - - - - - - - - - - - - - - (conda yaml environment)