train.py - cmikke97/Automatic-Malware-Signature-Generation GitHub Wiki
-
import configparser - implements a basic configuration language for Python programs - configparser documentation -
import importlib - provides the implementation of the import statement in Python source code - importlib documentation -
import json - json encoder and decoder - json documentation -
import os - provides a portable way of using operating system dependent functionality - os documentation -
import shutil - offers high-level operations on files and collections of files, such as recursively copying an entire directory tree rooted at src to a directory named dst - shutil documentation -
import sys - system-specific parameters and functions - sys documentation -
import time - provides various time-related functions - time documentation -
from collections import defaultdict - dict subclass that calls a factory function to supply missing values - collections documentation -
from copy import deepcopy - creates a new object and recursively copies the original object's elements - copy documentation -
from urllib import parse - standard interface to break Uniform Resource Locator (URL) strings into components - urllib.parse documentation
-
import baker - easy, powerful access to Python functions from the command line - baker documentation -
import mlflow - open source platform for managing the end-to-end machine learning lifecycle - mlflow documentation -
import numpy as np - the fundamental package for scientific computing with Python - numpy documentation -
import psutil - used for retrieving information on running processes and system utilization - psutil documentation -
import torch - tensor library like NumPy, with strong GPU support - pytorch documentation -
from logzero import logger - robust and effective logging for Python - logzero documentation
from utils.opt_utils import get_opt_state
from utils.opt_utils import save_opt_state
import_modules(net_type, gen_type) (function) - Dynamically import network, dataset and generator modules depending on the provided arguments.
-
net_type (arg) - Network type (possible values: mtje, mtje_cosine, mtje_pairwise_distance, aloha) -
gen_type (arg) - Generator type (possible values: base, alt1, alt2, alt3)
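The dynamic import described above can be sketched with importlib.import_module. This is only an illustration, not the repository's implementation: the dotted module names (nets.aloha_net, nets.generators.generators_alt1 and so on), the helper names and the importer parameter are all hypothetical; only the accepted net_type and gen_type values come from this page.

```python
import importlib

# Accepted values, as listed in the argument descriptions above.
NET_TYPES = ("mtje", "mtje_cosine", "mtje_pairwise_distance", "aloha")
GEN_TYPES = ("base", "alt1", "alt2", "alt3")


def module_names(net_type, gen_type):
    """Map the two arguments to dotted module names (hypothetical layout)."""
    if net_type not in NET_TYPES:
        raise ValueError(f"unknown network type: {net_type!r}")
    if gen_type not in GEN_TYPES:
        raise ValueError(f"unknown generator type: {gen_type!r}")
    # Assumption: alternative generators live in modules with a suffix.
    suffix = "" if gen_type == "base" else f"_{gen_type}"
    return {
        "network": f"nets.{net_type}_net",
        "generator": f"nets.generators.generators{suffix}",
        "dataset": f"nets.generators.dataset{suffix}",
    }


def import_modules_sketch(net_type, gen_type, importer=importlib.import_module):
    """Import every module selected by the arguments and return them by role."""
    names = module_names(net_type, gen_type)
    return {role: importer(name) for role, name in names.items()}


# The module names are made up, so a stub importer is passed here
# instead of letting importlib try to load them.
selected = import_modules_sketch("aloha", "alt1", importer=lambda name: name)
print(selected["generator"])
```

Validating the arguments before importing keeps a mistyped value from surfacing as a less readable ModuleNotFoundError.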
train_network(ds_path, net_type, gen_type, run_id, training_run, batch_size, epochs, training_n_samples, validation_n_samples, use_malicious_labels, use_count_labels, use_tag_labels, feature_dimension, random_seed, workers) (function, baker command) - Train a feed-forward neural network on EMBER 2.0 features, optionally with additional targets as described in the ALOHA paper (https://arxiv.org/abs/1903.05700). The SMART tags are based on https://arxiv.org/abs/1905.06262.
-
ds_path (arg) - Path of the directory containing the pre-processed dataset (.dat files) -
net_type (arg) - Network to use, one of 'mtje', 'mtje_cosine', 'mtje_pairwise_distance' or 'aloha' (default: 'mtje') -
gen_type (arg) - Generator (and dataset) class to use, one of 'base', 'alt1', 'alt2' or 'alt3' (default: 'base') -
run_id (arg) - Mlflow run id of a previously stopped run to resume (default: None) -
training_run (arg) - Training run identifier; at least 2 runs are needed to plot base evaluation results with mean and confidence (default: 0) -
batch_size (arg) - How many samples per batch to load (default: 8192) -
epochs (arg) - How many epochs to train for (default: 10) -
training_n_samples (arg) - Number of training samples to consider (used to access the right files) (default: 0 -> all) -
validation_n_samples (arg) - Number of validation samples to consider (used to access the right files) (default: 0 -> all) -
use_malicious_labels (arg) - Whether or not (1/0) to use malware/benignware labels as a target (default: 1) -
use_count_labels (arg) - Whether or not (1/0) to use the counts as an additional target (default: 1) -
use_tag_labels (arg) - Whether or not (1/0) to use the tags as additional targets (default: 1) -
feature_dimension (arg) - The input dimension of the model (default: 2381 -> EMBER 2.0 feature size) -
random_seed (arg) - If provided, seed random number generation with this value (default: None -> no seeding) -
workers (arg) - How many workers (threads) the dataloader should use (default: 0 -> use multiprocessing.cpu_count())
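Several of the defaults above are sentinel values (0 -> all CPUs, None -> no seeding, 1/0 flags). A minimal sketch of how such arguments might be normalized follows. The helper names are hypothetical, the rule that at least one target must stay enabled is an assumption, and only the standard random module is seeded here; the real script presumably also seeds numpy and torch.

```python
import multiprocessing
import random


def resolve_workers(workers=0):
    """Return the worker count, treating 0 as 'use every available CPU'."""
    if workers < 0:
        raise ValueError("workers must be >= 0")
    return workers if workers > 0 else multiprocessing.cpu_count()


def enabled_targets(use_malicious_labels=1, use_count_labels=1, use_tag_labels=1):
    """Translate the 1/0 flags into the list of active target names."""
    flags = {
        "malware": use_malicious_labels,
        "count": use_count_labels,
        "tags": use_tag_labels,
    }
    targets = [name for name, flag in flags.items() if int(flag) != 0]
    if not targets:
        # Assumption: training without any target makes no sense.
        raise ValueError("at least one target must be enabled")
    return targets


def seed_everything(random_seed=None):
    """Seed the stdlib RNG when a seed is given; return whether seeding happened."""
    if random_seed is None:
        return False
    random.seed(random_seed)
    # A full version would also seed numpy and torch here (not done in this sketch).
    return True


print(resolve_workers(0) >= 1)
print(enabled_targets(1, 0, 1))
```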
__main__ (main) - Start baker so that the script can be run from the command line, with function names and parameters exposed as the command line interface using optparse-style options
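To illustrate the idea of exposing a function's parameters as command line options, here is a stdlib-only stand-in built on argparse and inspect. It does not use baker and is not how baker works internally; build_parser, run and the shortened train_network signature are hypothetical and serve only to show the mapping from parameters to options.

```python
import argparse
import inspect


def build_parser(func):
    """Create a parser whose arguments mirror the parameters of func."""
    parser = argparse.ArgumentParser(prog=func.__name__, description=func.__doc__)
    for name, param in inspect.signature(func).parameters.items():
        if param.default is inspect.Parameter.empty:
            # Parameters without a default become required positionals.
            parser.add_argument(name)
        else:
            # Parameters with a default become --options typed like the default.
            arg_type = type(param.default) if param.default is not None else str
            parser.add_argument(f"--{name}", type=arg_type, default=param.default)
    return parser


def run(func, argv=None):
    """Parse argv (sys.argv[1:] when None) and call func with the parsed values."""
    args = build_parser(func).parse_args(argv)
    return func(**vars(args))


def train_network(ds_path, net_type="mtje", batch_size=8192, epochs=10):
    """Hypothetical stand-in that only echoes the arguments it received."""
    return {
        "ds_path": ds_path,
        "net_type": net_type,
        "batch_size": batch_size,
        "epochs": epochs,
    }


# Equivalent to a command line such as: train.py data/ --epochs 3
result = run(train_network, ["data/", "--epochs", "3"])
print(result)
```

In the actual script the same effect comes from decorating train_network as a baker command and calling baker.run() in the main block.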