# config.ini

Tool configuration file. Modify the values inside as needed.
## Section general

- `device`: desired device to train the model on, e.g. `cuda:0` if a GPU is available, otherwise `cpu`
- `workers`: number of workers to be used (if 0, the current system CPU count is used)
- `runs`: number of training runs to perform
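As a concrete reference, a minimal `[general]` block might look like the following sketch (the values are illustrative, not the repository's defaults):

```ini
[general]
; e.g. 'cuda:0' with a GPU, 'cpu' otherwise
device = cuda:0
; 0 -> use the current system CPU count
workers = 0
runs = 5
```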
## Section sorel20mDataset

- `training_n_samples`: maximum number of training data samples to use (if -1, all are taken)
- `validation_n_samples`: maximum number of validation data samples to use (if -1, all are taken)
- `test_n_samples`: maximum number of test data samples to use (if -1, all are taken)
- `validation_test_split`: (should not be changed) timestamp that separates the validation data (used to check convergence/overfitting) from the test data (used to assess final performance)
- `train_validation_split`: (should not be changed) timestamp that separates the training data from the validation data
- `total_training_samples`: (should not be changed) total number of training samples available in the original Sorel20M dataset
- `total_validation_samples`: (should not be changed) total number of validation samples available in the original Sorel20M dataset
- `total_test_samples`: (should not be changed) total number of test samples available in the original Sorel20M dataset
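For reference, a sketch of the `[sorel20mDataset]` block: the `-1` values select all available samples, while the split timestamps and totals shown below follow the original SOREL-20M code and are included only to illustrate the expected format (keep the values shipped with the repository's own config.ini):

```ini
[sorel20mDataset]
; -1 -> take all available samples
training_n_samples = -1
validation_n_samples = -1
test_n_samples = -1
; the following values should not be changed
validation_test_split = 1561939200.0
train_validation_split = 1547279640.0
total_training_samples = 12699013
total_validation_samples = 2495822
total_test_samples = 4195042
```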
## Section aloha

- `batch_size`: how many samples per batch to load
- `epochs`: how many epochs to train for
- `use_malicious_labels`: whether or not (1/0) to use malware/benignware labels as a target
- `use_count_labels`: whether or not (1/0) to use the counts as an additional target
- `use_tag_labels`: whether or not (1/0) to use the tags as additional targets
- `layer_sizes`: sizes (and number) of the aloha net initial linear layers. Examples:
  - `[512,512,128]`: 3 initial layers (before the task branches) with sizes 512, 512 and 128, respectively
  - `[512,256]`: 2 initial layers (before the task branches) with sizes 512 and 256, respectively
- `dropout_p`: dropout probability between the first aloha net layers
- `activation_function`: activation function between the first aloha net layers. Possible values:
  - `elu`: Exponential Linear Unit activation function
  - `leakyRelu`: leaky ReLU activation function
  - `pRelu`: parametric ReLU activation function (better used with weight decay = 0)
  - `relu`: Rectified Linear Unit activation function
- `normalization_function`: normalization function between the first aloha net layers. Possible values:
  - `layer_norm`: the torch.nn.LayerNorm function
  - `batch_norm`: the torch.nn.BatchNorm1d function
- `loss_weights`: label weights used during loss calculation (note: only the weights corresponding to enabled labels are used). Example: `{'malware': 1.0, 'count': 0.1, 'tags': 1.0}`
- `optimizer`: optimizer to use during training. Possible values:
  - `adam`: Adam algorithm
  - `sgd`: stochastic gradient descent
- `lr`: learning rate to use during training
- `momentum`: momentum to use during training with the `sgd` optimizer
- `weight_decay`: weight decay (L2 penalty) to use with the selected optimizer
- `gen_type`: generator type. Possible values:
  - `base`: basic generator (from the original SOREL-20M code), modified to work with the pre-processed dataset
  - `alt1`: alternative generator 1. Inspired by the 'index select' version from https://discuss.pytorch.org/t/dataloader-much-slower-than-manual-batching/27014/6, it uses a new dataloader class, FastTensorDataloader, to process tabular data; it was modified from the original version available at the link above to work with the pre-processed dataset (numpy memmap) and with multiple workers (in multiprocessing)
  - `alt2`: alternative generator 2. Inspired by the 'shuffle in-place' version from the same discussion, it uses the same FastTensorDataloader class with the same modifications as `alt1`
  - `alt3`: alternative generator 3. Uses a FastTensorDataloader which (if workers > 1) asynchronously loads the dataset into memory in randomly chosen chunks; the chunks are concatenated into a 'chunk aggregate', the data inside the aggregate is shuffled, and batches are then extracted from it. Sample shuffling is therefore more localised, but loading speed is greatly increased
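Putting the options above together, a hypothetical `[aloha]` block could look like this sketch (values are examples, not the shipped defaults):

```ini
[aloha]
batch_size = 8192
epochs = 10
use_malicious_labels = 1
use_count_labels = 1
use_tag_labels = 1
layer_sizes = [512,512,128]
dropout_p = 0.05
activation_function = elu
normalization_function = batch_norm
; only the weights of enabled labels are used
loss_weights = {'malware': 1.0, 'count': 0.1, 'tags': 1.0}
optimizer = adam
lr = 0.001
; momentum is only read when optimizer = sgd
momentum = 0.9
weight_decay = 0.0
gen_type = base
```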
## Section mtje

- `batch_size`: how many samples per batch to load
- `epochs`: how many epochs to train for
- `use_malicious_labels`: whether or not (1/0) to use malware/benignware labels as a target
- `use_count_labels`: whether or not (1/0) to use the counts as an additional target
- `layer_sizes`: sizes (and number) of the mtje net initial linear layers. Examples:
  - `[512,512,128]`: 3 initial layers (before the task branches) with sizes 512, 512 and 128, respectively
  - `[512,256]`: 2 initial layers (before the task branches) with sizes 512 and 256, respectively
- `dropout_p`: dropout probability between the first mtje net layers
- `activation_function`: activation function between the first mtje net layers. Possible values:
  - `elu`: Exponential Linear Unit activation function
  - `leakyRelu`: leaky ReLU activation function
  - `pRelu`: parametric ReLU activation function (better used with weight decay = 0)
  - `relu`: Rectified Linear Unit activation function
- `normalization_function`: normalization function between the first mtje net layers. Possible values:
  - `layer_norm`: the torch.nn.LayerNorm function
  - `batch_norm`: the torch.nn.BatchNorm1d function
- `loss_weights`: label weights used during loss calculation (note: only the weights corresponding to enabled labels are used). Example: `{'malware': 1.0, 'count': 0.1, 'tags': 1.0}`
- `optimizer`: optimizer to use during training. Possible values:
  - `adam`: Adam algorithm
  - `sgd`: stochastic gradient descent
- `lr`: learning rate to use during training
- `momentum`: momentum to use during training with the `sgd` optimizer
- `weight_decay`: weight decay (L2 penalty) to use with the selected optimizer
- `gen_type`: generator type. Possible values:
  - `base`: basic generator (from the original SOREL-20M code), modified to work with the pre-processed dataset
  - `alt1`: alternative generator 1. Inspired by the 'index select' version from https://discuss.pytorch.org/t/dataloader-much-slower-than-manual-batching/27014/6, it uses a new dataloader class, FastTensorDataloader, to process tabular data; it was modified from the original version available at the link above to work with the pre-processed dataset (numpy memmap) and with multiple workers (in multiprocessing)
  - `alt2`: alternative generator 2. Inspired by the 'shuffle in-place' version from the same discussion, it uses the same FastTensorDataloader class with the same modifications as `alt1`
  - `alt3`: alternative generator 3. Uses a FastTensorDataloader which (if workers > 1) asynchronously loads the dataset into memory in randomly chosen chunks; the chunks are concatenated into a 'chunk aggregate', the data inside the aggregate is shuffled, and batches are then extracted from it. Sample shuffling is therefore more localised, but loading speed is greatly increased
- `similarity_measure`: similarity measure used to evaluate distances in the joint embedding space. Possible values:
  - `dot`: dot product between vectors in the embedding space (the similarity measure used in the mtje paper)
  - `cosine`: cosine similarity between vectors in the embedding space
  - `pairwise_distance`: computes the pairwise distance and then transforms it into a similarity measure (between 0 and 1)
- `pairwise_distance_to_similarity_function`: (only if `pairwise_distance` is selected as `similarity_measure`) distance-to-similarity function to use. These functions map values in R+ (the positive reals) to the [0,1] interval, as written out below. Possible values:
  - `exp`: computes e^(-x/a)
  - `inv`: computes 1/(1+x/a)
  - `inv_pow`: computes 1/(1+(x^2)/a)

  where `a` is a multiplicative factor (see `pairwise_a`)
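For reference, the three maps written explicitly (notation mine: `x >= 0` is the pairwise distance and `a > 0` is the `pairwise_a` factor):

```latex
\mathrm{sim}_{\text{exp}}(x) = e^{-x/a}, \qquad
\mathrm{sim}_{\text{inv}}(x) = \frac{1}{1 + x/a}, \qquad
\mathrm{sim}_{\text{inv\_pow}}(x) = \frac{1}{1 + x^{2}/a}
```

All three equal 1 at x = 0 and decrease monotonically towards 0 as the distance grows; larger values of `a` slow the decay.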
- `pairwise_a`: (only if `pairwise_distance` is selected as `similarity_measure`) the multiplicative factor `a` of the distance-to-similarity function
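An illustrative `[mtje]` block combining the options above (values are examples, not the shipped defaults):

```ini
[mtje]
batch_size = 8192
epochs = 10
use_malicious_labels = 1
use_count_labels = 1
layer_sizes = [512,512,128]
dropout_p = 0.05
activation_function = elu
normalization_function = batch_norm
loss_weights = {'malware': 1.0, 'count': 0.1, 'tags': 1.0}
optimizer = adam
lr = 0.001
momentum = 0.9
weight_decay = 0.0
gen_type = base
similarity_measure = dot
; the two pairwise_* keys are only read when
; similarity_measure = pairwise_distance
pairwise_distance_to_similarity_function = exp
pairwise_a = 1.0
```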
## Section freshDataset

- `families`: Malware Bazaar families of interest. Note: it is recommended to specify more families than `number_of_families`, since Malware Bazaar may not have `amount_each` samples for some of them. The families are considered in order
- `number_of_families`: number of families to consider; the ones in excess, in order, are ignored
- `amount_each`: number of samples to retrieve from Malware Bazaar for each malware family
- `n_queries`: number of query samples per family to consider
- `min_n_anchor_samples`: minimum number of anchor samples to use, per family
- `max_n_anchor_samples`: maximum number of anchor samples to use, per family
- `n_evaluations`: number of evaluations to perform (for uncertainty estimates)
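A hypothetical `[freshDataset]` block (the family names are examples of Malware Bazaar tags, not a recommendation; note that more families than `number_of_families` are listed, as suggested above):

```ini
[freshDataset]
families = ['emotet', 'agenttesla', 'formbook', 'mirai', 'njrat', 'lokibot', 'remcos']
number_of_families = 5
amount_each = 100
n_queries = 20
min_n_anchor_samples = 1
max_n_anchor_samples = 10
n_evaluations = 15
```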
## Section familyClassifier

- `epochs`: how many epochs to train the family classifier for
- `train_split_proportion`: proportion of the whole fresh dataset to use for training the family classifier
- `valid_split_proportion`: proportion of the whole fresh dataset to use for validating the family classifier
- `test_split_proportion`: proportion of the whole fresh dataset to use for testing the family classifier
- `batch_size`: how many samples per batch to load for the family classifier
- `optimizer`: optimizer to use during training. Possible values:
  - `adam`: Adam algorithm
  - `sgd`: stochastic gradient descent
- `lr`: learning rate to use during training
- `momentum`: momentum to use during training with the `sgd` optimizer
- `weight_decay`: weight decay (L2 penalty) to use with the selected optimizer
- `layer_sizes`: sizes (and number) of the family classifier output head linear layers. Examples:
  - `[128,256,64]`: 3 layers with sizes 128, 256 and 64, respectively
  - `[128,64]`: 2 layers with sizes 128 and 64, respectively
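A sketch of the `[familyClassifier]` block (illustrative values; the three split proportions are assumed here to be relative weights over the whole fresh dataset):

```ini
[familyClassifier]
epochs = 25
; train/valid/test split, assumed to be relative proportions
train_split_proportion = 7
valid_split_proportion = 1
test_split_proportion = 2
batch_size = 250
optimizer = adam
lr = 0.001
momentum = 0.9
weight_decay = 0.0
layer_sizes = [128,64]
```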
## Section contrastiveLearning

- `epochs`: how many epochs to train the contrastive model for
- `train_split_proportion`: proportion of the whole fresh dataset to use for training the contrastive model
- `valid_split_proportion`: proportion of the whole fresh dataset to use for validating the contrastive model
- `test_split_proportion`: proportion of the whole fresh dataset to use for testing the contrastive model
- `batch_size`: how many samples per batch to load for the contrastive model
- `optimizer`: optimizer to use during training. Possible values:
  - `adam`: Adam algorithm
  - `sgd`: stochastic gradient descent
- `lr`: learning rate to use during training
- `momentum`: momentum to use during training with the `sgd` optimizer
- `weight_decay`: weight decay (L2 penalty) to use with the selected optimizer
- `hard`: online triplet mining function to use when training the model with contrastive learning. Possible values:
  - `0`: the batch_all_triplet_loss online triplet mining function
  - `1`: the batch_hard_triplet_loss online triplet mining function
- `margin`: margin to use in the triplet loss
- `squared`: whether to use the squared Euclidean norm (1) or the plain Euclidean norm (0) as the distance metric
- `rank_size`: size of the produced rankings
- `knn_k_min`: minimum number of nearest neighbours to consider when classifying samples with the k-NN algorithm (only odd numbers between `knn_k_min` and `knn_k_max`, inclusive, are used)
- `knn_k_max`: maximum number of nearest neighbours to consider when classifying samples with the k-NN algorithm (only odd numbers between `knn_k_min` and `knn_k_max`, inclusive, are used)
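Finally, a hypothetical `[contrastiveLearning]` block (illustrative values only):

```ini
[contrastiveLearning]
epochs = 25
train_split_proportion = 7
valid_split_proportion = 1
test_split_proportion = 2
batch_size = 250
optimizer = adam
lr = 0.001
momentum = 0.9
weight_decay = 0.0
; 1 -> batch_hard_triplet_loss, 0 -> batch_all_triplet_loss
hard = 1
margin = 0.2
; 1 -> squared Euclidean distance, 0 -> plain Euclidean distance
squared = 1
rank_size = 20
knn_k_min = 1
knn_k_max = 11
```

With `knn_k_min = 1` and `knn_k_max = 11`, for example, the k-NN evaluation would use k in {1, 3, 5, 7, 9, 11}.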