# config.ini

Tool configuration file. Modify the values inside as needed.
## Section general

- `device`: desired device to train the model on, e.g. `cuda:0` if a GPU is available, otherwise `cpu`
- `workers`: number of workers to be used (if 0, the current system CPU count is used)
- `runs`: number of training runs to perform
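As a concrete reference, a minimal `[general]` block might look like the following sketch (the values are illustrative, not the repository's defaults):

```ini
[general]
; e.g. 'cuda:0' with a GPU, 'cpu' otherwise
device = cuda:0
; 0 -> use the current system CPU count
workers = 0
runs = 5
```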
## Section sorel20mDataset

- `training_n_samples`: maximum number of training data samples to use (if -1, all are taken)
- `validation_n_samples`: maximum number of validation data samples to use (if -1, all are taken)
- `test_n_samples`: maximum number of test data samples to use (if -1, all are taken)
- `validation_test_split`: (should not be changed) timestamp that separates the validation data (used to check convergence/overfitting) from the test data (used to assess final performance)
- `train_validation_split`: (should not be changed) timestamp that separates the training data from the validation data
- `total_training_samples`: (should not be changed) total number of training samples available in the original Sorel20M dataset
- `total_validation_samples`: (should not be changed) total number of validation samples available in the original Sorel20M dataset
- `total_test_samples`: (should not be changed) total number of test samples available in the original Sorel20M dataset
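For reference, a sketch of the `[sorel20mDataset]` block: the `-1` values select all available samples, while the split timestamps and totals shown below follow the original SOREL-20M code and are included only to illustrate the expected format (keep the values shipped with the repository's own config.ini):

```ini
[sorel20mDataset]
; -1 -> take all available samples
training_n_samples = -1
validation_n_samples = -1
test_n_samples = -1
; the following values should not be changed
validation_test_split = 1561939200.0
train_validation_split = 1547279640.0
total_training_samples = 12699013
total_validation_samples = 2495822
total_test_samples = 4195042
```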
## Section aloha

- `batch_size`: how many samples per batch to load
- `epochs`: how many epochs to train for
- `use_malicious_labels`: whether or not (1/0) to use malware/benignware labels as a target
- `use_count_labels`: whether or not (1/0) to use the counts as an additional target
- `use_tag_labels`: whether or not (1/0) to use the tags as additional targets
- `layer_sizes`: sizes (and number) of the aloha net initial linear layers. Examples:
  - `[512,512,128]`: 3 initial layers (before the task branches) with sizes 512, 512 and 128, respectively
  - `[512,256]`: 2 initial layers (before the task branches) with sizes 512 and 256, respectively
- `dropout_p`: dropout probability between the first aloha net layers
- `activation_function`: activation function between the first aloha net layers. Possible values:
  - `elu`: Exponential Linear Unit activation function
  - `leakyRelu`: leaky ReLU activation function
  - `pRelu`: parametric ReLU activation function (better used with weight decay = 0)
  - `relu`: Rectified Linear Unit activation function
- `normalization_function`: normalization function between the first aloha net layers. Possible values:
  - `layer_norm`: the torch.nn.LayerNorm function
  - `batch_norm`: the torch.nn.BatchNorm1d function
- `loss_weights`: label weights used during loss calculation (note: only the weights corresponding to enabled labels are used). Example: `{'malware': 1.0, 'count': 0.1, 'tags': 1.0}`
- `optimizer`: optimizer to use during training. Possible values:
  - `adam`: Adam algorithm
  - `sgd`: stochastic gradient descent
- `lr`: learning rate to use during training
- `momentum`: momentum to use during training with the `sgd` optimizer
- `weight_decay`: weight decay (L2 penalty) to use with the selected optimizer
- `gen_type`: generator type. Possible values:
  - `base`: basic generator (from the original SOREL-20M code), modified to work with the pre-processed dataset
  - `alt1`: alternative generator 1. Inspired by the 'index select' version from https://discuss.pytorch.org/t/dataloader-much-slower-than-manual-batching/27014/6, it uses a new dataloader class, FastTensorDataloader, to process tabular data; it was modified from the original version available at the link above to work with the pre-processed dataset (numpy memmap) and with multiple workers (in multiprocessing)
  - `alt2`: alternative generator 2. Inspired by the 'shuffle in-place' version from the same discussion, it uses the same FastTensorDataloader class with the same modifications as `alt1`
  - `alt3`: alternative generator 3. Uses a FastTensorDataloader which (if workers > 1) asynchronously loads the dataset into memory in randomly chosen chunks; the chunks are concatenated into a 'chunk aggregate', the data inside the aggregate is shuffled, and batches are then extracted from it. Sample shuffling is therefore more localised, but loading speed is greatly increased
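Putting the options above together, a hypothetical `[aloha]` block could look like this sketch (values are examples, not the shipped defaults):

```ini
[aloha]
batch_size = 8192
epochs = 10
use_malicious_labels = 1
use_count_labels = 1
use_tag_labels = 1
layer_sizes = [512,512,128]
dropout_p = 0.05
activation_function = elu
normalization_function = batch_norm
; only the weights of enabled labels are used
loss_weights = {'malware': 1.0, 'count': 0.1, 'tags': 1.0}
optimizer = adam
lr = 0.001
; momentum is only read when optimizer = sgd
momentum = 0.9
weight_decay = 0.0
gen_type = base
```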
## Section mtje

- `batch_size`: how many samples per batch to load
- `epochs`: how many epochs to train for
- `use_malicious_labels`: whether or not (1/0) to use malware/benignware labels as a target
- `use_count_labels`: whether or not (1/0) to use the counts as an additional target
- `layer_sizes`: sizes (and number) of the mtje net initial linear layers. Examples:
  - `[512,512,128]`: 3 initial layers (before the task branches) with sizes 512, 512 and 128, respectively
  - `[512,256]`: 2 initial layers (before the task branches) with sizes 512 and 256, respectively
- `dropout_p`: dropout probability between the first mtje net layers
- `activation_function`: activation function between the first mtje net layers. Possible values:
  - `elu`: Exponential Linear Unit activation function
  - `leakyRelu`: leaky ReLU activation function
  - `pRelu`: parametric ReLU activation function (better used with weight decay = 0)
  - `relu`: Rectified Linear Unit activation function
- `normalization_function`: normalization function between the first mtje net layers. Possible values:
  - `layer_norm`: the torch.nn.LayerNorm function
  - `batch_norm`: the torch.nn.BatchNorm1d function
- `loss_weights`: label weights used during loss calculation (note: only the weights corresponding to enabled labels are used). Example: `{'malware': 1.0, 'count': 0.1, 'tags': 1.0}`
- `optimizer`: optimizer to use during training. Possible values:
  - `adam`: Adam algorithm
  - `sgd`: stochastic gradient descent
- `lr`: learning rate to use during training
- `momentum`: momentum to use during training with the `sgd` optimizer
- `weight_decay`: weight decay (L2 penalty) to use with the selected optimizer
- `gen_type`: generator type. Possible values:
  - `base`: basic generator (from the original SOREL-20M code), modified to work with the pre-processed dataset
  - `alt1`: alternative generator 1. Inspired by the 'index select' version from https://discuss.pytorch.org/t/dataloader-much-slower-than-manual-batching/27014/6, it uses a new dataloader class, FastTensorDataloader, to process tabular data; it was modified from the original version available at the link above to work with the pre-processed dataset (numpy memmap) and with multiple workers (in multiprocessing)
  - `alt2`: alternative generator 2. Inspired by the 'shuffle in-place' version from the same discussion, it uses the same FastTensorDataloader class with the same modifications as `alt1`
  - `alt3`: alternative generator 3. Uses a FastTensorDataloader which (if workers > 1) asynchronously loads the dataset into memory in randomly chosen chunks; the chunks are concatenated into a 'chunk aggregate', the data inside the aggregate is shuffled, and batches are then extracted from it. Sample shuffling is therefore more localised, but loading speed is greatly increased
- `similarity_measure`: similarity measure used to evaluate distances in the joint embedding space. Possible values:
  - `dot`: dot product between vectors in the embedding space (the similarity measure used in the mtje paper)
  - `cosine`: cosine similarity between vectors in the embedding space
  - `pairwise_distance`: computes the pairwise distance and then transforms it into a similarity measure (between 0 and 1)
- `pairwise_distance_to_similarity_function`: (only if `pairwise_distance` is selected as `similarity_measure`) distance-to-similarity function to use. These functions map values in R+ (the positive reals) to the [0,1] interval, as written out below. Possible values:
  - `exp`: computes e^(-x/a)
  - `inv`: computes 1/(1+x/a)
  - `inv_pow`: computes 1/(1+(x^2)/a)

  where `a` is a multiplicative factor (see `pairwise_a`)
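For reference, the three maps written explicitly (notation mine: `x >= 0` is the pairwise distance and `a > 0` is the `pairwise_a` factor):

```latex
\mathrm{sim}_{\text{exp}}(x) = e^{-x/a}, \qquad
\mathrm{sim}_{\text{inv}}(x) = \frac{1}{1 + x/a}, \qquad
\mathrm{sim}_{\text{inv\_pow}}(x) = \frac{1}{1 + x^{2}/a}
```

All three equal 1 at x = 0 and decrease monotonically towards 0 as the distance grows; larger values of `a` slow the decay.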
- `pairwise_a`: (only if `pairwise_distance` is selected as `similarity_measure`) the multiplicative factor `a` of the distance-to-similarity function
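An illustrative `[mtje]` block combining the options above (values are examples, not the shipped defaults):

```ini
[mtje]
batch_size = 8192
epochs = 10
use_malicious_labels = 1
use_count_labels = 1
layer_sizes = [512,512,128]
dropout_p = 0.05
activation_function = elu
normalization_function = batch_norm
loss_weights = {'malware': 1.0, 'count': 0.1, 'tags': 1.0}
optimizer = adam
lr = 0.001
momentum = 0.9
weight_decay = 0.0
gen_type = base
similarity_measure = dot
; the two pairwise_* keys are only read when
; similarity_measure = pairwise_distance
pairwise_distance_to_similarity_function = exp
pairwise_a = 1.0
```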
## Section freshDataset

- `families`: Malware Bazaar families of interest. Note: it is recommended to specify more families than `number_of_families`, since Malware Bazaar may not have `amount_each` samples for some of them. The families are considered in order
- `number_of_families`: number of families to consider; the ones in excess, in order, are ignored
- `amount_each`: number of samples to retrieve from Malware Bazaar for each malware family
- `n_queries`: number of query samples per family to consider
- `min_n_anchor_samples`: minimum number of anchor samples to use, per family
- `max_n_anchor_samples`: maximum number of anchor samples to use, per family
- `n_evaluations`: number of evaluations to perform (for uncertainty estimates)
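A hypothetical `[freshDataset]` block (the family names are examples of Malware Bazaar tags, not a recommendation; note that more families than `number_of_families` are listed, as suggested above):

```ini
[freshDataset]
families = ['emotet', 'agenttesla', 'formbook', 'mirai', 'njrat', 'lokibot', 'remcos']
number_of_families = 5
amount_each = 100
n_queries = 20
min_n_anchor_samples = 1
max_n_anchor_samples = 10
n_evaluations = 15
```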
## Section familyClassifier

- `epochs`: how many epochs to train the family classifier for
- `train_split_proportion`: proportion of the whole fresh dataset to use for training the family classifier
- `valid_split_proportion`: proportion of the whole fresh dataset to use for validating the family classifier
- `test_split_proportion`: proportion of the whole fresh dataset to use for testing the family classifier
- `batch_size`: how many samples per batch to load for the family classifier
- `optimizer`: optimizer to use during training. Possible values:
  - `adam`: Adam algorithm
  - `sgd`: stochastic gradient descent
- `lr`: learning rate to use during training
- `momentum`: momentum to use during training with the `sgd` optimizer
- `weight_decay`: weight decay (L2 penalty) to use with the selected optimizer
- `layer_sizes`: sizes (and number) of the family classifier output head linear layers. Examples:
  - `[128,256,64]`: 3 layers with sizes 128, 256 and 64, respectively
  - `[128,64]`: 2 layers with sizes 128 and 64, respectively
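A sketch of the `[familyClassifier]` block (illustrative values; the three split proportions are assumed here to be relative weights over the whole fresh dataset):

```ini
[familyClassifier]
epochs = 25
; train/valid/test split, assumed to be relative proportions
train_split_proportion = 7
valid_split_proportion = 1
test_split_proportion = 2
batch_size = 250
optimizer = adam
lr = 0.001
momentum = 0.9
weight_decay = 0.0
layer_sizes = [128,64]
```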
## Section contrastiveLearning

- `epochs`: how many epochs to train the contrastive model for
- `train_split_proportion`: proportion of the whole fresh dataset to use for training the contrastive model
- `valid_split_proportion`: proportion of the whole fresh dataset to use for validating the contrastive model
- `test_split_proportion`: proportion of the whole fresh dataset to use for testing the contrastive model
- `batch_size`: how many samples per batch to load for the contrastive model
- `optimizer`: optimizer to use during training. Possible values:
  - `adam`: Adam algorithm
  - `sgd`: stochastic gradient descent
- `lr`: learning rate to use during training
- `momentum`: momentum to use during training with the `sgd` optimizer
- `weight_decay`: weight decay (L2 penalty) to use with the selected optimizer
- `hard`: online triplet mining function to use when training the model with contrastive learning. Possible values:
  - `0`: the batch_all_triplet_loss online triplet mining function
  - `1`: the batch_hard_triplet_loss online triplet mining function
- `margin`: margin to use in the triplet loss
- `squared`: whether to use the squared Euclidean norm (1) or the plain Euclidean norm (0) as the distance metric
- `rank_size`: size of the produced rankings
- `knn_k_min`: minimum number of nearest neighbours to consider when classifying samples with the k-NN algorithm (only odd numbers between `knn_k_min` and `knn_k_max`, inclusive, are used)
- `knn_k_max`: maximum number of nearest neighbours to consider when classifying samples with the k-NN algorithm (only odd numbers between `knn_k_min` and `knn_k_max`, inclusive, are used)
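Finally, a hypothetical `[contrastiveLearning]` block (illustrative values only):

```ini
[contrastiveLearning]
epochs = 25
train_split_proportion = 7
valid_split_proportion = 1
test_split_proportion = 2
batch_size = 250
optimizer = adam
lr = 0.001
momentum = 0.9
weight_decay = 0.0
; 1 -> batch_hard_triplet_loss, 0 -> batch_all_triplet_loss
hard = 1
margin = 0.2
; 1 -> squared Euclidean distance, 0 -> plain Euclidean distance
squared = 1
rank_size = 20
knn_k_min = 1
knn_k_max = 11
```

With `knn_k_min = 1` and `knn_k_max = 11`, for example, the k-NN evaluation would use k in {1, 3, 5, 7, 9, 11}.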