generators_alt3.py - cmikke97/Automatic-Malware-Signature-Generation GitHub Wiki

Imported Modules

  • from .dataset_alt import Dataset

Classes and functions

get_chunks(tensors, chunk_indices, chunk_size, last_chunk_size, n_chunks, shuffle) (function) - Get 'n_chunks' chunks of 'chunk_size' consecutive samples from the dataset and concatenate them into a chunk aggregate ('chunk_agg'). The chunks to retrieve are specified by the provided list of chunk indices.

  • tensors (arg) - Dataset tensors -> S (shas, optional), X (features) and y (labels)
  • chunk_indices (arg) - List containing the indices of the chunks to get from the dataset
  • chunk_size (arg) - Size (in # of samples) of a single chunk of data
  • last_chunk_size (arg) - Size (in # of samples) of the last chunk of data
  • n_chunks (arg) - Number of chunks of data to retrieve from the dataset
  • shuffle (arg) - Whether to shuffle the data at each iteration or not (default: False)
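The chunk-gathering logic can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: NumPy arrays stand in for the dataset tensors, and the assumption that the last chunk of the dataset is the one at the highest chunk index is a guess made for the example.

```python
import numpy as np

def get_chunks(tensors, chunk_indices, chunk_size, last_chunk_size, n_chunks, shuffle=False):
    # Number of chunks the whole dataset splits into (the final one may be shorter)
    total_chunks = (len(tensors[0]) + chunk_size - 1) // chunk_size
    chunk_agg = []
    for t in tensors:
        parts = []
        for idx in chunk_indices[:n_chunks]:
            start = idx * chunk_size
            # the last chunk of the dataset may hold fewer samples
            size = last_chunk_size if idx == total_chunks - 1 else chunk_size
            parts.append(t[start:start + size])
        chunk_agg.append(np.concatenate(parts))
    if shuffle:
        # shuffle all tensors of the aggregate with the same permutation
        perm = np.random.permutation(len(chunk_agg[0]))
        chunk_agg = [a[perm] for a in chunk_agg]
    return chunk_agg

X = np.arange(10)
y = np.arange(10) * 2
# grab chunk 0 (samples 0..3) and chunk 2 (the last, shorter chunk: samples 8..9)
agg = get_chunks((X, y), [0, 2], chunk_size=4, last_chunk_size=2, n_chunks=2)
# agg[0] -> [0, 1, 2, 3, 8, 9]
```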

get_batch(chunk_agg, batch_size, i, return_malicious, return_counts, return_tags) (function) - Get a batch of data from a chunk aggregate.

  • chunk_agg (arg) - Chunk aggregate from get_chunks function
  • batch_size (arg) - How many samples to load
  • i (arg) - Current batch index
  • return_malicious (arg) - Whether to return the malicious label for the data points or not (default: False)
  • return_counts (arg) - Whether to return the counts for the data points or not (default: False)
  • return_tags (arg) - Whether to return the tags for the data points or not (default: False)
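A rough sketch of batch extraction, under two assumptions that are not stated in the source: that the features and labels are the last two entries of the chunk aggregate (with the optional shas before them), and that the labels tensor packs the malicious flag in column 0, the count in column 1, and the tags in the remaining columns. The actual layout is defined by `dataset_alt.Dataset`.

```python
import numpy as np

def get_batch(chunk_agg, batch_size, i,
              return_malicious=False, return_counts=False, return_tags=False):
    # Slice batch 'i' out of the (already shuffled) chunk aggregate
    start = i * batch_size
    X, y = chunk_agg[-2], chunk_agg[-1]   # features and labels; shas, if any, come first
    features = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]
    labels = {}
    if return_malicious:
        labels['malware'] = y_batch[:, 0]   # assumed column layout
    if return_counts:
        labels['count'] = y_batch[:, 1]
    if return_tags:
        labels['tags'] = y_batch[:, 2:]
    return features, labels

X = np.arange(12).reshape(6, 2)
y = np.arange(18).reshape(6, 3)
features, labels = get_batch((X, y), batch_size=2, i=1,
                             return_malicious=True, return_tags=True)
# batch 1 covers rows 2..3, so labels['malware'] -> [6, 9]
```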

get_chunks_unpack(args) (function) - Pass through function for unpacking get_chunks arguments.

  • args (arg) - Arguments dictionary
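A pass-through wrapper like this is typically needed because `Pool.map` hands each worker a single argument; the sketch below illustrates the pattern with a placeholder `get_chunks` (the real one takes more parameters).

```python
from multiprocessing.pool import ThreadPool

def get_chunks(tensors, chunk_indices, chunk_size):
    # placeholder standing in for the real get_chunks
    return [tensors[i * chunk_size:(i + 1) * chunk_size] for i in chunk_indices]

def get_chunks_unpack(args):
    # Pool.map passes one argument per call, so unpack the dict here
    return get_chunks(**args)

data = list(range(10))
with ThreadPool(2) as pool:
    results = pool.map(get_chunks_unpack, [
        {'tensors': data, 'chunk_indices': [0], 'chunk_size': 5},
        {'tensors': data, 'chunk_indices': [1], 'chunk_size': 5},
    ])
# results -> [[[0, 1, 2, 3, 4]], [[5, 6, 7, 8, 9]]]
```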

FastTensorDataLoader (class) - A DataLoader-like object for a set of tensors that can be much faster than TensorDataset + DataLoader, because the standard DataLoader grabs individual indices of the dataset and calls `cat` on them (slow).

It asynchronously (if num_workers > 1) loads the dataset into memory in randomly chosen chunks, which are concatenated together to form a 'chunk aggregate'; the data inside a chunk aggregate is then shuffled.

Finally, batches of data are extracted from the chunk aggregate. Shuffling is therefore more localised, but loading speed is greatly increased.

  • __init__(self, *tensors, batch_size, chunk_size, chunks, shuffle, num_workers, use_malicious_labels, use_count_labels, use_tag_labels) (member function) - Initialize FastTensorDataLoader class.
    • *tensors (arg) - Dataset Tensors. Must all have the same length along dimension 0
    • batch_size (arg) - Size of the batches to load (default: BATCH_SIZE)
    • chunk_size (arg) - Size (in # of samples) of a single chunk of data (default: CHUNK_SIZE)
    • chunks (arg) - Number of chunks of data to retrieve from the dataset (default: CHUNKS)
    • shuffle (arg) - If True, shuffle the data whenever an iterator is created out of this object (default: False)
    • num_workers (arg) - How many workers (threads) to use for data loading (default: None -> 1 worker)
    • use_malicious_labels (arg) - Whether to return the malicious label for the data points or not (default: False)
    • use_count_labels (arg) - Whether to return the counts for the data points or not (default: False)
    • use_tag_labels (arg) - Whether to return the tags for the data points or not (default: False)
  • __del__(self) (member function) - FastTensorDataLoader destructor.
  • __iter__(self) (member function) - Returns the FastTensorDataLoader (dataset iterator) itself.
  • __next__(self) (member function) - Get next batch of data.
  • __len__(self) (member function) - Get FastDataLoader length (number of batches).
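The chunk-aggregate idea behind the class can be condensed into the following single-worker sketch. It is a simplification under stated assumptions: NumPy arrays replace PyTorch tensors, `__iter__` is written as a generator (whereas the real class returns itself and serves batches from `__next__`), and the asynchronous thread-pool loading is omitted.

```python
import numpy as np

class FastTensorDataLoader:
    """Minimal single-worker sketch of the chunk-aggregate loading scheme."""

    def __init__(self, *tensors, batch_size=100, chunk_size=4, chunks=2, shuffle=False):
        assert all(len(t) == len(tensors[0]) for t in tensors)
        self.tensors = tensors
        self.batch_size = batch_size
        self.chunk_size = chunk_size
        self.chunks = chunks
        self.shuffle = shuffle
        self.n = len(tensors[0])
        self.total_chunks = (self.n + chunk_size - 1) // chunk_size

    def __iter__(self):
        # visit the dataset chunks in random order when shuffling
        order = (np.random.permutation(self.total_chunks) if self.shuffle
                 else np.arange(self.total_chunks))
        # gather 'chunks' chunks at a time into an aggregate, then batch it
        for g in range(0, self.total_chunks, self.chunks):
            parts = []
            for idx in order[g:g + self.chunks]:
                start = idx * self.chunk_size
                parts.append([t[start:start + self.chunk_size] for t in self.tensors])
            agg = [np.concatenate([p[k] for p in parts])
                   for k in range(len(self.tensors))]
            if self.shuffle:
                # shuffle only within the aggregate: localised shuffling
                perm = np.random.permutation(len(agg[0]))
                agg = [a[perm] for a in agg]
            for b in range(0, len(agg[0]), self.batch_size):
                yield tuple(a[b:b + self.batch_size] for a in agg)

    def __len__(self):
        # number of batches (approximate: short batches at aggregate borders)
        return (self.n + self.batch_size - 1) // self.batch_size

X = np.arange(10)
loader = FastTensorDataLoader(X, batch_size=3, chunk_size=4, chunks=2, shuffle=False)
batches = [b[0] for b in loader]
# without shuffling, concatenating the batches recovers the dataset in order
```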

GeneratorFactory (class) - Generator factory class.

  • __init__(self, ds_root, batch_size, chunk_size, chunks, mode, num_workers, n_samples, use_malicious_labels, use_count_labels, use_tag_labels, return_shas, shuffle) (member function) - Initialize generator factory class.
    • ds_root (arg) - Path of the directory where to find the pre-processed dataset (containing .dat files)
    • batch_size (arg) - How many samples per batch to load (default: BATCH_SIZE)
    • chunk_size (arg) - Size (in # of samples) of a single chunk of data (default: CHUNK_SIZE)
    • chunks (arg) - Number of chunks of data to retrieve from the dataset (default: CHUNKS)
    • mode (arg) - Mode of use of the dataset object (it may be 'train', 'validation' or 'test') (default: 'train')
    • num_workers (arg) - How many workers (threads) to use for data loading (default: max_workers)
    • n_samples (arg) - Number of samples to consider (used just to access the right pre-processed files) (default: None -> all)
    • use_malicious_labels (arg) - Whether to return the malicious label for the data points or not (default: False)
    • use_count_labels (arg) - Whether to return the counts for the data points or not (default: False)
    • use_tag_labels (arg) - Whether to return the tags for the data points or not (default: False)
    • return_shas (arg) - Whether to return the sha256 of the data points or not (default: False)
    • shuffle (arg) - Set to True to have the data reshuffled at every epoch (default: None -> if mode is 'train' then shuffle is set to True, otherwise it is set to False)
  • __call__(self) (member function) - Generator-factory call method.

get_generator(ds_root, batch_size, chunk_size, chunks, mode, num_workers, n_samples, use_malicious_labels, use_count_labels, use_tag_labels, return_shas, shuffle) (function) - Get generator based on the provided arguments.

  • ds_root (arg) - Path of the directory where to find the pre-processed dataset (containing .dat files)
  • batch_size (arg) - How many samples per batch to load (default: BATCH_SIZE)
  • chunk_size (arg) - Size (in # of samples) of a single chunk of data (default: CHUNK_SIZE)
  • chunks (arg) - Number of chunks of data to retrieve from the dataset (default: CHUNKS)
  • mode (arg) - Mode of use of the dataset object (may be 'train', 'validation' or 'test') (default: 'train')
  • num_workers (arg) - How many workers (threads) to use for data loading (if None -> set to current system cpu count) (default: None)
  • n_samples (arg) - Number of samples to consider (used just to access the right pre-processed files) (default: None -> all)
  • use_malicious_labels (arg) - Whether to return the malicious label for the data points or not (default: True)
  • use_count_labels (arg) - Whether to return the counts for the data points or not (default: True)
  • use_tag_labels (arg) - Whether to return the tags for the data points or not (default: True)
  • return_shas (arg) - Whether to return the sha256 of the data points or not (default: False)
  • shuffle (arg) - Set to True to have the data reshuffled at every epoch (default: None -> if mode is 'train' then shuffle is set to True, otherwise it is set to False)

