generators_alt3.py - cmikke97/Automatic-Malware-Signature-Generation GitHub Wiki
Imports:

- `from multiprocessing import cpu_count` - used to get the number of CPUs in the system (multiprocessing documentation)
- `from multiprocessing.pool import ThreadPool` - pool of worker threads to which jobs can be submitted (multiprocessing documentation)
- `import numpy as np` - the fundamental package for scientific computing with Python (numpy documentation)
- `import torch` - tensor library like NumPy, with strong GPU support (pytorch documentation)
- `from .dataset_alt import Dataset`
### `get_chunks(tensors, chunk_indices, chunk_size, last_chunk_size, n_chunks, shuffle)` (function)

Get `n_chunks` chunks of `chunk_size` consecutive samples from the dataset and concatenate them into a chunk aggregate (`chunk_agg`). The chunks to get are specified by the provided list of chunk indices.

Arguments:

- `tensors` - Dataset tensors -> S (shas, optional), X (features) and y (labels)
- `chunk_indices` - List containing the indices of the chunks to get from the dataset
- `chunk_size` - Size (in # of samples) of a single chunk of data
- `last_chunk_size` - Size (in # of samples) of the last chunk of data
- `n_chunks` - Number of chunks of data to retrieve from the dataset
- `shuffle` - Whether to shuffle the data at each iteration or not (default: False)
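The chunk-gathering step can be sketched as below. This is a simplified, NumPy-only reconstruction (the repository works on torch tensors loaded from the pre-processed `.dat` files); everything beyond the documented signature is an assumption:

```python
import numpy as np

def get_chunks(tensors, chunk_indices, chunk_size, last_chunk_size, n_chunks, shuffle=False):
    """Sketch: gather 'n_chunks' chunks of consecutive samples from the
    dataset tensors and concatenate them into a single chunk aggregate."""
    n_samples = len(tensors[0])
    total_chunks = int(np.ceil(n_samples / chunk_size))
    parts = [[] for _ in tensors]
    for idx in chunk_indices[:n_chunks]:
        start = idx * chunk_size
        # the last chunk of the dataset may be shorter than 'chunk_size'
        size = last_chunk_size if idx == total_chunks - 1 else chunk_size
        for t, p in zip(tensors, parts):
            p.append(t[start:start + size])
    chunk_agg = [np.concatenate(p) for p in parts]
    if shuffle:
        # shuffle samples *within* the aggregate, keeping all tensors aligned
        perm = np.random.permutation(len(chunk_agg[0]))
        chunk_agg = [t[perm] for t in chunk_agg]
    return chunk_agg
```

Note that shuffling here permutes only the samples inside the aggregate, which is what makes the loader's shuffling "localised".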
### `get_batch(chunk_agg, batch_size, i, return_malicious, return_counts, return_tags)` (function)

Get a batch of data from a chunk aggregate.

Arguments:

- `chunk_agg` - Chunk aggregate from the get_chunks function
- `batch_size` - How many samples to load
- `i` - Current batch index
- `return_malicious` - Whether to return the malicious label for the data points or not (default: False)
- `return_counts` - Whether to return the counts for the data points or not (default: False)
- `return_tags` - Whether to return the tags for the data points or not (default: False)
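Batch extraction amounts to slicing the aggregate. A minimal sketch, assuming the aggregate is an `(X, y)` pair where `y` packs the label components column-wise as `[malicious | count | tags...]` (this layout is an assumption, not taken from the source):

```python
import numpy as np

def get_batch(chunk_agg, batch_size, i,
              return_malicious=False, return_counts=False, return_tags=False):
    """Sketch: slice the i-th batch of 'batch_size' samples out of a chunk aggregate."""
    X, y = chunk_agg
    start = i * batch_size
    X_b = X[start:start + batch_size]
    y_b = y[start:start + batch_size]
    labels = {}
    if return_malicious:
        labels['malware'] = y_b[:, 0]   # assumed column layout
    if return_counts:
        labels['count'] = y_b[:, 1]
    if return_tags:
        labels['tags'] = y_b[:, 2:]
    return X_b, labels
```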
### `get_chunks_unpack(args)` (function)

Pass-through function for unpacking get_chunks arguments.

Arguments:

- `args` - Arguments dictionary
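This wrapper exists because `ThreadPool.map` passes each task a single object, so keyword arguments have to be packed into a dict and unpacked inside the worker. A sketch of the pattern, using a stand-in target function rather than the real `get_chunks`:

```python
from multiprocessing.pool import ThreadPool

# stand-in for the real get_chunks, just to demonstrate the pattern
def get_chunks(chunk_indices, chunk_size):
    return [i * chunk_size for i in chunk_indices]

def get_chunks_unpack(args):
    # Pool.map delivers one object per task: unpack the kwargs dict here
    return get_chunks(**args)

with ThreadPool(2) as pool:
    jobs = [{'chunk_indices': [0, 1], 'chunk_size': 4},
            {'chunk_indices': [2, 3], 'chunk_size': 4}]
    results = pool.map(get_chunks_unpack, jobs)
```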
### FastTensorDataLoader (class)

A DataLoader-like object for a set of tensors that can be much faster than TensorDataset + DataLoader, because DataLoader grabs individual indices of the dataset and calls cat (slow).
It asynchronously (if num_workers > 1) loads the dataset into memory in randomly chosen chunks, which are concatenated together to form a 'chunk aggregate'; the data inside a chunk aggregate is then shuffled.
Finally, batches of data are extracted from the chunk aggregate. The sample shuffling is therefore more localised, but the loading speed is greatly increased.

#### `__init__(self, *tensors, batch_size, chunk_size, chunks, shuffle, num_workers, use_malicious_labels, use_count_labels, use_tag_labels)` (member function)

Initialize the FastTensorDataLoader class.

Arguments:

- `*tensors` - Dataset tensors. Must have the same length @ dim 0
- `batch_size` - Size of the batches to load (default: BATCH_SIZE)
- `chunk_size` - Size (in # of samples) of a single chunk of data (default: CHUNK_SIZE)
- `chunks` - Number of chunks of data to retrieve from the dataset (default: CHUNKS)
- `shuffle` - If True, shuffle the data whenever an iterator is created out of this object (default: False)
- `num_workers` - How many workers (threads) to use for data loading (default: None -> 1 worker)
- `use_malicious_labels` - Whether to return the malicious label for the data points or not (default: False)
- `use_count_labels` - Whether to return the counts for the data points or not (default: False)
- `use_tag_labels` - Whether to return the tags for the data points or not (default: False)
Other member functions:

- `__del__(self)` (member function) - FastTensorDataLoader destructor
- `__iter__(self)` (member function) - Returns the FastTensorDataLoader (dataset iterator) itself
- `__next__(self)` (member function) - Get the next batch of data
- `__len__(self)` (member function) - Get the FastTensorDataLoader length (number of batches)
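The chunk-aggregate iteration scheme can be illustrated with a minimal, single-threaded sketch. This is not the repository's implementation: it drops the worker pool and the label-selection flags, and uses a generator `__iter__` for brevity where the documented class uses `__iter__`/`__next__`:

```python
import math
import numpy as np

class FastTensorDataLoader:
    """Simplified sketch of a chunked, locally-shuffled tensor loader."""

    def __init__(self, *tensors, batch_size=32, chunk_size=256, chunks=4, shuffle=False):
        assert all(len(t) == len(tensors[0]) for t in tensors), "tensors must share dim 0"
        self.tensors = tensors
        self.batch_size = batch_size
        self.chunk_size = chunk_size
        self.chunks = chunks
        self.shuffle = shuffle
        self.n_samples = len(tensors[0])

    def __len__(self):
        # total number of batches
        return math.ceil(self.n_samples / self.batch_size)

    def __iter__(self):
        n_chunks = math.ceil(self.n_samples / self.chunk_size)
        order = np.random.permutation(n_chunks) if self.shuffle else np.arange(n_chunks)
        # load 'self.chunks' chunks at a time into an aggregate, then batch it
        for group_start in range(0, n_chunks, self.chunks):
            group = order[group_start:group_start + self.chunks]
            idx = np.concatenate([np.arange(c * self.chunk_size,
                                            min((c + 1) * self.chunk_size, self.n_samples))
                                  for c in group])
            if self.shuffle:
                np.random.shuffle(idx)  # shuffling is localised to this aggregate
            for b in range(0, len(idx), self.batch_size):
                sel = idx[b:b + self.batch_size]
                yield tuple(t[sel] for t in self.tensors)
```

The trade-off the class description mentions is visible here: samples are only ever permuted within one aggregate of `chunks * chunk_size` samples, but reads from the underlying tensors stay as large contiguous slices.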
### GeneratorFactory (class)

Generator factory class.

#### `__init__(self, ds_root, batch_size, chunk_size, chunks, mode, num_workers, n_samples, use_malicious_labels, use_count_labels, use_tag_labels, return_shas, shuffle)` (member function)

Initialize the generator factory class.

Arguments:

- `ds_root` - Path of the directory where to find the pre-processed dataset (containing .dat files)
- `batch_size` - How many samples per batch to load (default: BATCH_SIZE)
- `chunk_size` - Size (in # of samples) of a single chunk of data (default: CHUNK_SIZE)
- `chunks` - Number of chunks of data to retrieve from the dataset (default: CHUNKS)
- `mode` - Mode of use of the dataset object (it may be 'train', 'validation' or 'test') (default: 'train')
- `num_workers` - How many workers (threads) to use for data loading (default: max_workers)
- `n_samples` - Number of samples to consider (used just to access the right pre-processed files) (default: None -> all)
- `use_malicious_labels` - Whether to return the malicious label for the data points or not (default: False)
- `use_count_labels` - Whether to return the counts for the data points or not (default: False)
- `use_tag_labels` - Whether to return the tags for the data points or not (default: False)
- `return_shas` - Whether to return the sha256 of the data points or not (default: False)
- `shuffle` - Set to True to have the data reshuffled at every epoch (default: None -> if mode is 'train', shuffle is set to True; otherwise it is set to False)
#### `__call__(self)` (member function)

Generator-factory call method.
### `get_generator(ds_root, batch_size, chunk_size, chunks, mode, num_workers, n_samples, use_malicious_labels, use_count_labels, use_tag_labels, return_shas, shuffle)` (function)

Get a generator based on the provided arguments.

Arguments:

- `ds_root` - Path of the directory where to find the pre-processed dataset (containing .dat files)
- `batch_size` - How many samples per batch to load (default: BATCH_SIZE)
- `chunk_size` - Size (in # of samples) of a single chunk of data (default: CHUNK_SIZE)
- `chunks` - Number of chunks of data to retrieve from the dataset (default: CHUNKS)
- `mode` - Mode of use of the dataset object (may be 'train', 'validation' or 'test') (default: 'train')
- `num_workers` - How many workers (threads) to use for data loading (if None -> set to current system cpu count) (default: None)
- `n_samples` - Number of samples to consider (used just to access the right pre-processed files) (default: None -> all)
- `use_malicious_labels` - Whether to return the malicious label for the data points or not (default: True)
- `use_count_labels` - Whether to return the counts for the data points or not (default: True)
- `use_tag_labels` - Whether to return the tags for the data points or not (default: True)
- `return_shas` - Whether to return the sha256 of the data points or not (default: False)
- `shuffle` - Set to True to have the data reshuffled at every epoch (default: None -> if mode is 'train', shuffle is set to True; otherwise it is set to False)
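The documented defaults for `mode`, `num_workers` and `shuffle` can be sketched with a hypothetical helper (`resolve_loader_defaults` is not a name from the source; it only illustrates the resolution rules stated above):

```python
from multiprocessing import cpu_count

def resolve_loader_defaults(mode='train', num_workers=None, shuffle=None):
    """Hypothetical helper: resolve get_generator's documented default behaviour."""
    if mode not in ('train', 'validation', 'test'):
        raise ValueError("mode must be 'train', 'validation' or 'test'")
    if num_workers is None:
        num_workers = cpu_count()  # default to the current system CPU count
    if shuffle is None:
        shuffle = (mode == 'train')  # reshuffle every epoch only while training
    return mode, num_workers, shuffle
```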