Data Objects - VasSkliris/mly GitHub Wiki
One of the core goals of this package is to organise data into datasets used to train or test an ML algorithm. The building block of these datasets is called DataPod. Many DataPod objects put together create a DataSet object that is fed to a network. Here we go through their attributes and methods, and most importantly the generator method of DataSets.
DataPod is an object that concentrates all the information of a data instance. It was created to encapsulate all the useful attributes a machine learning algorithm will need. DataPods are mutable and easy to modify.
- strain: The main attribute of a DataPod is the strain, the main numerical data. It can have any shape, although timeseries-related shapes are the most common. The data can be a numerical list, a numpy array or a gwpy TimeSeries. Every input is converted into a `numpy.ndarray` object and checked for `inf` or `nan` values.
- fs (optional): The sample frequency has to be declared for timeseries data so that the duration of the data can be inferred. It is ignored for non-timeseries data.
- gps (optional): GPS time is useful information in the case of real data. If not specified it is set to zero.
- labels (optional): If the data are going to be used by a machine learning algorithm, labels are mandatory. Pods store their labels in a dictionary. The default label is `'type'`, usually referring to noise, signal, cbc etc. If not defined it gets the value `{'type': 'UNDEFINED'}`.
- detectors (optional): When data are related to detectors we can declare them in the same order as they appear in the data. If they are not defined, the pod will assign N undefined detectors `'U'`, where N is the smallest dimension of the data. To declare them you can use a list or a string with any of the characters `H, L, V, K, I, U`.
- duration (optional): If the data are 1- or 2-dimensional, the duration is inferred from fs; otherwise it is a mandatory parameter.
- metadata (optional): A dictionary, like labels, used just for extra information.
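The input conversion and checks described above can be sketched in plain numpy. This is a simplified illustration of the behaviour, not mly's actual implementation, and the helper name `to_strain` is hypothetical:

```python
import numpy as np

def to_strain(data, fs=None):
    """Convert input to an ndarray, reject inf/nan, and infer duration.

    Simplified sketch of the checks a DataPod performs; not mly's code.
    """
    arr = np.asarray(data, dtype=float)   # list / ndarray / TimeSeries-like -> ndarray
    if not np.isfinite(arr).all():        # reject inf or nan values
        raise ValueError("strain contains inf or nan values")
    duration = None
    if fs is not None and arr.ndim <= 2:  # duration inferred only for 1D/2D timeseries
        duration = arr.shape[-1] / fs
    return arr, duration

# Two detectors, 1024 samples at 1024 Hz -> duration of 1 second.
strain, duration = to_strain([[0.0] * 1024, [0.0] * 1024], fs=1024)
```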
All dataPod objects support the basic numerical operations. Adding a number adds it to all elements; adding an array performs element-wise addition, provided the shapes of the arrays match. All other operations follow the same rules.
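In numpy terms, the arithmetic described above behaves as follows (illustrated on raw arrays rather than actual DataPod objects):

```python
import numpy as np

strain = np.array([[1.0, 2.0], [3.0, 4.0]])

shifted = strain + 10     # scalar: added to every element
summed = strain + strain  # same-shape array: element-wise addition
```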
- save(name=None): You can save the dataPod by using the save method. `name` is a string that can be the complete path name for the pod to be saved. If name is not given, a random name is generated and the pod is saved in the current directory.
- load(filename): Load an already saved DataPod object and assign it to a variable. `filename` is the full path of the DataPod file.
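Saved sets appear on disk as pickle files (note the `.pkl` extension used in the fusion example further down). Under that assumption, a save/load round-trip can be sketched with the standard pickle module; this illustrates the idea only and is not mly's documented API:

```python
import os
import pickle
import tempfile

# A stand-in for a DataPod: any picklable object round-trips the same way.
pod = {"strain": [0.0] * 8, "fs": 8, "labels": {"type": "noise"}}

path = os.path.join(tempfile.mkdtemp(), "pod.pkl")
with open(path, "wb") as f:   # save(...)
    pickle.dump(pod, f)
with open(path, "rb") as f:   # load(...)
    restored = pickle.load(f)
```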
- plot(type_=None): Returns a plot of the data, colour-coded by detector. `type_` is an option to instead plot one of the "extras" the pod might have ('psd' or 'correlation'). If not specified, the strain is plotted.
DataSet objects are collections of dataPods. They also include operations for filtering out subsets of pods, reshaping them, inserting them into trainings, and keeping information about these collections.
- dataPods (optional): The core of the dataset is a list of dataPod objects. A dataSet can be declared empty, so dataPods do not need to be provided.
- name (optional): A name for the dataSet.
To declare a dataSet object you can put dataPod objects in a list; you can then add more dataPods or another DataSet. There is also the option of declaring a dataSet by filtering some pods out of another DataSet. Finally, and most importantly, you can generate a dataSet by giving arguments describing what you want to generate.
- save(name=None): You can save the dataSet by using the save method. `name` is a string that can be the complete path name for the set to be saved. If name is not given, a random name is generated and the dataset is saved in the current directory.
- load(filename): Load an already saved DataSet object and assign it to a variable. `filename` is the full path of the dataSet file.
- add(newData): With the add method you can add new dataPods or another dataSet object to a pre-existing dataSet. `newData` can be a dataSet object, a dataPod or a list of dataPods.
Use example:

```python
set1 = DataSet()
set2 = DataSet([<pod1>, <pod2>, <pod3>])
set1.add(set2)    # adding another dataSet
set1.add(<pod3>)  # adding another dataPod
```
- fusion(elements, sizes=None, save=None): The fusion method collects dataSet objects and merges them into a single dataSet. It can also sample each dataSet down to a specific desired amount and, if asked, save the result as a new dataSet.
  - `elements` is the list of the dataSets to be fused. The list can contain dataSet objects already declared or, in the most common case, file paths of already saved DataSets.
  - `sizes` is an optional argument indicating how many instances you want from each dataSet used in the fusion. If left empty the whole dataSet is loaded. If you want to use part of only one dataSet you have to define the sizes of the others too; the sizes list must have the same length as the elements list.
  - `save` is an optional argument indicating whether the fused dataSet should also be saved. To save the dataSet after fusion, give a file path as a string.
Use example:

```python
# Declare a dataset that fuses elements from two other dataSets,
# 1000 from the first and 2000 from the second.
myset = DataSet.fusion(['myfile/set1.pkl', 'myfile/set2.pkl'], [1000, 2000])
```
- unloadData(extras=None, shape=None): Returns the data of the dataset as an array.
  - `extras` is one of the extra options saved in the metadata of the pods. If they exist, you can pass their name as a string to unload them instead of the strain.
  - `shape` is the desired shape in which to unload the data. Only transposing is allowed at this stage.
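Conceptually, unloading stacks every pod's strain into one array and optionally transposes it. A numpy sketch of that behaviour (an illustration, not mly's code, with made-up pod contents):

```python
import numpy as np

# Three hypothetical pods, each with 2 detectors x 4 samples.
pods = [np.ones((2, 4)) * i for i in range(3)]

data = np.stack(pods)                 # shape (3, 2, 4): (size, detectors, samples)
transposed = data.transpose(0, 2, 1)  # only transposing is allowed: (3, 4, 2)
```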
- unloadLabels(*args, reshape=False): Returns the labels of the dataset.
  - `*args` can be any number of the key entries in the labels of the dataPods. If no arguments are given, only the main label 'type' is returned.
  - `reshape` is used, if desired, only in the case where a single label is unloaded. It turns a shape (N,) array into shape (1, N).
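The reshape option corresponds to this numpy operation (illustration only, with made-up label values):

```python
import numpy as np

labels = np.array(['noise', 'signal', 'noise'])  # shape (3,)
reshaped = labels.reshape(1, -1)                 # shape (1, 3)
```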
The generator method puts together the generator functions of the older versions. It can be used to generate any type of timeseries data in the detectors, given that the injections exist. The final set is then ready to be fed to a machine learning algorithm.
To create a dataset you just use the following function:

```python
myset = DataSet.generator(<arguments> ...)
```
Using the right arguments will give you the right dataSet. Here I will describe the arguments and what they are for. For examples of different dataSets you can check .
- duration: The duration of the individual data instances you want to generate. Note that this is independent of the injection duration you might use. It is suggested to use multiples or fractions of 2 seconds.
- fs: Sample frequency of the data. Together, fs and duration determine the size of each data instance.
- size: The size of the dataset, i.e. the number of instances you want to generate. Numbers below 10000 are suggested. If you want to generate a larger amount of data you can check for better ways.
- detectors: The detectors we want to include in this dataset. You can use the initials of the detectors H, L, V, K, I and U, where U is undefined. Each detector you include adds a channel, increasing the depth of the data. They follow the same rules as DataPod.detectors.
- injectionFolder (optional): If you want to use injections, this is the path to the folder with the injections. Note that the structure of the injection folder should follow the structure described in . If injectionFolder is set, some of the following parameters are no longer optional.
- labels (optional): The labels that the generated set will have. You cannot have different types of the same label, but you can have multiple labels. It is a dictionary, like DataPod.labels. If not defined it gets the value `{'type': 'UNDEFINED'}`.
- backgroundType (optional): Background type refers to the modes of noise (described in this link) that we use to create the detector background. If not specified, 'optimal' is selected. If it is not set to 'optimal', you will also have to set a noise source file (see noiseSourceFile below).
- injectionSNR (optional): The combined SNR value over all detectors. If you set injectionSNR to a value different from zero, injectionFolder has to be set too. If not specified it is set to 0, which gives back a pure-noise dataset.
- noiseSourceFile (optional): If backgroundType is not 'optimal', you will have to use real data to generate your background. Those data have to be in the right file format, as described in . This argument is a list with two strings indicating which noise segment the function is going to use when the noise type is not 'optimal'. For example, assuming the user has downloaded detector data into a file named '20170825' (indicating the day) and we want to use the fourth segment of that day, the value of this argument would be `noise_file = [<path to '20170825'>, <segment filename>]`.
- windowSize (optional): This is the window that encloses the injection and is used to calculate the PSD of the background, which is then used for whitening. If not specified, the default is 8 times the duration.
- timeSlides (optional): Time slides are used when we have real or pseudo-real noise. When we have more than one detector, we shift the detectors with respect to each other to create different combinations of noise. The default is 1, which means no shifts.
- startingPoint (optional): The second, counted from the beginning of a real noise segment, at which we start using the noise. Sometimes the beginning and the end of a segment have damped oscillations that we want to avoid. The default has the same value as windowSize.
- name (optional): A name to give to the dataset. It is used if we want to save the set.
- savePath (optional): To save the set we have to specify a destination directory. If you want to save it, just specify a destination and the name value will be used as the filename. If the path is the current location, use `'./'`. If left unspecified, the dataset is not saved.
- single (optional): Single indicates the option to use the injections as glitches. If set to True, it will randomly select one detector and create an injection with SNR equal to injectionSNR only in that detector. All other detectors will be just noise. The default value is False.
- injectionCrop (optional): Injections are not placed at the same spot every time, to avoid overfitting. If the injection duration is equal to the duration of the data we want to generate, shifting is impossible without cropping part of the injection. For that reason we give the user control over what percentage of the injection may be cropped. The value of injectionCrop goes from 0 to 1 (1 is of course not suggested). A value of 0.5 will crop up to 50% of the beginning of an injection. Note that if the injection is cropped, the SNR is calculated on the cropped injection. The default value is 0.
- disposition (optional): Disposition helps you control how far apart the injections in the different detectors are relative to each other. For a real signal, the sky position determines the delay times between detectors. We can change these manually if we want to trick a network or test coherency. Disposition takes a number as value, indicating a time window within which all the injections will lie; they are distributed randomly within this window but with equal distances from each other. Disposition can also take the string value 'random', in case we want the injections to be placed at an independent random position in each detector. If it is not defined, the default positioning applies. It is suggested to use it together with the maxDuration argument when set to 'random'.
- maxDuration (optional): The maximum duration an injection from the injectionFolder can have. It is used only when disposition is set to 'random'. It prevents the generator from cropping signals. If not specified it is set equal to duration.
- differentSignals (optional): If you want different injections in each detector, set this to True. Default is False.
- extras (optional): Extras are additional quantities generated for each data instance. They are saved in the DataPod.metadata dictionary. At this time you can choose any combination of three options:
  - 'psd': This will also return the PSD of each detector for each data instance. The data will be a numpy array with N rows, where N is the number of detectors.
  - 'snr': This will return a list of N numbers, the individual SNR values of each injection if present. N is again the number of detectors.
  - 'correlation': This is the Pearson correlation for each pair of detectors. The detectors are shifted relative to each other by up to the biggest time delay two ground detectors can have, times the sample frequency. For fs = 1024 the shift runs over -22, +22, giving a (k x 44) array, where k is the number of possible combinations of 2 out of N detectors.
  The options have to be given in a list form.
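A rough numpy sketch of the 'correlation' extra, under the assumption that each relative shift produces one Pearson coefficient per detector pair (an illustration only, not mly's implementation):

```python
from itertools import combinations

import numpy as np

def shifted_correlations(strains, shifts=range(-22, 22)):
    """Pearson correlation of every detector pair over a range of relative shifts.

    Illustrative sketch only. The 44 shift values mirror the (k x 44) shape
    quoted above for fs = 1024.
    """
    rows = []
    for a, b in combinations(range(len(strains)), 2):  # k = C(N, 2) pairs
        row = [np.corrcoef(strains[a], np.roll(strains[b], s))[0, 1]
               for s in shifts]
        rows.append(row)
    return np.array(rows)

rng = np.random.default_rng(0)
strains = rng.standard_normal((3, 1024))  # N = 3 detectors -> k = 3 pairs
corr = shifted_correlations(strains)      # corr.shape == (3, 44)
```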