Generating Datasets - VasSkliris/mly GitHub Wiki

Now that you have your injections in place, you can create your first datasets. First, I want to introduce the three modes of noise (noise types) that are implemented in the generators.

Modes of noise

Optimal is noise whose spectrum follows the characteristic PSD curve of each detector. This curve is a simple smooth line. The mode loads that smooth curve in the frequency domain and adds Gaussian perturbations at every frequency, so each realisation is noisy in a different way. This mode is useful when you start experimenting with your networks because its behaviour is always the same: when you whiten this noise, it always gives you back pure white noise. You cannot obtain scientific results by testing on this noise, because it is not realistic.
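The idea behind the optimal mode can be sketched in a few lines of NumPy. This is only a minimal illustration, not the generator's actual code: `toy_psd` is a made-up smooth curve standing in for a detector PSD, and `gaussian_noise_from_psd` is a hypothetical helper name.

```python
import numpy as np

def toy_psd(freqs):
    """Hypothetical smooth one-sided PSD; a stand-in for a detector curve."""
    f = np.maximum(freqs, 1.0)  # avoid dividing by zero at DC
    return 1e-46 * ((f / 100.0) ** -4 + 1.0 + (f / 500.0) ** 2)

def gaussian_noise_from_psd(psd, fs, duration, rng):
    """Draw one time series whose expected one-sided PSD is psd(f)."""
    n = int(fs * duration)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    # For a one-sided PSD S(f), each DFT bin has variance S(f) * fs * n / 2,
    # split equally between its real and imaginary parts.
    sigma = np.sqrt(psd(freqs) * fs * n / 4.0)
    spectrum = sigma * (rng.standard_normal(freqs.size)
                        + 1j * rng.standard_normal(freqs.size))
    spectrum[0] = 0.0  # drop the DC term
    return np.fft.irfft(spectrum, n=n)

# Two seconds of noise at 2048 Hz; every call gives a new realisation.
rng = np.random.default_rng(seed=0)
noise = gaussian_noise_from_psd(toy_psd, fs=2048, duration=2, rng=rng)
```

Because the random draws are made independently at every frequency, two calls share the same underlying spectrum but never the same samples.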

Real is pure band-passed noise from the detectors. To use it, you need segments from the detectors in ligo_noise. For more information about how to organise the detector noise segments, see Detector noise. Real noise is the only type of noise you can use to obtain scientific results with your networks. However, you will soon see that the amount of data available is not enough for our needs. For that reason we use time shifts (or lags).

Sudo real is a combination of the optimal and real modes. To use it, you need segments from the detectors in ligo_noise. In this case the algorithm takes a real noise segment and calculates its PSD. Based on that PSD, it creates Gaussian noise in the frequency domain and transforms it back to the time domain. This way a real PSD can be used more than once. This mode is useful for creating bigger datasets from a limited amount of real data. Again, though, you cannot obtain scientific results by testing on this noise, because it is not realistic.
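A rough sketch of that procedure, assuming a simple averaged periodogram as the PSD estimate (the generator may well estimate it differently). The function names are hypothetical, and the "real" segment here is just simulated white noise so the example runs on its own.

```python
import numpy as np

def estimate_psd(segment, fs, nperseg):
    """One-sided PSD from averaged Hann-windowed periodograms (no overlap)."""
    window = np.hanning(nperseg)
    n_chunks = segment.size // nperseg
    chunks = segment[: n_chunks * nperseg].reshape(n_chunks, nperseg)
    spectra = np.fft.rfft(chunks * window, axis=1)
    # The factor 2 folds negative frequencies into the one-sided estimate
    # (the DC and Nyquist bins are slightly overestimated by this shortcut).
    psd = 2.0 * np.mean(np.abs(spectra) ** 2, axis=0) / (fs * np.sum(window ** 2))
    return np.fft.rfftfreq(nperseg, d=1.0 / fs), psd

def resample_noise(segment, fs, duration, rng, nperseg=1024):
    """Draw fresh Gaussian noise that follows the PSD measured on `segment`."""
    freqs_est, psd_est = estimate_psd(segment, fs, nperseg)
    n = int(fs * duration)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    psd = np.interp(freqs, freqs_est, psd_est)  # match the output resolution
    sigma = np.sqrt(psd * fs * n / 4.0)         # per real/imaginary part
    spectrum = sigma * (rng.standard_normal(freqs.size)
                        + 1j * rng.standard_normal(freqs.size))
    spectrum[0] = 0.0
    return np.fft.irfft(spectrum, n=n)          # back to the time domain

rng = np.random.default_rng(seed=1)
fs = 2048
segment = rng.standard_normal(fs * 16)  # stand-in for a real detector segment
new_noise = resample_noise(segment, fs, duration=2, rng=rng)
```

Each call to `resample_noise` reuses the same measured PSD but returns different samples, which is why one real segment can feed many dataset instances.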

Generator functions

There are two core generator functions:

  • data_generator_inj(parameters, length, fs, size, arg*)

    This function creates a dataset with a specific type of injection. The parameters argument is a list with three elements: [source_file, noise_type, SNR], where source_file is the name of the file with the injections. For example, if you need to make a dataset with CBCs that are in MLy_Workbench/injections/cbc/cbc_00, your source_file must be a string with the path from the injections directory onwards: 'cbcs/cbc_00'. The parameter noise_type must be one of the three modes of noise introduced above: 'optimal', 'real' or 'sudo_real'. SNR is a float with the combined SNR we want the injection to have. The other three inputs are the length of the injection in seconds, the sample frequency, and the size of the dataset, that is, how many injections we want the dataset to contain.

  • data_generator_noise(noise_type, length, fs, size, arg*)

    This function creates a dataset with just noise. The only parameter in this case is noise_type, which is 'optimal', 'real' or 'sudo_real'. The other three inputs are the same as before: the length in seconds of each noise-only instance, the sample frequency, and the size of the dataset, that is, how many noise-only instances we want the dataset to contain.

Let us see an example. To generate a dataset of 10000 ringdown-burst injections with SNR=30 on optimal noise, from the injection file ringdown_00, with sample frequency 2048 and length 2 seconds, I use the following command:

  • data_generator_inj(['bursts/ringdown_00', 'optimal', 30], 2, 2048, 10000)

To create a noise-only set with optimal noise and the same characteristics, I use:

  • data_generator_noise('optimal', 2, 2048, 10000)

These functions generate a .mat file that is used later in training. There are also two more important arguments that these functions accept.

  • noise_file: This argument is a list of two strings that indicate which noise segment the function uses when the noise type is not optimal. For example, assuming that the user has downloaded detector data in the ligo_data file and we want to use the fourth segment on 25 August 2017, this is the entry for that value:
    noise_file = ['20170825','SEG4_1187711176_22478s.txt']

  • lags: This argument goes together with noise_file and defaults to 1 (which means no shifts). It is the number of time shifts used for the generation. Time shifts are combinations of different instantiations of noise from different GPS times. This helps us create even more realistic test samples for our networks.
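To make the idea of time shifts concrete, here is a minimal sketch that pairs chunks of noise from two detectors at different offsets. It is an assumption about how lags could be realised, not the generator's actual implementation: the function name, the two-detector setup and the wrap-around behaviour are all hypothetical.

```python
import numpy as np

def time_shifted_pairs(noise_a, noise_b, samples_per_instance, lags=1):
    """Pair chunk i of detector A with chunk i + lag of detector B.

    lags=1 keeps only the unshifted pairing; each extra lag adds another
    full set of combinations built from the same two noise streams.
    """
    n_chunks = min(noise_a.size, noise_b.size) // samples_per_instance
    used = n_chunks * samples_per_instance
    a = noise_a[:used].reshape(n_chunks, samples_per_instance)
    b = noise_b[:used].reshape(n_chunks, samples_per_instance)
    pairs = []
    for lag in range(lags):
        # Row i of shifted_b is chunk (i + lag) of B, wrapping at the end.
        shifted_b = np.roll(b, -lag, axis=0)
        pairs.append(np.stack([a, shifted_b], axis=1))
    # Shape: (lags * n_chunks, 2 detectors, samples_per_instance)
    return np.concatenate(pairs, axis=0)

rng = np.random.default_rng(seed=2)
fs, length = 2048, 2
det_a = rng.standard_normal(fs * 20)  # stand-ins for two detectors' noise
det_b = rng.standard_normal(fs * 20)
instances = time_shifted_pairs(det_a, det_b, fs * length, lags=3)
```

With 20 seconds of noise per detector and 2-second instances, the unshifted pairing gives 10 instances, while lags=3 gives 30 from the same data.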

So far we have two functions that generate whatever we might need for training the networks. However, those functions are limited by the size of the noise_file we use. What if we want to create many big datasets? What if we don't have noise_files big enough to do that?