Dev Tutorial: Extending distpy with new ingesters

Writing a new HDF5 ingester

There are a lot of different HDF5 formats in use for DAS, so it is likely that you will come across one that is not supported by the current distpy package. This tutorial goes through the process of creating a distpy ingester for an unknown *.h5 file type.

Analysing an HDF5 file

The HDF5 format stores data in labelled blocks (groups and datasets), which means that inside the file there is a structure similar to a directory tree.
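
If you just want a raw look at that tree, plain h5py can walk it as well; a minimal sketch, independent of distpy:

import h5py

with h5py.File('myUnknownFormat.h5', 'r') as readObj:
    # print every group/dataset key, plus the shape where one exists
    readObj.visititems(lambda name, obj: print(name, getattr(obj, 'shape', '')))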

To access this structure, distpy offers the 'distpy.io_help.h5_helpers' module and, in particular, its 'list_keys' function.

The following code snippet can be used to interrogate the file:

import distpy.io_help.h5_helpers
import h5py

filename = 'myUnknownFormat.h5'
basepath =  'no_path_yet'

readObj = h5py.File(filename,'r')
distpy.io_help.h5_helpers.list_keys(readObj,basepath)

If the file follows the PRODML style from Energistics, the result should look something like:

/Acquisition
/Acquisition/FacilityCalibration[0]
/Acquisition/FacilityCalibration[0]/Calibration[0]
/Acquisition/FacilityCalibration[0]/Calibration[0]/LocusDepthPoint     (2486,)
/Acquisition/Raw[0]
/Acquisition/Raw[0]/RawData     (60000, 2486)
/Acquisition/Raw[0]/RawDataTime     (60000,)

This particular example doesn't show the pulse repetition frequency (PRF) or the depth spacing directly, but there is a depth axis called LocusDepthPoint and a time axis called RawDataTime, and the RawData appears to be in (time, depth) ordering.
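
The shapes are already visible in the listing above, but you can also confirm the ordering by reading them directly; a quick check along these lines:

print(readObj['/Acquisition/Raw[0]/RawData'].shape)        # (60000, 2486)
print(readObj['/Acquisition/Raw[0]/RawDataTime'].shape)    # (60000,)
print(readObj['/Acquisition/FacilityCalibration[0]/Calibration[0]/LocusDepthPoint'].shape)   # (2486,)
# the first axis of RawData matches RawDataTime and the second matches
# LocusDepthPoint, so the data is stored as (time, depth)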

Recovering time and depth axes

We print a couple of values from the LocusDepthPoint and the RawDataTime to understand their format:

depths = readObj['/Acquisition/FacilityCalibration[0]/Calibration[0]/LocusDepthPoint']
times = readObj['/Acquisition/Raw[0]/RawDataTime']

print(depths[:2])
print(times[:2])

The output will be something like:

[(0, 0.        ,  0.  ) (1, 2.55099106, 16.17)]
[1571223853164308 1571223853164408]

Here we recognize that the depth in metres is given by the second value in each tuple from LocusDepthPoint, and that the time is given as an integer number of microseconds in RawDataTime.
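
To see how the LocusDepthPoint records are laid out, h5py exposes the compound dtype of the dataset; a quick inspection (the field names reported will depend on the file):

print(depths.dtype)        # compound dtype with one entry per value in the tuple
print(depths.dtype.names)  # the field names used by this particular file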

We can get the PRF by differencing the first two samples in the time axis. Similarly, we can get the depth spacing by differencing the first two samples in the depth axis. For a more robust estimate, we take the mean difference over all pairs of neighbouring samples.

The following code does this:

import numpy

# convert the integer microsecond timestamps to seconds
time_vals = numpy.zeros((times.shape[0]))
for a in range(times.shape[0]):
    time_vals[a] = times[a]*1e-6
print('PRF = ',numpy.round(1.0/numpy.mean(time_vals[1:]-time_vals[:-1])))

# the measured depth is the second entry of each LocusDepthPoint record
depth_vals = numpy.zeros((depths.shape[0]))
for a in range(depths.shape[0]):
    tuple_val = depths[a]
    depth_vals[a] = tuple_val[1]
print('dx = ',numpy.mean(depth_vals[1:]-depth_vals[:-1]))

and in our example the result was:

PRF =  10000.0
dx =  2.5509910583496094
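
As a quick sanity check, these values are consistent with the dataset shapes seen earlier; a small calculation using the numbers above:

nt, nx = 60000, 2486   # RawData shape from the listing
prf = 10000.0
dx  = 2.5509910583496094

print('duration in seconds    = ', nt/prf)      # 6.0
print('fibre length in metres = ', (nx-1)*dx)   # roughly 6339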

We can now see that this file contains 6 seconds of data (60000 time samples at a PRF of 10000 Hz), and we have the information we need to extend h5_helpers.py.

First we detect this file type by recognizing the dataset paths for time, depth and data, adding what we have learned so far to the ingest_h5 function as a new type:

def ingest_h5(filename,basepath):
    readObj = h5py.File(filename,'r')
    
    # TYPE 0:
    depth = 'depth' in readObj
    data  = 'data' in readObj
    if depth and data:
        type0_h5(readObj,basepath,filename)
    
    # TYPE 1:
    depth = '/Acquisition/FacilityCalibration[0]/Calibration[0]/LocusDepthPoint' in readObj
    time  = '/Acquisition/Raw[0]/RawDataTime' in readObj
    data  = '/Acquisition/Raw[0]/RawData' in readObj
    if depth and time and data:
        type1_h5(readObj,basepath)
    readObj.close()

Then we add the function that reads the file:

def type1_h5(readObj, basepath):
    depths = readObj['/Acquisition/FacilityCalibration[0]/Calibration[0]/LocusDepthPoint']
    times  = readObj['/Acquisition/Raw[0]/RawDataTime']
    
    # The data can be large - so don't read it all into memory at once.
    data_name = '/Acquisition/Raw[0]/RawData'

    # convert the integer microsecond timestamps to seconds
    time_vals = numpy.zeros((times.shape[0]))
    for a in range(times.shape[0]):
        time_vals[a] = times[a]*1e-6
    # the PRF must be an integer number of samples per one-second file
    prf = int(numpy.round(1.0/numpy.mean(time_vals[1:]-time_vals[:-1])))


    # the measured depth is the second entry of each LocusDepthPoint record
    depth_vals = numpy.zeros((depths.shape[0]))
    for a in range(depths.shape[0]):
        tuple_val = depths[a]
        depth_vals[a] = tuple_val[1]
    numpy.save(os.path.join(basepath,'measured_depth.npy'),depth_vals)

    # start time of the recording as an integer unix epoch (in seconds)
    unixtime = int(time_vals[0])
    print(str(unixtime))

    # How many 1 second files to write from this HDF5
    nsecs = int(times.shape[0]/prf)
    ii=0
    for a in range(nsecs):
        fname = str(int(unixtime+a))+'.npy'
        fullPath = os.path.join(basepath,fname)
        if not os.path.exists(fullPath):
            # In distpy the processing assumes (depth,time) ordering, so we will be swapping the ordering as we read.
            numpy.save(fullPath,readObj[data_name][ii:ii+prf,:].transpose())
        else:
            print(fullPath,' exists, assuming restart run and skipping.')
        ii += prf   # advance by one second's worth of samples

This way the new reader is available via the generic 'ingest_h5' function.
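
As a usage sketch, a minimal call might look like the following (the file and output-directory names are placeholders):

import os
import distpy.io_help.h5_helpers as h5_helpers

filename = 'myUnknownFormat.h5'   # placeholder input file
basepath = 'ingested'             # placeholder output directory
os.makedirs(basepath, exist_ok=True)

h5_helpers.ingest_h5(filename, basepath)
# basepath should now contain measured_depth.npy plus one <unixtime>.npy
# file per second of data, each stored in (depth, time) ordering.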