Dev Tutorial : Extending distpy with new ingesters - Schlumberger/distpy GitHub Wiki
Writing a new HDF5 ingester
There are a lot of different HDF5 formats being used for DAS, so it's likely that you will come across one that is not in the current distpy package. This tutorial goes through the process of creating a distpy ingester from an unknown *.h5 file-type.
Analysing an HDF5 file
The HDF5 format always contains data stored in labelled blocks. This means that inside the file there is a structure similar to a directory structure.
To access this structure, distpy offers the 'distpy.io_help.h5_helpers()' function.
The following code snippet can be used to interrogate:
import distpy.io_help.h5_helpers
import h5py
import os
filename = 'myUnknownFormat.h5'
basepath = 'no_path_yet'
readObj = h5py.File(filename,'r')
distpy.io_help.h5_helpers.list_keys(readObj,basepath)
If the file follows the PRODML style from Energistics, the result should look something like:
/Acquisition
/Acquisition/FacilityCalibration[0]
/Acquisition/FacilityCalibration[0]/Calibration[0]
/Acquisition/FacilityCalibration[0]/Calibration[0]/LocusDepthPoint (2486,)
/Acquisition/Raw[0]
/Acquisition/Raw[0]/RawData (60000, 2486)
/Acquisition/Raw[0]/RawDataTime (60000,)
This particular example doesn't show the pulse repetition frequency (PRF) or the depth spacing, but there is a depth axis called LocusDepthPoint
and a time axis called RawDataTime
, the RawData
appears to be in (time,depth) ordering.
Recovering time and depth axes
We print a couple of values from the LocusDepthPoint
and the RawDataTime
to understand their format:
depths = readObj['/Acquisition/FacilityCalibration[0]/Calibration[0]/LocusDepthPoint']
times = readObj['/Acquisition/Raw[0]/RawDataTime']
print(depths[:2])
print(times[:2])
The example will be something like:
[(0, 0. , 0. ) (1, 2.55099106, 16.17)]
[1571223853164308 1571223853164408]
Here we recognize that the spatial sampling in metres is given by the second value in a tuple from LocusDepthPoint
and the time in microseconds is given as an integer in RawDataTime
We can get the PRF by differencing the first two samples in the time axis. Similarly we can get the depth axis by difference in the first two samples in the depth axis. To have a more robust approach, we will take the mean difference between pairs of samples.
The following code allows this:
time_vals = numpy.zeros((times.shape[0]))
for a in range(times.shape[0]):
time_vals[a] = times[a]*1e-6
print('PRF = ',numpy.round(1.0/numpy.mean(time_vals[1:]-time_vals[:-1])))
depth_vals = numpy.zeros((depths.shape[0]))
for a in range(depths.shape[0]):
tuple_val = depths[a]
depth_vals[a] = tuple_val[1]
print('dx = ',numpy.mean(depth_vals[1:]-depth_vals[:-1]))
and in our example the result was:
PRF = 10000.0
dx = 2.5509910583496094
We can now see that this file contains 6 seconds of data, and we have the information we need to extend the h5_helpers.py
First we detect this file type by recognizing the header descriptors for time, depth and data - by adding what we've learned so far to the ingest_h5
function as a new type:
def ingest_h5(filename,basepath):
readObj = h5py.File(filename,'r')
# TYPE 0:
depth = 'depth' in readObj
#time = '/Acquisition/Raw[0]/RawDataTime' in readObj
data = 'data' in readObj
if depth and data:
type0_h5(readObj,basepath,filename)
# TYPE 1:
depth = '/Acquisition/FacilityCalibration[0]/Calibration[0]/LocusDepthPoint' in readObj
time = '/Acquisition/Raw[0]/RawDataTime' in readObj
data = '/Acquisition/Raw[0]/RawData' in readObj
if depth and time and data:
type1_h5(readObj,basepath)
readObj.close()
Then we add the function code to read the file
def type1_h5(readObj, basepath):
depths = readObj['/Acquisition/FacilityCalibration[0]/Calibration[0]/LocusDepthPoint']
times = readObj['/Acquisition/Raw[0]/RawDataTime']
# The data can be large - so don't read it all into memory at once.
data_name = '/Acquisition/Raw[0]/RawData'
time_vals = numpy.zeros((times.shape[0]))
for a in range(times.shape[0]):
time_vals[a] = times[a]*1e-6
prf = numpy.round(1.0/numpy.mean(time_vals[1:]-time_vals[:-1]))
depth_vals = numpy.zeros((depths.shape[0]))
for a in range(depths.shape[0]):
tuple_val = depths[a]
depth_vals[a] = tuple_val[1]
numpy.save(os.path.join(basepath,'measured_depth.npy'),depth_vals)
unixtime = int(time_vals[0])
print(str(unixtime))
# How many 1 second files to write from this HDF5
nsecs = int(times.shape[0]]/prf)
ii=0
for a in range(nsecs):
fname = str(int(unixtime+a))+'.npy'
fullPath = os.path.join(basepath,fname)
if not os.path.exists(fullPath):
# In distpy the processing assumes (depth,time) ordering, so we will be swapping the ordering as we read.
numpy.save(fullPath,readObj[data_name][ii:ii+prf,:].transpose())
else:
print(fullPath,' exists, assuming restart run and skipping.')
ii+=1
This way the new reader is available via the generic 'ingest_h5' function.