Reference: The dissection of CASE00.py - Schlumberger/distpy GitHub Wiki
CASE00.py can be found in the 'python_examples' directory and provides the skeleton example for running a distpy workflow. It can be used as a guide to running on Windows, Linux clusters, or Cloud.
Here we dissect this base-level script. The script CASE00.py is an overall controller example, suited to the case of a system with a head-node. For more distributed and automatically invoked environments see the Serverless Functions example.
Importing parts of the distpy package
The distpy package enforces the use of explicit module imports; we have found that this provides robust behaviour across the many target hardware platforms, from laptop to cloud to Edge.
This means that at the top of CASE00.py you will have:
import distpy.ingesters.parallel_ingest_sgy
import distpy.controllers.parallel_strainrate_processing
import distpy.ingesters.parallel_ingest_witsml
import distpy.controllers.parallel_plots
There is one import for each of the steps in the typical workflow (a controller sketch follows this list):
- Ingesting data from an archive to a scratch or "hot" space for rapid asynchronous parallel processing. This step could appear in a separate, automatically invoked script (see the Serverless Functions example, where there is no controlling script).
- Applying the signal processing chain asynchronously to each chunk of data, possibly writing results to the scratch space, the archive, or some intermediate project storage space.
- Ingesting results from WITSML: some or all of the results from step 2 may be useful for plotting or further signal processing. For distributed temperature data, your first step might be this one.
- Plotting the ingested results as waterfall plots or false-colour images to aid visual interpretation.
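Putting the steps together, the body of CASE00.py reduces to one call per step, each driven by its own JSON configuration. The sketch below is illustrative only: the main() entry points and the per-step JSON file names are assumptions, so consult CASE00.py itself for the exact signatures.

import distpy.ingesters.parallel_ingest_sgy
import distpy.controllers.parallel_strainrate_processing
import distpy.ingesters.parallel_ingest_witsml
import distpy.controllers.parallel_plots

# Illustrative only: entry-point names and config file names are assumed.
distpy.ingesters.parallel_ingest_sgy.main('ingest_sgy.json')               # step 1
distpy.controllers.parallel_strainrate_processing.main('strainrate.json')  # step 2
distpy.ingesters.parallel_ingest_witsml.main('ingest_witsml.json')         # step 3
distpy.controllers.parallel_plots.main('plots_system.json')                # step 4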
The JSON configuration files
The distpy system can be flexibly configured for the target hardware. We separate the JSON configuration for hardware and storage from the JSON configuration for the signal processing chain. The system is configured separately for each of the steps, because there is no requirement that all the steps run on the same hardware. For example, step 1 might be carried out on the Edge, with the data chunked and uploaded asynchronously to the compute centre on the offshore platform; step 2 is then a Linux cluster compute step, with the results being passed to cloud storage; step 3 might ingest from cloud storage to a "hot" cloud data space; and step 4 is triggered as a serverless function delivering results that can be viewed from anywhere. This type of hybrid scheme is expected, so the small overhead of specifying the same compute configuration when all 4 steps are on the same target hardware is an acceptable price to pay for the flexibility we need.
A typical configuration file will have common elements such as "in_drive" and "out_drive", corresponding to the locations of "data" and "results", together with a "project" name. These can all be thought of as directories; in the case of non-directory-based storage they are actually prefixes, but when using distpy this distinction can be ignored.
There is also a "parallel_flag", which can be set to 0 to disable parallel execution and receive more verbose output (useful for testing), and a separate "ncpu", which is actually the number of simultaneous asynchronous threads but can generally be thought of as the number of available processors.
To clarify, in the case of a Linux run, the following JSON:
{
"in_drive" : "/archive/projects/",
"out_drive" : "/scratch/myusername/",
"project" : "2020scr0001",
"data" : "sgy",
"results" : "sgy_data",
"parallel_flag" : 1,
"ncpu" : 128
}
translates to the following directory structure and hardware instruction:
INCOMING DATA: /archive/projects/2020prj0001/sgy/*
OUTGOING RESULTS: /scratch/myusername/2020prj0001/sgy_data/*
EXECUTE IN PARALLEL USING 128 ASYNCHRONOUS PROCESSES
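The path assembly is simple concatenation of drive, project, and data/results names. A minimal sketch of that convention (the resolve_paths helper is hypothetical, not a distpy function):

import os

def resolve_paths(cfg):
    # Hypothetical helper reproducing the drive/project/name convention.
    in_path = os.path.join(cfg['in_drive'], cfg['project'], cfg['data'])
    out_path = os.path.join(cfg['out_drive'], cfg['project'], cfg['results'])
    return in_path, out_path

cfg = {'in_drive': '/archive/projects/', 'out_drive': '/scratch/myusername/',
       'project': '2020prj0001', 'data': 'sgy', 'results': 'sgy_data'}
print(resolve_paths(cfg))
# ('/archive/projects/2020prj0001/sgy', '/scratch/myusername/2020prj0001/sgy_data')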
"ncpu"
A brief aside on When working with a new hardware configuration for the first time, it is worth testing each of the steps carefully. In particular the ingestion step is dominated by the file i/o, whereas the signal processing chain is dominated by compute. Depending on the design, your system may be very efficient at multiple asynchronous file i/o operations (so you can push up "ncpu"
for step 1) or could be very poor at this (in which case "ncpu"
set to 1 can be fastest as there is no risk of bottlenecking the file i/o). This is completely unrelated to the performance in step 2, where "ncpu"
is usually best set to equal the number of processors available, subject to available memory per compute node. So even in the case of a single-system run like "CASE00.py"
it is useful to have separate system configurations available for each step.
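One practical way to choose "ncpu" for the i/o-bound ingestion step is a small timing loop over a representative subset of the data. The sketch below assumes the ingester exposes a main(config_path) entry point and reads "ncpu" from the JSON file; both are assumptions, so adapt it to the actual distpy call signature.

import json
import time
import distpy.ingesters.parallel_ingest_sgy

# Assumption: the ingester takes a JSON config path and honours its "ncpu".
for ncpu in (1, 4, 16, 64):
    with open('sgy_ingest_config.json') as f:
        cfg = json.load(f)
    cfg['ncpu'] = ncpu
    with open('sgy_ingest_test.json', 'w') as f:
        json.dump(cfg, f)
    t0 = time.time()
    distpy.ingesters.parallel_ingest_sgy.main('sgy_ingest_test.json')
    print(ncpu, 'threads:', round(time.time() - t0, 1), 's')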
Additional parameters for strain-rate processing
When the data is ingested, we recover additional information. When working with SEG-Y files as the base storage, we always recover the distance along the fibre and the frequency of the strainrate sampling (the pulse repetition frequency). However, with other file formats we have found this data can be separated from the archive into separate files and so does not get ingested. This means that there is a "measured_depth" option, which gives the name of a file containing the distance-along-fibre information; this file must be located in the "data" directory. If you have ingested with a distpy ingester this file will always be present and named "measured_depth.npy". Similarly, the pulse repetition frequency is given in this configuration via the "prf" attribute.
For small memory devices (both Edge and Serverless Functions tend to fall into this category) an array-size limiter can be applied using "BOX_SIZE"; we note that this only applies to a sub-set of the available operations. The "fft" and "multiple_calcs" commands are fully supported for small memory devices; for more complicated workflows, particularly those involving 2D Fourier transforms, we recommend testing carefully. There can be significant performance gains from setting "BOX_SIZE" even on large memory devices; this comes down to the efficiency of the underlying Python numerical libraries such as numpy. The value 500 in the example below is efficient in most cases.
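The intuition behind "BOX_SIZE" is that large 2D arrays are processed in blocks of traces rather than all at once, keeping the working set at a size the underlying numpy routines handle efficiently. A minimal numpy sketch of the idea (illustrative only, not distpy's actual implementation):

import numpy as np

def process_in_boxes(data, box_size=500):
    # Illustrative: apply an FFT block-by-block over the trace axis,
    # instead of handing numpy the entire array in one call.
    out = np.empty(data.shape, dtype=complex)
    for start in range(0, data.shape[0], box_size):
        block = data[start:start + box_size, :]
        out[start:start + box_size, :] = np.fft.fft(block, axis=1)
    return out

data = np.random.randn(2000, 4096)             # 2000 traces x 4096 time samples
spectra = process_in_boxes(data, box_size=500)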
The final item, "config", gives an explicit path to the location of the signal processing configuration. The use of an explicit path here, even though the example places the file in the project location, is to allow a central store of standardized signal processing flows to be used.
To clarify, the following JSON:
{
"in_drive": "/scratch/myusername",
"out_drive": "/home/projects/",
"project": "2020prj0001",
"data": "sgy_data",
"results": "witsml",
"parallel_flag": 1,
"ncpu": 32,
"measured_depth": "measured_depth.npy",
"prf": 10000,
"BOX_SIZE": 500,
"config": "/home/projects/2020prj0001/config/strainrate2noiselog.json"
}
translates to the instructions:
INCOMING DATA: /scratch/myusername/2020prj0001/sgy_data/*
OUTGOING RESULTS: /home/projects/2020prj0001/witsml/*
EXECUTE IN PARALLEL USING 32 ASYNCHRONOUS PROCESSES
THE DISTANCE-ALONG-FIBRE FOR EACH POINT IN X IS GIVEN IN /scratch/myusername/2020prj0001/sgy_data/measured_depth.npy
THE PULSE-REPETITION-FREQUENCY IS 10000 Hz
FOR EFFICIENCY, WHERE AVAILABLE, PROCESS IN BATCHES OF 500 TRACES
USING THE SIGNAL PROCESSING FLOW DESCRIBED IN /home/projects/2020prj0001/config/strainrate2noiselog.json
The example strainrate2noiselog.json performs a Noise Log interpretation flow, which is covered in the Azure tutorial:
{
"document" : 0,
"name" : "Standard 3-band Noise Log",
"description" : "Creates total energy outputs for the 200*, 600* and 1000* bands of a conventional Noise Log.",
"command_list" :
[
{ "name" : "fft", "uid" : 1, "in_uid" : 0 },
{ "name" : "rms_from_fft", "uid" : 2, "in_uid" : 1, "low_freq" : 0, "high_freq" : -1 },
{ "name" : "multiple_calcs", "uid" : 3, "in_uid" : 1, "func" : "te_from_fft", "low_freq" : [200,600,1000], "high_freq" : [-1,-1,-1] },
{ "name" : "write_witsml", "uid" : 4, "in_uid" : 3, "directory_out" : "NoiseLog", "low_freq" :[200,600,1000], "high_freq" : [-1,-1,-1], "gather_uids" : [2], "data_style" : "NoiseLog" }
]
}
What is of special interest here is the "write_witsml" command and its "directory_out" : "NoiseLog" attribute. This will push the results to a new sub-directory called NoiseLog, so the WITSML output will be written to:
/home/projects/2020prj0001/witsml/NoiseLog/*
The aim of this scheme is to allow multiple types of output to be pushed to different sub-directories. Each output type can have multiple interpretation attributes; for example, the Noise Log case above will have 4 attributes: the total RMS and the 3 noise log total energy cuts.
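More generally, the "uid"/"in_uid" pairs in the "command_list" define a small directed graph: each command consumes the output of the command named by its "in_uid", with 0 evidently referring to the raw input data. A quick reader's utility (not part of distpy) to print the chain:

import json

with open('strainrate2noiselog.json') as f:
    flow = json.load(f)

# Print each edge of the processing graph described by the command list.
for cmd in flow['command_list']:
    print(f"uid {cmd['in_uid']} -> uid {cmd['uid']}: {cmd['name']}")
# uid 0 -> uid 1: fft
# uid 1 -> uid 2: rms_from_fft
# uid 1 -> uid 3: multiple_calcs
# uid 3 -> uid 4: write_witsml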
Ingesting the WITSML
As might be expected, the WITSML ingestion JSON has a form that is identical to the original SEG-Y ingestion:
{
"in_drive" : "/home/projects/",
"out_drive" : "/scratch/myusername",
"project" : "2020prj0001",
"data" : "witsml",
"results" : "interpretation",
"parallel_flag" : 0,
"ncpu" : 1
}
Here we are taking every subdirectory of /home/projects/2020prj0001/witsml/, where there will be many WITSML files (one per original data chunk), and writing numpy arrays consolidated by output type.
For clarity:
/home/projects/2020prj0001/witsml/NoiseLog/1234567.fbe
/home/projects/2020prj0001/witsml/NoiseLog/1234568.fbe
/home/projects/2020prj0001/witsml/NoiseLog/1234569.fbe
...
becomes
/scratch/myusername/2020prj0001/interpretation/NoiseLog/1234567/band_0-nyquist.npy
/scratch/myusername/2020prj0001/interpretation/NoiseLog/1234567/band_01_200-nyquist.npy
/scratch/myusername/2020prj0001/interpretation/NoiseLog/1234567/band_02_600-nyquist.npy
/scratch/myusername/2020prj0001/interpretation/NoiseLog/1234567/band_03_1000-nyquist.npy
/scratch/myusername/2020prj0001/interpretation/NoiseLog/1234567/measured_depth.npy
/scratch/myusername/2020prj0001/interpretation/NoiseLog/1234567/time.npy
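Once consolidated, the results are plain numpy arrays and can be inspected directly. A minimal sketch, assuming the layout shown above (the time-by-depth orientation is an assumption; check the shapes):

import numpy as np

base = '/scratch/myusername/2020prj0001/interpretation/NoiseLog/1234567'
band = np.load(base + '/band_01_200-nyquist.npy')  # one output attribute
depth = np.load(base + '/measured_depth.npy')      # distance along fibre
times = np.load(base + '/time.npy')                # one entry per time step

# The band array should be consistent with the depth and time references.
print(band.shape, depth.shape, times.shape)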
Plotting in distpy
Since distpy is a lightweight data-handling and signal processing layer, plotting is not considered core required functionality. There are many available viewers for WITSML and numpy file formats, so there is no strong need to augment what already exists. The plotting functions are, therefore, somewhat basic, suitable for quick-look annotated plots of results. They use the matplotlib package and no anti-aliasing is performed. Currently three types of plot exist: waterfall plots, false-colour stacks, and RGB false-colour.
The example plots.json illustrates all the existing plotting features:
{
"figure_size" : [ 36, 12],
"dpi" : 100,
"start_of_fibre" : 0,
"end_of_fibre" : 480,
"time_reference" : "time.npy",
"depth_reference" : "measured_depth.npy",
"depth_display_unit" : "ft",
"well_segments" : "/home/projects/jobInfo/segments.txt",
"label_list" : [
{"event_label_start" : 200, "event_mark" : "2015-06-28 13:20:00", "event_text" : "Well Shut in" },
{"event_label_start" : 300, "event_mark" : "2015-06-28 14:09:00", "event_text" : "Open Well 100 bbl/day" },
{"event_label_start" : 400, "event_mark" : "2015-06-28 14:20:00", "event_text" : "open well 450 bbl/day" },
{"event_label_start" : 600, "event_mark" : "2015-06-28 14:21:00", "event_text" : "Well Shut in" },
{"event_label_start" : 600, "event_mark" : "2015-06-28 14:25:00", "event_text" : "system stop" }],
"plots" : [
{
"plot_type" : "well_log",
"out_plot" : "ALL_BANDS",
"plot_list" : [
{ "title_text" : "Band 0", "data" : "band_0-nyquist.npy", "colormap" : "UAVK" },
{ "title_text" : "Band 1", "data" : "band_01_200--1.npy", "colormap" : "magma" },
{ "title_text" : "Band 2", "data" : "band_02_600--1.npy", "colormap" : "cividis" },
{ "title_text" : "Band 3", "data" : "band_03_1000--1.npy", "colormap" : "plasma" },
},
{
"plot_type" : "stack",
"out_plot" : "STACK",
"plot_list" : [
{ "title_text" : "Band 1", "data" : "band_01_200--1.npy", "colormap" : "viridis", "alpha" : "0.25"},
{ "title_text" : "Band 2", "data" : "band_02_600--1.npy", "colormap" : "plasma", "alpha" : "0.25"},
{ "title_text" : "Band 3", "data" : "band_03_1000--1.npy", "colormap" : "inferno", "alpha" : "0.25"} ]
},
{
"plot_type" : "rgb",
"out_plot" : "RGB",
"plot_list" : [
{ "title_text" : "Band 1", "data" : "band_01_200--1.npy", "inverted" : "yes"},
{ "title_text" : "Band 2", "data" : "band_02_600--1.npy", "inverted" : "yes"},
{ "title_text" : "Band 4", "data" : "band_04_1000--1.npy", "inverted" : "yes"} ]
}
]
}
The "figure_size"
and "dpi"
control the resolution and are passed through to matplotlib. The "start_of_fibre"
and "end_of_fibre"
allow the data to be clipped for plotting using the "depth_reference"
. The "well_segments"
file contains a list of significant zones on the well to highlight (for example perforated intervals). The form of the "well_segments"
file is:
MD Zone
ft
100.0 1
150.0
200.0 2
250.0
300.0 3
350.0
So here there are 3 Zones, with the upper and lower measured depths given in feet. Unit conversion from feet to metres is handled only in the plotting part of distpy; in the signal processing the user must ensure consistent use of units.
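A hypothetical reader for this two-column format, including the feet-to-metres conversion that, as noted, belongs on the plotting side:

def read_segments(path):
    # Hypothetical parser for the well_segments file shown above:
    # a header row, a unit row, then alternating zone-top / zone-bottom rows.
    with open(path) as f:
        rows = [line.split() for line in f if line.strip()]
    unit = rows[1][0]                       # 'ft' in this example
    to_m = 0.3048 if unit == 'ft' else 1.0  # convert for display only
    zones = []
    for top, bottom in zip(rows[2::2], rows[3::2]):
        zones.append((int(top[1]), float(top[0]) * to_m, float(bottom[0]) * to_m))
    return zones

print(read_segments('/home/projects/jobInfo/segments.txt'))
# [(1, 30.48, 45.72), (2, 60.96, 76.2), (3, 91.44, 106.68)]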
The "label_list"
is a set of plot annotations described by times of interest, together with the well operation that occurred. For example:
'''
{"event_label_start" : 200, "event_mark" : "2015-06-28 13:20:00", "event_text" : "Well Shut in" },
'''
can be translated as:
DRAW A LINE ON THE PLOT AT THE DATESTAMP "2015-06-28 13:20:00"
NEXT TO THIS LINE AT THE WELLBORE DEPTH OF 200 IN THE DISPLAY UNITS,
DISPLAY THE TEXT "Well Shut in"
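In matplotlib terms each entry amounts to a vertical line at the timestamp plus a text label at the given depth. A sketch of the equivalent calls (not distpy's internal code):

from datetime import datetime
import matplotlib.dates as mdates
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# ... waterfall plotted here, with a date-based x axis ...

event = datetime.strptime('2015-06-28 13:20:00', '%Y-%m-%d %H:%M:%S')
x = mdates.date2num(event)               # matplotlib's numeric date value
ax.axvline(x, color='white')             # the line at "event_mark"
ax.text(x, 200, 'Well Shut in',          # "event_text" at "event_label_start"
        color='white', rotation=90)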
The list of "plots"
contains 3 plots. Each plot has an "out_plot"
name prefix (the plot-type is currently always PNG so the suffix is .png); a "plot_type"
which is "well_log"
for waterfall plots, "stack"
for stacked false-colour or "RGB"
for RGB false colour.
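For the "well_log" type, a waterfall plot is essentially matplotlib's imshow over one band array, with "figure_size" and "dpi" passed through; a minimal sketch under that assumption (distpy itself also applies the time and depth references to the axes):

import numpy as np
import matplotlib.pyplot as plt

band = np.load('band_01_200-nyquist.npy')     # one output attribute

fig, ax = plt.subplots(figsize=(36, 12), dpi=100)
img = ax.imshow(band, aspect='auto', cmap='magma')
ax.set_xlabel('channel (see measured_depth.npy)')
ax.set_ylabel('time sample (see time.npy)')
fig.colorbar(img, ax=ax)
fig.savefig('ALL_BANDS.png')                  # "out_plot" prefix + .png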
False colour images
The "stack"
option creates a composite by overlaying all of the plots in the list, with blending controlled by the '"alpha"` transparency.
The "RGB"
option creates a composite of 3 plots by assigning them respectively to the red, green and blue channels. The default is to use intensity scaled 0-1 across the image range; by settingin "inverted"
to "yes"
the intensity scale is inverted so that intensity (0,0,0) corresponds to white (i.e. RGB=(255,255,255)).
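A sketch of this compositing, assuming three equally-shaped 2D arrays; the 0-1 scaling and the "inverted" behaviour are reproduced as described above, though distpy's own scaling may differ in detail:

import numpy as np
import matplotlib.pyplot as plt

def channel(data, inverted=True):
    # Scale to 0-1 across the image range; inverting maps zero
    # intensity to 1.0, so (0,0,0) renders as white.
    scaled = (data - data.min()) / (data.max() - data.min())
    return 1.0 - scaled if inverted else scaled

r = channel(np.load('band_01_200-nyquist.npy'))
g = channel(np.load('band_02_600-nyquist.npy'))
b = channel(np.load('band_03_1000-nyquist.npy'))

rgb = np.dstack([r, g, b])   # shape (ny, nx, 3), values in 0-1
plt.imshow(rgb, aspect='auto')
plt.savefig('RGB.png')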