Basic principles of DAS interpretation using distpy - Schlumberger/distpy GitHub Wiki
General concepts
distpy workflows begin by ingesting data from an external source into separate data chunks. These are stored as saved numpy matrices with (along-fibre, time) ordering. The name of each file is a unix-timestamp integer. Typically, rapid prototyping of new interpretations begins from 1-second chunks of strain-rate data, but this is not imposed.
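The snippet below is a minimal sketch of that chunk convention, assuming a .npy container; the array shape, sampling rate and target directory are illustrative only and are not prescribed by distpy.

```python
import time
import numpy as np

# Minimal sketch of a 1-second strain-rate chunk: a numpy matrix in
# (along-fibre, time) ordering, written to a file named by its unix timestamp.
# The shape, sampling rate and directory below are illustrative only.
n_depth = 5000                 # depth points along the fibre
samples_per_second = 10000     # time samples in a 1-second chunk
chunk = np.random.randn(n_depth, samples_per_second).astype(np.float32)

timestamp = int(time.time())
np.save(f"/scratch/myProject/data/{timestamp}.npy", chunk)

# Reading the chunk back preserves the (along-fibre, time) layout
data = np.load(f"/scratch/myProject/data/{timestamp}.npy")
assert data.shape == (n_depth, samples_per_second)
```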
A JSON configuration file, describing a directed-graph network of signal processing steps, is applied to every data chunk. Because each chunk is treated independently, processing is massively parallel and fully scalable. With DAS the ability to scale is a requirement because of the very large data volumes (of the order of 0.5 Gb per second on a 5 km section of fibre). The signal processing will generally both manipulate the data to recover key information and compress the result as a characteristic attribute or perceptual hash. In distpy these characteristic attributes and perceptual hashes usually recover a single value per depth point (or fewer if downsampling was part of the directed-graph).
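As a rough illustration of such a configuration, the fragment below builds and writes a small directed graph with Python's json module; the key names ("command_list", "name", "uid", "in_uid"), the step names and the file name are hypothetical placeholders, not distpy's actual JSON schema.

```python
import json

# Hypothetical sketch of a directed-graph configuration: each node names a
# processing step, and "in_uid" points at the node whose output it consumes.
# None of these keys or step names reproduce distpy's real schema.
directed_graph = {
    "command_list": [
        {"name": "fft",       "uid": 1, "in_uid": 0},   # transform each 1-second chunk
        {"name": "band_rms",  "uid": 2, "in_uid": 1,
         "low_freq": 10, "high_freq": 100},             # compress to band-limited energy
        {"name": "write_npy", "uid": 3, "in_uid": 2}    # one attribute value per depth point
    ]
}

with open("noise_log_graph.json", "w") as f:
    json.dump(directed_graph, f, indent=4)
```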
A second JSON configuration file sets up the system and offers optimization options such as setting the target number of cpus (or threads). Separating the system configuration in this way makes the directed-graphs more transportable, so that a directed-graph designed on a laptop with, say, 30 seconds of data can also be used at full scale and in real time on 24/7 continuous recording systems.
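A corresponding system configuration might look like the following sketch; again the keys (cpu count, drive locations, project name) are illustrative assumptions rather than distpy's documented schema.

```python
import json

# Hypothetical system configuration, kept separate from the directed-graph
# so the same graph can run on a laptop or a 24/7 system.
# The keys below are illustrative assumptions, not distpy's documented schema.
system_config = {
    "NCPU": 4,                      # target number of cpus (or threads)
    "in_drive": "/scratch/",        # processing space holding ingested chunks
    "out_drive": "/scratch/",       # output space for results
    "project": "myProject"
}

with open("system_config.json", "w") as f:
    json.dump(system_config, f, indent=4)
```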
The project data structure
A key concept in distpy is that processing is captured in projects. Conceptually the project can be thought of as a directory name. This directory name is used in the archive, in the processing space, and in the output space.
For example, if we are working on Linux using an on-prem filestore, the directories for the project myProject might be:
/archive/myProject/segy - contains the strain-rate data in SEGY format
/scratch/myProject/data - contains results of ingesting data for fast processing
/scratch/myProject/results/NoiseLog - contains results of constructing Noise Logging FBE data
Whereas on a PC with an external E: SSD drive, the same project might have:
E:\myProject\segy - contains the strain-rate data in SEGY format
C:\NotBackedUp\myProject\data - contains results of ingesting data for fast processing
C:\NotBackedUp\myProject\results\NoiseLog - contains results of constructing Noise Logging FBE data
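The helper below is a hypothetical illustration (not part of distpy) of how the same project name maps onto the archive, processing and output spaces on either platform; only the directory roles come from the examples above.

```python
import os

# Hypothetical helper that reproduces the project layout shown above
# for a given archive root and scratch root.
def project_dirs(archive_root, scratch_root, project="myProject"):
    return {
        "segy":    os.path.join(archive_root, project, "segy"),                  # raw strain-rate SEGY
        "data":    os.path.join(scratch_root, project, "data"),                  # ingested chunks
        "results": os.path.join(scratch_root, project, "results", "NoiseLog"),   # Noise Logging FBE output
    }

# Linux, on-prem filestore
print(project_dirs("/archive", "/scratch"))
# Windows, external E: SSD plus local scratch (run on Windows for native separators)
print(project_dirs("E:\\", "C:\\NotBackedUp"))
```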
The project data pipeline
Every distpy project can be considered a subset of a single end-to-end pipeline.
We have ingest-process-ingest-process, which is characterized by the CASE00.py example script. In the first instance we ingest strain-rate data to a suitable (generally 1-second) chunk size. Then we process each chunk of data independently, allowing massively parallel asynchronous compute. That processing produces many possible summaries, which typically, but not always, are attributes with a single value at each point along the fibre. At this stage those results are separate, so whereas the initial ingest often takes multi-second datasets and splits them into many 1-second data chunks, the second ingest step will take several hours of 1-second attributes and collect them into a single data chunk. That collected data chunk can then be further processed, often using signal processing techniques similar to those used in the previous step.
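The loop below sketches that ingest-process-ingest-process pattern in plain numpy, in the spirit of CASE00.py; the reduction used here (an RMS per depth point), the function names and the paths are illustrative, and no distpy API calls are shown.

```python
import glob
import numpy as np

def process_chunk(chunk):
    # Stand-in for evaluating a directed graph on one (along-fibre, time)
    # chunk: reduce it to a single attribute value per depth point.
    return np.sqrt(np.mean(chunk * chunk, axis=1))

# First pass: process every ingested 1-second chunk independently
# (this loop is what a massively parallel scheduler would distribute).
attributes = {}
for fname in sorted(glob.glob("/scratch/myProject/data/*.npy")):
    attributes[fname] = process_chunk(np.load(fname))

# Second ingest: collect hours of per-chunk attributes into one matrix
# with (along-fibre, chunk-time) ordering, ready for further processing.
collected = np.stack([attributes[k] for k in sorted(attributes)], axis=1)
np.save("/scratch/myProject/results/NoiseLog/collected.npy", collected)
```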
What distpy is not
distpy is not a high-performance implementation. Enterprise implementations that target particular cloud architectures can consume the same directed-graph JSON files.
What distpy is
distpy is suitable for rapid prototyping of new workflows; it is flexibly deployable across Linux, Windows and cloud environments; it is suitable for research into DAS; and it is available under the permissive MIT License.