Analysis concepts

Our analysis system is designed to use a PBS cluster to execute jobs.

The following is an overview of how path mappings work between host nodes, servers, and containers.

In this scenario there are four filesystems we have to contend with:

  1. Host: The host for the baw-server
    • we assume the filesystems of the web servers and the workers are functionally equivalent
  2. App: The container for the baw-server
  3. Cluster: The PBS file system
    • We assume the filesystems on the head node and the worker nodes in the cluster are functionally equivalent
  4. Analysis: The file system of whatever analysis container is run (if used); labelled Container in the mappings below

How these various paths and working directories map between these filesystems is the subject of this article.

We'll work with some sample data:

  • Instance: ecosounds
  • API URL: api.ecosounds.org
  • Analysis Job Id: 1
  • Analysis Job Item Id: 7890
  • Audio Recording ID: 1234
  • Audio Recording UUID: abcdef01-23456789a-bcde-f1234567890a
  • Audio Recording recorded at: 2022-12-15T09:30:00+10:00
  • Site: SERF NE

The source path

The source path is the path to the file to be analyzed. Even if the original audio path is available across filesystems, we won't use it because:

  • we want nice names (not GUIDs)
  • we want our download counter to increment
  • we don't want to expose writable access to the original audio directory
  • we're more flexible in the long run if we decouple our file systems

Mapping

  • Host: N/A
  • App: N/A
  • Cluster: $TMPDIR/source/20221215T093000+1000_SERF_NE_1234.wav
    • e.g. /data1/pbs.3329590.pbs/source/20221215T093000+1000_SERF_NE_1234.wav
    • downloaded on job start with something like curl -OJ https://api.ecosounds.org/audio_recordings/1234/original (see the sketch below)
    • deleted at end of job
  • Container: /data/source/20221215T093000+1000_SERF_NE_1234.wav via a bind mount from $TMPDIR/source/ to /data/source
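
A minimal sketch of that download step, assuming PBS's per-job $TMPDIR; flags, authentication, and error handling are illustrative only:

```bash
# Create the per-job source directory under PBS's scratch space,
# then fetch the recording. -O saves to a file and -J prefers the
# Content-Disposition name, giving us the nice name, not the UUID.
mkdir -p "$TMPDIR/source"
cd "$TMPDIR/source"
curl -sfOJ "https://api.ecosounds.org/audio_recordings/1234/original"
```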

The output directory

The output directory is the folder where we want analysis results to go. Unlike the source, nothing is downloaded: the directory is mapped directly onto the shared file system.

Mapping

  • Host: {root}/analysis_results/{analysis_job_id}/{first_two_uuid}/{uuid}/
    • example: /work/a2o/analysis_results/1/ab/abcdef01-23456789a-bcde-f1234567890a/
  • App: {root}/analysis_results/{analysis_job_id}/{first_two_uuid}/{uuid}/
    • example: /data/analysis_results/1/ab/abcdef01-23456789a-bcde-f1234567890a/
  • Cluster: {root}/analysis_results/{analysis_job_id}/{first_two_uuid}/{uuid}/
    • example: /work/a2o/analysis_results/1/ab/abcdef01-23456789a-bcde-f1234567890a/
  • Container: /data/output via a bind mount from the cluster output directory to /data/output
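
In shell terms the path template expands like this (the variable names are illustrative):

```bash
# {first_two_uuid} is simply the first two characters of the UUID.
root=/work/a2o
analysis_job_id=1
uuid=abcdef01-23456789a-bcde-f1234567890a
output_dir="$root/analysis_results/$analysis_job_id/${uuid:0:2}/$uuid/"
echo "$output_dir"
# /work/a2o/analysis_results/1/ab/abcdef01-23456789a-bcde-f1234567890a/
```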

The config path

We allow each analysis to be customized with a configuration file. As with the audio recording, we download the config file when needed.

Mapping

  • Host: N/A
  • App: N/A
  • Cluster: $TMPDIR/config/{config_file}
    • example: /data1/pbs.3329590.pbs/config/Towsey.Acoustic.yml
    • downloaded on job start with something like curl https://api.ecosounds.org/analysis_jobs/1 | jq ... > Towsey.Acoustic.yml (expanded in the sketch below)
    • deleted at end of job
  • Container: /data/config/Towsey.Acoustic.yml via a bind mount from $TMPDIR/config/ to /data/config
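
Expanded, that fetch might look like the following sketch; the jq filter (.data.custom_settings) is an assumption about the shape of the API response:

```bash
# Fetch the analysis job definition and write the config file body
# into the per-job config directory.
mkdir -p "$TMPDIR/config"
curl -sf "https://api.ecosounds.org/analysis_jobs/1" \
  | jq -r '.data.custom_settings' \
  > "$TMPDIR/config/Towsey.Acoustic.yml"
```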

The tmp directory

A dedicated space to write temporary files. Deleted after the job is run.

Mapping

  • Host: N/A
  • App: N/A
  • Cluster: $TMPDIR/tmp
    • example: /data1/pbs.3329590.pbs/tmp
  • Container: /data/tmp via a bind mount from $TMPDIR/tmp to /data/tmp
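
Since the source, config, and tmp directories all live under $TMPDIR, the job script can lay out the whole scratch area in one step, as in this sketch:

```bash
# Create the per-job scratch layout before downloading any inputs.
mkdir -p "$TMPDIR"/{source,config,tmp}
```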

The working directory

The working directory (pwd) for processes running in each context.

Mapping

  • Host: N/A
  • App: N/A
  • Cluster: {root}/analysis_results/{analysis_job_id}/{first_two_uuid}/{uuid}/
    • example: /work/a2o/analysis_results/1/ab/abcdef01-23456789a-bcde-f1234567890a/
    • Default working directory for PBS jobs is $PBS_JOBDIR, e.g. /home/ecosounds
    • We assume our job file is in the output directory and that $PBS_O_WORKDIR aligns with the output directory
    • We need to cd $PBS_O_WORKDIR as soon as the job starts (see the sketch after this list)
  • Container: /data/output
    • The container will likely run some other executable, and that should be done with pwd set to /data/output
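
Putting those rules together, the top of the job script might look like this sketch. The container runtime (Singularity/Apptainer) and the image name (analysis.sif) are assumptions; Docker would use equivalent -v and -w flags:

```bash
# PBS starts the job in $PBS_JOBDIR (the home directory), so move to
# the output directory (where qsub was run) before anything else.
cd "$PBS_O_WORKDIR" || exit 1

# Launch the analysis container with each file system mapped as above,
# and with the container's pwd set to the output directory.
singularity run \
  --bind "$TMPDIR/source:/data/source" \
  --bind "$TMPDIR/config:/data/config" \
  --bind "$TMPDIR/tmp:/data/tmp" \
  --bind "$PBS_O_WORKDIR:/data/output" \
  --pwd /data/output \
  analysis.sif
```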

The job file

The templated shell script that PBS executes.

The script should be hidden so it is not returned via API directory listings of results.

There's also no need to give the script an extension.

Mapping

  • Host: {root}/analysis_results/{analysis_job_id}/{first_two_uuid}/{uuid}/.{job_name}
    • example: /work/a2o/analysis_results/1/ab/abcdef01-23456789a-bcde-f1234567890a/.7890
  • App: {root}/analysis_results/{analysis_job_id}/{first_two_uuid}/{uuid}/.{job_name}
    • example: /data/analysis_results/1/ab/abcdef01-23456789a-bcde-f1234567890a/.7890
  • Cluster: {root}/analysis_results/{analysis_job_id}/{first_two_uuid}/{uuid}/.{job_name}
    • example: /work/a2o/analysis_results/1/ab/abcdef01-23456789a-bcde-f1234567890a/.7890
  • Container: N/A
    • there's no need for the container to see the job script that is running it
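
A hypothetical submission from the output directory (the exact qsub invocation may differ):

```bash
# Submit the hidden job script; -N gives the job a visible name even
# though the script file itself is a hidden dot-file.
cd /work/a2o/analysis_results/1/ab/abcdef01-23456789a-bcde-f1234567890a/
qsub -N 7890 .7890
```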

The job logs

PBS automatically writes log files for each job it runs.

We'll merge stdout and stderr into one file (with the -j oe PBS option).
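
In the job script this can be expressed as a directive near the top (it could equally be passed as a qsub flag):

```bash
# Merge stderr into stdout so PBS writes one log file per job.
#PBS -j oe
```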

Mapping

  • Host: {root}/analysis_results/{analysis_job_id}/{first_two_uuid}/{uuid}/.{job_name}.o.{job_id}
    • example: /work/a2o/analysis_results/1/ab/abcdef01-23456789a-bcde-f1234567890a/.7890.o.3329590
  • App: {root}/analysis_results/{analysis_job_id}/{first_two_uuid}/{uuid}/.{job_name}.o.{job_id}
    • example: /data/analysis_results/1/ab/abcdef01-23456789a-bcde-f1234567890a/.7890.o.3329590
  • Cluster: {root}/analysis_results/{analysis_job_id}/{first_two_uuid}/{uuid}/.{job_name}.o.{job_id}
    • example: /work/a2o/analysis_results/1/ab/abcdef01-23456789a-bcde-f1234567890a/.7890.o.3329590
    • Note though: our job will never see this file; it is written by PBS after job completion
  • Container: N/A
    • the container will never see the job log.