# Analysis concepts
Our analysis system is designed to use a PBS cluster to execute jobs.
The following is an overview of how path mappings work between host nodes, servers, and containers.
In this scenario there are four filesystems we have to contend with:
- Host: the host for the baw-server
  - we assume the filesystems of the web servers and the workers are functionally equivalent
- App: the container for the baw-server
- Cluster: the PBS file system
  - we assume the filesystems on the head node and the worker nodes in the cluster are functionally equivalent
- Analysis: the file system of whatever analysis container is run (if used)
How these various paths and working directories map between these filesystems is the subject of this article.
We'll work with some sample data:
- Instance: ecosounds
- API URL: api.ecosounds.org
- Analysis Job Id: 1
- Analysis Job Item Id: 7890
- Audio Recording ID: 1234
- Audio Recording UUID: abcdef01-2345-6789-abcd-ef1234567890
- Audio Recording recorded at: 2022-12-15T09:30:00+10:00
- Site: SERF NE
## Source path

The source path is the path to the file to be analyzed. Even if the original audio path is available across filesystems, we won't use it because:
- we want nice names (not guids)
- we want our download counter to increment
- we don't want to expose writeable access to original audio directory
- we're more flexible in the long run if we decouple our file systems
### Mapping
- Host: N/A
- App: N/A
- Cluster: `$TMPDIR/source/20221215T093000+1000_SERF_NE_1234.wav`
  - e.g. `/data1/pbs.3329590.pbs/source/20221215T093000+1000_SERF_NE_1234.wav`
  - downloaded on job start with something like `curl -OJ https://api.ecosounds.org/audio_recordings/1234/original` (see the sketch after this list)
  - deleted at end of job
- Container: `/data/source/20221215T093000+1000_SERF_NE_1234.wav`
  - via a bind mount from `$TMPDIR/source/` to `/data/source`
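A minimal sketch of that download step, assuming the job script runs on the cluster with `$TMPDIR` already set by PBS; `AUTH_TOKEN` is a hypothetical placeholder for whatever credential the job template injects:

```bash
# Fetch the source audio into $TMPDIR/source at job start.
mkdir -p "$TMPDIR/source"
cd "$TMPDIR/source"

# -O -J saves the file under the server-supplied (nice) name rather than the
# UUID; going through the API also increments the download counter.
# AUTH_TOKEN is a hypothetical placeholder for the credential the template injects.
curl --fail --location -O -J \
  --header "Authorization: Token token=\"$AUTH_TOKEN\"" \
  "https://api.ecosounds.org/audio_recordings/1234/original"
```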
## Output directory

The output directory is the folder where we want analysis results to go. Unlike the source, it is mapped directly to the file system.
### Mapping
- Host: `{root}/analysis_results/{analysis_job_id}/{first_two_uuid}/{uuid}/`
  - example: `/work/a2o/analysis_results/1/ab/abcdef01-2345-6789-abcd-ef1234567890/`
- App: `{root}/analysis_results/{analysis_job_id}/{first_two_uuid}/{uuid}/`
  - example: `/data/analysis_results/1/ab/abcdef01-2345-6789-abcd-ef1234567890/`
- Cluster: `{root}/analysis_results/{analysis_job_id}/{first_two_uuid}/{uuid}/`
  - example: `/work/a2o/analysis_results/1/ab/abcdef01-2345-6789-abcd-ef1234567890/`
- Container: `/data/output`
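As a worked example, here's the template expanded with the sample data in shell (the `/work/a2o` root is taken from the examples above):

```bash
# Expand the output directory template using the sample data.
root="/work/a2o"
analysis_job_id="1"
uuid="abcdef01-2345-6789-abcd-ef1234567890"

# {first_two_uuid} is just the first two characters of the UUID.
output_dir="${root}/analysis_results/${analysis_job_id}/${uuid:0:2}/${uuid}/"
mkdir -p "$output_dir"
echo "$output_dir"
# => /work/a2o/analysis_results/1/ab/abcdef01-2345-6789-abcd-ef1234567890/
```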
## Config path

We allow each analysis to be customized with a configuration file. As with the audio recording, we download the config file when needed.
### Mapping
- Host: N/A
- App: N/A
- Cluster: `$TMPDIR/config/{config_file}`
  - example: `/data1/pbs.3329590.pbs/config/Towsey.Acoustic.yml`
  - downloaded on job start with something like `curl https://api.ecosounds.org/analysis_jobs/1 | jq ... > Towsey.Acoustic.yml` (see the sketch after this list)
  - deleted at end of job
- Container: `/data/config/Towsey.Acoustic.yml`
  - via a bind mount from `$TMPDIR/config/` to `/data/config`
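A minimal sketch of the config download; the `.data.custom_settings` jq filter is a hypothetical placeholder, since the exact shape of the analysis_jobs response isn't specified here:

```bash
# Fetch the analysis config into $TMPDIR/config at job start.
# The jq filter is a hypothetical placeholder for wherever the config
# body lives in the analysis_jobs response.
mkdir -p "$TMPDIR/config"
curl --fail --silent "https://api.ecosounds.org/analysis_jobs/1" \
  | jq --raw-output '.data.custom_settings' \
  > "$TMPDIR/config/Towsey.Acoustic.yml"
```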
## Tmp directory

A dedicated space to write temporary files. Deleted after the job is run.
### Mapping
- Host: N/A
- App: N/A
- Cluster: `$TMPDIR/tmp`
  - example: `/data1/pbs.3329590.pbs/tmp`
- Container: `/data/tmp`
  - via a bind mount from `$TMPDIR/tmp` to `/data/tmp` (the sketch after this list gathers all the bind mounts)
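Putting the three `$TMPDIR` mounts together with the output directory, the container launch might look like the following sketch, assuming a Singularity/Apptainer analysis container (the image name and analysis command are placeholders):

```bash
# Bind the job-local directories into the analysis container at the paths
# described above. $output_dir is the expanded output directory template from
# the earlier sketch; analysis.sif and some-analysis are placeholders.
singularity exec \
  --bind "$TMPDIR/source:/data/source" \
  --bind "$TMPDIR/config:/data/config" \
  --bind "$TMPDIR/tmp:/data/tmp" \
  --bind "$output_dir:/data/output" \
  analysis.sif \
  some-analysis --config /data/config/Towsey.Acoustic.yml
```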
## The working directory

The `pwd` for processes running in each context.
### Mapping
- Host: N/A
- App: N/A
- Cluster: `{root}/analysis_results/{analysis_job_id}/{first_two_uuid}/{uuid}/`
  - example: `/work/a2o/analysis_results/1/ab/abcdef01-2345-6789-abcd-ef1234567890/`
  - The default working directory for PBS jobs is `$PBS_JOBDIR`, e.g. `/home/ecosounds`
  - We assume our job file is in the output directory and that `$PBS_O_WORKDIR` aligns with the output directory
  - We need to `cd $PBS_O_WORKDIR` as soon as the job starts (see the sketch after this list)
- Container: `/data/output`
  - The container will likely run some other executable, and that should be done with `pwd` set to `/data/output`
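A sketch of how the job script might pin both working directories, assuming Singularity/Apptainer's `--pwd` flag for the container side:

```bash
# First thing in the job script: move from $PBS_JOBDIR (typically the user's
# home directory) into the output directory the job was submitted from.
cd "$PBS_O_WORKDIR"

# And when launching the analysis container, pin its working directory too,
# e.g. with Singularity/Apptainer:
#   singularity exec --pwd /data/output ... analysis.sif some-analysis
```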
## The job file

The templated shell script.
The script should be hidden (dot-prefixed) so it is not returned via API directory listings of results.
There's also no need to give the script an extension.
### Mapping
- Host: `{root}/analysis_results/{analysis_job_id}/{first_two_uuid}/{uuid}/.{job_name}`
  - example: `/work/a2o/analysis_results/1/ab/abcdef01-2345-6789-abcd-ef1234567890/.7890`
- App: `{root}/analysis_results/{analysis_job_id}/{first_two_uuid}/{uuid}/.{job_name}`
  - example: `/data/analysis_results/1/ab/abcdef01-2345-6789-abcd-ef1234567890/.7890`
- Cluster: `{root}/analysis_results/{analysis_job_id}/{first_two_uuid}/{uuid}/.{job_name}`
  - example: `/work/a2o/analysis_results/1/ab/abcdef01-2345-6789-abcd-ef1234567890/.7890`
- Container: N/A
  - there's no need for the container to see the job script that is running it
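To tie the pieces above together, here's a hypothetical skeleton of the hidden job file (`.7890`); the directives and steps are illustrative, not the real template baw-server generates:

```bash
#!/bin/bash
#PBS -j oe

# Start in the output directory (PBS starts jobs in $PBS_JOBDIR by default).
cd "$PBS_O_WORKDIR"

# Stage inputs into the job-local scratch space.
mkdir -p "$TMPDIR/source" "$TMPDIR/config" "$TMPDIR/tmp"
# ... download the source audio and config as sketched above ...

# Run the analysis container with the bind mounts described above.
# ... singularity exec --pwd /data/output --bind ... analysis.sif some-analysis ...

# No cleanup needed here: PBS removes $TMPDIR (source, config, tmp) at job end.
```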
## The job logs

PBS automatically outputs logs for a run job.
We'll merge stdout and stderr into one file (with the `-j oe` PBS option).
### Mapping
- Host: `{root}/analysis_results/{analysis_job_id}/{first_two_uuid}/{uuid}/.{job_name}.o{job_id}`
  - example: `/work/a2o/analysis_results/1/ab/abcdef01-2345-6789-abcd-ef1234567890/.7890.o3329590`
- App: `{root}/analysis_results/{analysis_job_id}/{first_two_uuid}/{uuid}/.{job_name}.o{job_id}`
  - example: `/data/analysis_results/1/ab/abcdef01-2345-6789-abcd-ef1234567890/.7890.o3329590`
- Cluster: `{root}/analysis_results/{analysis_job_id}/{first_two_uuid}/{uuid}/.{job_name}.o{job_id}`
  - example: `/work/a2o/analysis_results/1/ab/abcdef01-2345-6789-abcd-ef1234567890/.7890.o3329590`
  - Note though: our job will never see this file; it is written by PBS after job completion
- Container: N/A
  - the container will never see the job log
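Given the default PBS log naming and the sample data, the output directory should end up looking something like this once the job completes (a sketch; actual contents depend on the analysis):

```bash
ls -a /work/a2o/analysis_results/1/ab/abcdef01-2345-6789-abcd-ef1234567890/
# .7890            <- the hidden job file
# .7890.o3329590   <- merged stdout+stderr, written by PBS after the job ends
# ...              <- plus whatever results the analysis wrote to /data/output
```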