Analysis concepts

Our analysis system is designed to use a PBS cluster to execute jobs.

The following is an overview of how path mappings work between host nodes, servers, and containers.

In this scenario there are four filesystems we have to contend with:

  1. Host: The host for the baw-server
    • we assume the filesystems of the web servers and the workers are functionally equivalent
  2. App: The container for the baw-server
  3. Cluster: The PBS file system
    • We assume the filesystems on the head node and the worker nodes in the cluster are functionally equivalent
  4. Analysis: The file system of whatever analysis container is run (if used); labelled Container in the mappings below

How these various paths and working directories map between these filesystems is the subject of this article.

We'll work with some sample data:

  • Instance: ecosounds
  • API URL: api.ecosounds.org
  • Analysis Job Id: 1
  • Analysis Job Item Id: 7890
  • Audio Recording ID: 1234
  • Audio Recording UUID: abcdef01-23456789a-bcde-f1234567890a
  • Audio Recording recorded at: 2022-12-15T09:30:00+10:00
  • Site: SERF NE

The source path

The source path is the path to the file to be analyzed. Even if the original audio path is available across filesystems, we won't use it because:

  • we want nice names (not GUIDs)
  • we want our download counter to increment
  • we don't want to expose writable access to the original audio directory
  • we're more flexible in the long run if we decouple our file systems

Mapping

  • Host: N/A
  • App: N/A
  • Cluster: $TMPDIR/source/20221215T093000+1000_SERF_NE_1234.wav
    • e.g. /data1/pbs.3329590.pbs/source/20221215T093000+1000_SERF_NE_1234.wav
    • downloaded on job start with something like curl -OJ https://api.ecosounds.org/audio_recordings/1234/original (see the sketch below)
    • deleted at end of job
  • Container: /data/source/20221215T093000+1000_SERF_NE_1234.wav via a bind mount from $TMPDIR/source/ to /data/source
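
A minimal sketch of that download step, assuming PBS's per-job $TMPDIR; flags, authentication, and error handling are illustrative only:

```bash
# Create the per-job source directory under PBS's scratch space,
# then fetch the recording. -O saves to a file and -J prefers the
# Content-Disposition name, giving us the nice name, not the UUID.
mkdir -p "$TMPDIR/source"
cd "$TMPDIR/source"
curl -sfOJ "https://api.ecosounds.org/audio_recordings/1234/original"
```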

The output directory

The output directory is the folder where we want analysis results to go. Unlike the source, nothing is downloaded: the directory is mapped directly onto the shared file system.

Mapping

  • Host: {root}/analysis_results/{analysis_job_id}/{first_two_uuid}/{uuid}/
    • example: /work/a2o/analysis_results/1/ab/abcdef01-23456789a-bcde-f1234567890a/
  • App: {root}/analysis_results/{analysis_job_id}/{first_two_uuid}/{uuid}/
    • example: /data/analysis_results/1/ab/abcdef01-23456789a-bcde-f1234567890a/
  • Cluster: {root}/analysis_results/{analysis_job_id}/{first_two_uuid}/{uuid}/
    • example: /work/a2o/analysis_results/1/ab/abcdef01-23456789a-bcde-f1234567890a/
  • Container: /data/output via a bind mount from the cluster output directory to /data/output
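
In shell terms the path template expands like this (the variable names are illustrative):

```bash
# {first_two_uuid} is simply the first two characters of the UUID.
root=/work/a2o
analysis_job_id=1
uuid=abcdef01-23456789a-bcde-f1234567890a
output_dir="$root/analysis_results/$analysis_job_id/${uuid:0:2}/$uuid/"
echo "$output_dir"
# /work/a2o/analysis_results/1/ab/abcdef01-23456789a-bcde-f1234567890a/
```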

The config path

We allow each analysis to be customized with a configuration file. As with the audio recording, we download the config file when needed.

Mapping

  • Host: N/A
  • App: N/A
  • Cluster: $TMPDIR/config/{config_file}
    • example: /data1/pbs.3329590.pbs/config/Towsey.Acoustic.yml
    • downloaded on job start with something like curl https://api.ecosounds.org/analysis_jobs/1 | jq ... > Towsey.Acoustic.yml (expanded in the sketch below)
    • deleted at end of job
  • Container: /data/config/Towsey.Acoustic.yml via a bind mount from $TMPDIR/config/ to /data/config
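
Expanded, that fetch might look like the following sketch; the jq filter (.data.custom_settings) is an assumption about the shape of the API response:

```bash
# Fetch the analysis job definition and write the config file body
# into the per-job config directory.
mkdir -p "$TMPDIR/config"
curl -sf "https://api.ecosounds.org/analysis_jobs/1" \
  | jq -r '.data.custom_settings' \
  > "$TMPDIR/config/Towsey.Acoustic.yml"
```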

The tmp directory

A dedicated space to write temporary files. Deleted after the job is run.

Mapping

  • Host: N/A
  • App: N/A
  • Cluster: $TMPDIR/tmp
    • example: /data1/pbs.3329590.pbs/tmp
  • Container: /data/tmp via a bind mount from $TMPDIR/tmp to /data/tmp
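
Since the source, config, and tmp directories all live under $TMPDIR, the job script can lay out the whole scratch area in one step, as in this sketch:

```bash
# Create the per-job scratch layout before downloading any inputs.
mkdir -p "$TMPDIR"/{source,config,tmp}
```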

The working directory

The working directory (pwd) for processes running in each context.

Mapping

  • Host: N/A
  • App: N/A
  • Cluster: {root}/analysis_results/{analysis_job_id}/{first_two_uuid}/{uuid}/
    • example: /work/a2o/analysis_results/1/ab/abcdef01-23456789a-bcde-f1234567890a/
    • Default working directory for PBS jobs is $PBS_JOBDIR, e.g. /home/ecosounds
    • We assume our job file is in the output directory and that $PBS_O_WORKDIR aligns with the output directory
    • We need to cd $PBS_O_WORKDIR as soon as the job starts (see the sketch after this list)
  • Container: /data/output
    • The container will likely run some other executable, and that should be done with pwd set to /data/output
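
Putting those rules together, the top of the job script might look like this sketch. The container runtime (Singularity/Apptainer) and the image name (analysis.sif) are assumptions; Docker would use equivalent -v and -w flags:

```bash
# PBS starts the job in $PBS_JOBDIR (the home directory), so move to
# the output directory (where qsub was run) before anything else.
cd "$PBS_O_WORKDIR" || exit 1

# Launch the analysis container with each file system mapped as above,
# and with the container's pwd set to the output directory.
singularity run \
  --bind "$TMPDIR/source:/data/source" \
  --bind "$TMPDIR/config:/data/config" \
  --bind "$TMPDIR/tmp:/data/tmp" \
  --bind "$PBS_O_WORKDIR:/data/output" \
  --pwd /data/output \
  analysis.sif
```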

The job file

The templated shell script that PBS executes.

The script should be hidden so it is not returned via API directory listings of results.

There's also no need to give the script an extension.

Mapping

  • Host: {root}/analysis_results/{analysis_job_id}/{first_two_uuid}/{uuid}/.{job_name}
    • example: /work/a2o/analysis_results/1/ab/abcdef01-23456789a-bcde-f1234567890a/.7890
  • App: {root}/analysis_results/{analysis_job_id}/{first_two_uuid}/{uuid}/.{job_name}
    • example: /data/analysis_results/1/ab/abcdef01-23456789a-bcde-f1234567890a/.7890
  • Cluster: {root}/analysis_results/{analysis_job_id}/{first_two_uuid}/{uuid}/.{job_name}
    • example: /work/a2o/analysis_results/1/ab/abcdef01-23456789a-bcde-f1234567890a/.7890
  • Container: N/A
    • there's no need for the container to see the job script that is running it
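
A hypothetical submission from the output directory (the exact qsub invocation may differ):

```bash
# Submit the hidden job script; -N gives the job a visible name even
# though the script file itself is a hidden dot-file.
cd /work/a2o/analysis_results/1/ab/abcdef01-23456789a-bcde-f1234567890a/
qsub -N 7890 .7890
```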

The job logs

PBS automatically writes log files for each job it runs.

We'll merge stdout and stderr into one file (with the -j oe PBS option).
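
In the job script this can be expressed as a directive near the top (it could equally be passed as a qsub flag):

```bash
# Merge stderr into stdout so PBS writes one log file per job.
#PBS -j oe
```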

Mapping

  • Host: {root}/analysis_results/{analysis_job_id}/{first_two_uuid}/{uuid}/.{job_name}.o.{job_id}
    • example: /work/a2o/analysis_results/1/ab/abcdef01-23456789a-bcde-f1234567890a/.7890.o.3329590
  • App: {root}/analysis_results/{analysis_job_id}/{first_two_uuid}/{uuid}/.{job_name}.o.{job_id}
    • example: /data/analysis_results/1/ab/abcdef01-23456789a-bcde-f1234567890a/.7890.o.3329590
  • Cluster: {root}/analysis_results/{analysis_job_id}/{first_two_uuid}/{uuid}/.{job_name}.o.{job_id}
    • example: /work/a2o/analysis_results/1/ab/abcdef01-23456789a-bcde-f1234567890a/.7890.o.3329590
    • Note though: our job will never see this file; it is written by PBS after job completion
  • Container: N/A
    • the container will never see the job log.