
HPC Whisper Experiment notes (2024)


Automation

Automating HPC is hard.  The Big Red 200 admins have done a ton of things to make automation difficult.  But I'm clever, so here we are.

To automate slurm job submission we have to pretend we're just someone logging into a login node and running some commands.  There's a script on BR200 called `hpc_service` that takes some args:

  • submit - submit a new slurm job
  • check <id> - check the status of a slurm job
  • list - list the outstanding slurm jobs
  • cancel <id> - cancel a slurm job

While not all of the functionality is fully implemented, the most important one (`submit`) is.  This single script is the entire interface for HPC stuff.

The client side of things

With something sitting on the server side of things, whisper jobs are submitted via `hpc_whisper_client`, which has these args:

usage: hpc_whisper_client.py [-h] [--debug] [--engine {whisper,faster_whisper}] [--model {tiny,base,small,medium,large}] [--device {cpu,cuda}] [--vad] [--language LANGUAGE]
                             [--hpcuser HPCUSER] [--hpchost HPCHOST] [--hpcscript HPCSCRIPT] [--scpuser SCPUSER] [--scphost SCPHOST]
                             infile [infile ...] outdir

positional arguments:
  infile                Input files
  outdir                Output directory

optional arguments:
  -h, --help            show this help message and exit
  --debug               Turn on debugging
  --engine {whisper,faster_whisper}
                        Use whisper or faster_whisper
  --model {tiny,base,small,medium,large}
                        Whisper model
  --device {cpu,cuda}   Computation device
  --vad                 Use VAD with faster_whisper
  --language LANGUAGE   Language
  --hpcuser HPCUSER     User on HPC
  --hpchost HPCHOST     HPC Host
  --hpcscript HPCSCRIPT
  --scpuser SCPUSER     SCP User
  --scphost SCPHOST     SCP File Host

When the user runs the script

~/iu_hpc_processing/hpc_whisper_client.py --model=small --engine=whisper  --device=cpu --scpuser=app_amphpc ../source2/gettysburg.wav /tmp/output

A JSON blob will be constructed based on the parameters and a connection to BR200 is made to run the `hpc_service submit` script.

The blob contains all of the information needed for the code on BR200 to decide how to split the requested files into appropriately sized batches.  This blob is sent to the `hpc_service` script via STDIN.  The JSON for the above command would look like:

{
    "function": "whisper",
    "params": {
        "engine": "whisper",
        "model": "small",
        "language": "en",
        "device": "cpu",
        "vad": false
    },
    "tasklist": [
        {"infile": "/srv/storage/mdpi_research/test/source2/gettysburg.wav", "outfile": "/tmp/output/gettysburg.wav.whisper.json"}
    ],
    "probes": {
        "/srv/storage/mdpi_research/test/source2/gettysburg.wav": { ... ffprobe data .. }
    },
    "email": "[email protected]",
    "scphost": "esquilax.dlib.indiana.edu",
    "scpuser": "app_amphpc"
}
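As a sketch of that round trip, assuming the client simply shells out to `ssh` and pipes the JSON to the script's STDIN (the notes describe the protocol but not the client's transport code, so the helper below is illustrative):

```python
import json
import subprocess

def submit_job(blob: dict, hpcuser: str, hpchost: str, hpcscript: str) -> dict:
    """Send a job description to hpc_service over SSH and return its reply.

    Hypothetical helper: assumes `ssh user@host <script> submit` with the
    JSON blob on STDIN, per the description above.
    """
    result = subprocess.run(
        ["ssh", f"{hpcuser}@{hpchost}", hpcscript, "submit"],
        input=json.dumps(blob),
        capture_output=True,
        text=True,
        check=True,
    )
    # On a successful submit, hpc_service prints a JSON blob of slurm job ids.
    return json.loads(result.stdout)
```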

On a successful submit, the `hpc_service` script will return a JSON blob of slurm job ids for the tasks created.

Job submission on the server

On the server end of things, the `hpc_service submit` script starts and reads the JSON.  It uses the parameters and the ffprobe data to split the files into batches based on their duration, and groups batches into slurm jobs (if a host can support more than one concurrent batch).  So if the parameters allow 6 hours of content to be processed per batch and each host can support 3 concurrent batches, 100 hours of content yields 17 batches, which are grouped 3 to a job into 6 slurm jobs.
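A minimal sketch of that batching arithmetic (the function and parameter names are illustrative, not the actual `hpc_service` code, which also consults the ffprobe data and per-host limits):

```python
def plan_jobs(durations_hours, batch_hours=6.0, batches_per_job=3):
    """Greedily pack files into batches capped at batch_hours of content,
    then group batches into slurm jobs."""
    batches, current, used = [], [], 0.0
    for d in durations_hours:
        if current and used + d > batch_hours:
            batches.append(current)      # batch is full; start a new one
            current, used = [], 0.0
        current.append(d)
        used += d
    if current:
        batches.append(current)
    jobs = [batches[i:i + batches_per_job]
            for i in range(0, len(batches), batches_per_job)]
    return batches, jobs

# 100 one-hour files -> 17 batches -> 6 slurm jobs, as in the example above.
batches, jobs = plan_jobs([1.0] * 100)
print(len(batches), len(jobs))  # 17 6
```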

Each of the slurm jobs is independent and will be processed in FIFO queue order, as soon as the necessary resources are available.

In the above example, the content is only a few minutes long so a job with a 9-hour time limit is submitted with a driver script like this as a single batch in a single slurm job:

#!/bin/bash
#SBATCH -J job-1706281910.538736-whisper-0-64
#SBATCH -A r00652
#SBATCH -o /N/scratch/bdwheele/mdpi_batches/job-1706281910.538736-whisper-0-64/stdout.txt
#SBATCH -e /N/scratch/bdwheele/mdpi_batches/job-1706281910.538736-whisper-0-64/stderr.txt
#SBATCH -t 540
#SBATCH --mail-type=ALL
#SBATCH [email protected]
#SBATCH --mincpus=64
#SBATCH --mem 64G
module load apptainer
module load ffmpeg
module load python
cd /N/scratch/bdwheele/mdpi_batches/job-1706281910.538736-whisper-0-64
time apptainer run --nv /N/home/u015/bdwheele/BigRed200/iu_hpc_processing/hpc_python.sif /N/home/u015/bdwheele/BigRed200/iu_hpc_processing/hpc_whisper_server.py <<EOF
{
    "scphost": "esquilax.dlib.indiana.edu",
    "scpuser": "app_amphpc",
    "params": {
        "engine": "whisper",
        "model": "small",
        "language": "en",
        "device": "cpu",
        "vad": false
    },
    "batches": [
        [
            {
                "infile": "/srv/storage/mdpi_research/test/source2/gettysburg.wav",
                "outfile": "/tmp/output/gettysburg.wav.whisper.json",
                "duration": 183.04
            }
        ]

    ]
}
EOF

echo $? >> returncode.txt

Actually running whisper on HPC

At some point the submitted slurm job will run the script and whisper actually starts.  When `hpc_whisper_server` starts it will read STDIN to get the JSON job description.  At that point it will set up a thread for each batch, and each thread will (see the sketch after this list):

  • look up the SSH keypair on BR200 for the specified `scpuser` and connect to the `scphost` via sftp
  • load the model data into the device
  • for each item in the batch:
    • use sftp to transfer the `infile` to BR200
    • convert the file to 16-bit mono wav
    • run the transcription engine on the audio file
    • write the transcription data to a temporary file
    • transfer the transcription file to the `outfile` via sftp
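A minimal sketch of one batch-worker thread, assuming `paramiko` for the sftp connection and the plain whisper engine (`hpc_whisper_server.py`'s real code isn't reproduced in these notes, so the helper names are illustrative):

```python
import json
import subprocess
import tempfile

import paramiko
import whisper

def process_batch(batch, params, scpuser, scphost, keyfile):
    """Run one batch: pull each infile, transcribe it, push the result back."""
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(scphost, username=scpuser, key_filename=keyfile)
    sftp = ssh.open_sftp()
    # Load the model data into the device (engine dispatch omitted here;
    # this assumes the plain whisper engine).
    model = whisper.load_model(params["model"], device=params["device"])
    for item in batch:
        with tempfile.TemporaryDirectory() as tmp:
            local_in, wav, local_out = f"{tmp}/in", f"{tmp}/in.wav", f"{tmp}/out.json"
            sftp.get(item["infile"], local_in)       # transfer infile to BR200
            subprocess.run(                          # convert to 16-bit mono wav
                ["ffmpeg", "-y", "-i", local_in, "-ac", "1",
                 "-ar", "16000", "-acodec", "pcm_s16le", wav],
                check=True)
            result = model.transcribe(wav, language=params.get("language"))
            with open(local_out, "w") as f:          # write to a temporary file
                json.dump(result, f)
            sftp.put(local_out, item["outfile"])     # push transcript to outfile
    sftp.close()
    ssh.close()
```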

In addition to the standard JSON, job-based metadata is inserted for processing statistics:

    "_job": {
        "runtime": 68.45356750488281,
        "media_duration": 183.04,
        "job_name": "job-1706279180.3036363-whisper-0-8",
        "job_id": "2411877",
        "params": {
            "engine": "whisper",
            "model": "small",
            "language": "en",
            "device": "cpu",
            "vad": false
        },
        "infile": "/srv/storage/mdpi_research/test/whisper-small/../source2/gettysburg.wav",
        "outfile": "/srv/storage/mdpi_research/test/whisper-small/gettysburg.wav.whisper.json",
        "scp_callback": "[email protected]"
    }

At this time the slurm job directories are not cleaned up.

Transcription Throughput vs Speed

With BR200 we are competing for resources -- the cluster is used by researchers of all sorts.  BR200 has two partitions: a general partition without GPU resources (640 machines) and a gpu partition with GPU resources (64 machines).  

With the high demand of GPU resources and relative scarcity of them, the queuing time of GPU jobs can be 15 hours or more.  So even though the GPU processing is fast, the amount of time you have to wait negates the speed advantage of the GPU. 

It's still an outstanding question, but there is a reasonable likelihood that using the CPU nodes for transcription will process more files over a given period of time, because the wait time is often lower.

Related to this is determining how to split up a todo list of files.  When submitting a job to slurm you have to supply a time limit for the job, and jobs with higher limits are often given lower priority.  Additionally, if the processing has not completed by the time limit, the job is killed mid-processing.  So there is a balance to strike between packing as many files as possible into each batch (to hold the resource as long as possible) and the number of jobs created (more, shorter jobs may get scheduled faster, but fewer, longer jobs spend less total time in the queue).

Determining the optimal balance is an outstanding issue.

Given the speed difference between GPU and CPU processing times, when trying to use CPU nodes to reduce latency it becomes important to reduce the amount of processing time that the CPU-based transcription uses.

Both throughput and speed are measurements of how much content can be transcribed per unit of time; they differ in what they measure.  Speed is how fast a single audio file can be transcribed, whereas throughput measures the time required for one or more audio files to be processed in aggregate.

So as an example, assume we're going to transcribe 5 60-minute audio files.  For GPU-based transcription, let's say you can do an hour of content in 10 minutes and for CPU-based it takes 70 minutes per content hour.  Those are the speeds.

If the jobs are executed immediately and sequentially, the GPU throughput is 10 minutes * 5 hours of content = 50 minutes, and CPU throughput is 70 minutes * 5 hours = 350 minutes.   The GPU in this instance has substantially higher throughput.  This is a worst case for immediate execution.

If the jobs are executed immediately and run concurrently, the GPU throughput is 10 minutes and CPU is 70 minutes.  This is best case for immediate execution.

But jobs don't always run immediately.  We're waiting for resources on the cluster, and (usually) there are more people waiting for the GPU nodes than the CPU nodes.  Let's say the average wait time for a GPU job is 2 hours and for a CPU job is 15 minutes.

For sequential GPU, that's (10 minutes of processing + 2 hours of queue) * 5 hours of content = 10 hours, 50 minutes.  For sequential CPU that works out to (70 minutes of processing + 15 minutes of queue) * 5 hours of content = 7 hours, 5 minutes.  So even though the CPU has a lower speed, the throughput is higher.
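The same arithmetic in a few lines, using the assumed averages above:

```python
def sequential_minutes(proc_min_per_hour, queue_min, content_hours):
    """Wall-clock minutes when each content-hour is its own sequentially
    submitted job, so every job pays its own queue wait."""
    return (proc_min_per_hour + queue_min) * content_hours

print(sequential_minutes(10, 120, 5))  # GPU: 650 minutes = 10 h 50 min
print(sequential_minutes(70, 15, 5))   # CPU: 425 minutes =  7 h  5 min
```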

The throughput becomes substantially harder to estimate when the job wait time varies (some jobs start when the cluster is idle while others have to wait because it is busy) or when only some subset of the queued jobs can run.

Whisper vs Faster Whisper

Whisper uses PyTorch to do the transcription inference.  PyTorch inference is highly optimized for GPUs and works less well on a CPU platform.

The Faster Whisper project uses a different inference library (CTranslate2), to which the whisper models have been converted.  The Faster Whisper models tend to be smaller and more CPU-friendly (roughly a 4x speed improvement) without losing accuracy.

One feature that Faster Whisper has that may be handy is VAD (voice activity detection): it can remove non-speech segments when doing processing, to reduce the overall processing time.
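For reference, the two engines' Python APIs look roughly like this (a sketch under the settings used in these tests; `audio.wav` is a placeholder):

```python
# Plain whisper
import whisper
model = whisper.load_model("small")
result = model.transcribe("audio.wav", language="en")
print(result["text"])

# faster_whisper, with the optional VAD filter that drops non-speech segments
from faster_whisper import WhisperModel
model = WhisperModel("small", device="cpu")
segments, info = model.transcribe("audio.wav", language="en", vad_filter=True)
print(" ".join(segment.text for segment in segments))
```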

Accuracy Results

The test file is a reading of the Gettysburg Address from NPR, with an introduction.  It is 3:03 of clean speech.

Both whisper and faster_whisper were used with the small, medium, and large models.  CPU and GPU processing were both used.  For faster whisper, VAD was used with the CPU.

CPU vs GPU

When comparing CPU/GPU using the same engine and model, all of the results had a Word Information Preserved of 100% -- the results were identical.

There's a definite difference between processing times.  Times are in content seconds per clock second, so 1 = processing takes as long as the content, 2 = twice real time (half the content length to process), 0.5 = half real time (twice the content length):


| Engine  | GPU | Small  | Medium | Large  |
|---------|-----|--------|--------|--------|
| Faster  | No  | 7.435  | 3.127  | 1.376  |
| Faster  | Yes | 41.776 | 34.885 | 25.288 |
| Whisper | No  | 2.674  | 1.075  | 0.584  |
| Whisper | Yes | 11.995 | 9.750  | 7.667  |


GPU is always faster than CPU and Faster Whisper is always faster than Whisper, often 3x-4x.
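To turn a table entry into wall-clock time, divide the content duration by the ratio; for the 3:03 (183-second) test file:

```python
duration = 183.04        # seconds of content in the test file
ratio = 7.435            # e.g. Faster Whisper, small model, no GPU (table above)
print(duration / ratio)  # about 24.6 seconds of processing time
```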

For the whisper CPU results, 8 CPU threads were used, as that seems to be the maximum number that can be used without issues.  Faster whisper uses 16 CPU threads, which is the sweet spot for it.  From a scheduling perspective it looks like one could run more concurrent batches using whisper (since it uses fewer CPUs per batch-thread), but alas, there's an issue where running more than one whisper thread causes the process to slow down dramatically (or hang).  Faster whisper doesn't have this issue and multiple batch threads can run concurrently (I've tested up to 4 concurrent batches), so the throughput of faster whisper on CPU is substantially higher than whisper on CPU.

Models vs Ground Truth

Compared to a ground truth, here is how the models fared, with "word error rate" (WER) and "word information preserved" (WIP) stats.  Parenthetical stats are from comparing with --nocase and --nopunc.


| Engine     | Model  | WER            | WIP             | Notes |
|------------|--------|----------------|-----------------|-------|
| Faster     | Large  | 10% (6.28%)    | 84.69% (92.00%) | Majority of differences were punctuation (or numeric vs. text numbers), but a couple of hallucinations/mishearings. |
| Faster     | Medium | 0% (0%)        | 100% (100%)     | No errors compared to my transcription... which was based on this transcription 🙂 |
| Faster     | Small  | 3.95% (1.31%)  | 93.01% (98.17%) | Punctuation, mostly.  Plus adding the word "Abraham" where it wasn't there.  If the nopunc option converted '-' to a space rather than just removing it, the only difference from the transcript would be the added "Abraham". |
| Whisper    | Large  | 4.74% (1.57%)  | 92.06% (98.19%) | Mostly punctuation and number vs. text.  One hallucination. |
| Whisper    | Medium | 5.79% (2.88%)  | 90.25% (95.59%) | Mostly punctuation and number vs. text.  One big hallucination. |
| Whisper    | Small  | 16.58% (13.35%) | 79.94% (86.14%) | Punctuation, numbers.  One hallucination, and then it totally chopped off the end. |
| Faster/VAD | Large  | 10.79%         | 84.92%          | Some hallucinations; dropped a whole sentence. |
| Faster/VAD | Medium | 15.53%         | 81.92%          | Punctuation, dropped words.  Truncated transcript. |
| Faster/VAD | Small  | 5%             | 90.49%          | Punctuation only. |

Some of the recoverable base/comparison alignments from those runs:

Faster/Large (hallucination/mishearing):

BASE: nation so conceived and so dedicated can long endure. We are met on
COMP: nation so conceited, could be a great civil war. We are met on

Faster/Small (inserting "Abraham"):

BASE: portrayed President ******* Lincoln on stage and screen. Four score and
COMP: portrayed President Abraham Lincoln on stage and screen. Four score and

Whisper/Large (with --nopunc, the only non-numeric difference is this hallucination):

BASE: this but in a larger sense ** ****** ******** we cannot dedicate we cannot
COMP: this but in a larger sense we cannot dedicate we cannot dedicate we cannot

Whisper/Medium (the one big hallucination):

BASE: from the earth. Actor Sam Waterston reading the Gettysburg Address delivered by President Abraham
COMP: from the earth. The New York ******* *** ********** Address delivered by President Abraham

It looks like Faster/VAD with the non-small models gets a little... deletey.

It seems like all of the models have issues one way or another, and those issues vary depending on the source content.  Setting the expectation that the transcripts are around 80-90% accurate is probably important.
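The comparison tool itself isn't shown in these notes; as a sketch, the same WER/WIP numbers can be computed with the `jiwer` library (an assumption about tooling, not necessarily what was used here):

```python
import jiwer

reference = open("ground_truth.txt").read()
hypothesis = open("transcript.txt").read()

print("WER:", jiwer.wer(reference, hypothesis))  # word error rate
print("WIP:", jiwer.wip(reference, hypothesis))  # word information preserved
```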

Model Applicability

We may choose different models based on the type of content (commercial LPs in MDPI likely don't need super-accurate transcriptions) but for some of the stuff in the archives we might want to take special care.

Additionally, if we have a requirement for transcripts of everything in MCO, we might run a small/base model on the content to populate the transcripts quickly and assume we'll go back to more critical assets later.

Harry Potter and the Whispering Transcript

I own a copy of the Harry Potter and the Sorcerer's Stone audiobook, unabridged.  I took all of the audio files and combined them into a single audio file.  The total runtime is around 8:16:00.

Doing a quick search around the internet I found the text for the book (someone had it in a github repo).  No idea how accurate it is, but why not.  The text consists of 78,451 words.

Running whisper on unicorn with the GPU using the small model, an hour of content was processed every 9 minutes or so. 

I made some changes to the whisper comparison tool so that it will (optionally) strip punctuation and lowercase everything, to reduce the number of "pointless" differences.
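A minimal sketch of that normalization pass (illustrative; the actual comparison-tool change isn't reproduced here):

```python
import string

def normalize(text: str, nocase: bool = True, nopunc: bool = True) -> str:
    """Lowercase and strip punctuation to suppress 'pointless' differences."""
    if nocase:
        text = text.lower()
    if nopunc:
        # Note: this deletes '-' outright; mapping it to a space instead would
        # avoid hyphenation mismatches (see the nopunc note in the accuracy
        # table above).
        text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

print(normalize("Mrs. Dursley, of number four, Privet Drive"))
# -> mrs dursley of number four privet drive
```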

Here are the overall statistics:

Base File: /srv/scratch/harry_potter_test/harry_potter_1.json
Comp File: /srv/scratch/harry_potter_test/All_Chapters.json

Stats: 
  Word Error Rate:               3.89%
  Word Information Lost:         6.64%
  Word Information Preserved:   93.36%
  Match Error Rate:              3.87%

Edit Stats:
  Hits:          75007
  Inserts:         429
  Deletes:         372
  Substitutions:  2221

That's surprisingly good.  Many of the differences are Britishisms.

BASE: mrs dursley of number four privet drive were proud to say that they were 
COMP: mrs dursley of no     4    privet drive were proud to say **** they were 
EDIT:                SSSSSS SSSS                                DDDD          
BASE: although he did have a very large mustache  mrs dursley was thin and 
COMP: although he did have a very large moustache mrs dursley was thin and 
EDIT:                                   SSSSSSSSS                         

Of course, things like dialects are a little iffy

BASE: muggles sssorry sobbed hagrid taking out a large spotted handkerchief and 
COMP: muggles sorry   subbed hagrid taken  out a large spotted handkerchief and 
EDIT:         SSSSSSS SSSSSS        SSSSSS                                      
BASE: little harry off ter live with muggles yes yes its all very sad but get a 
COMP: little harry off to  live with muggles yes yes its all very sad but get a 
EDIT:                  SSS                                                     

Latency / Throughput Test

Using the Harry Potter audio I submitted variations of the processing to BR200: whisper/faster_whisper, cpu/cuda, small/medium/large -- a total of 12 jobs.  Each job processes 8:16:00 of content, so around 100 hours of content across all of them.

The BR200 cluster is configured as 640 nodes with 128 CPU threads and 256G RAM, and 64 nodes with 4 GPUs, 256G RAM, and 64 CPU threads.  Slurm scheduling will split the machines based on the resource requirements, allowing for more concurrent jobs than the number of machines would suggest.

At the time of submission, there are 2040 slurm jobs.  220 of these jobs are on the general partition (non-GPU) and 1811 of them are on the gpu partition.

On the general partition, there are 258 nodes allocated, which is less than half of the physical node count, and it should be noted that many of the physical nodes are running more than one job -- there are a bunch of nodes with 4 jobs running on them.  This means that (currently) there is a huge amount of capacity available on the general partition, and indeed, the cpu-based whispers started immediately upon queuing.

The gpu partition, on the other hand, is vastly overallocated.  There are 89 jobs running, some of them sharing the same node.  That means there are 1722 jobs ahead of the whisper jobs that I submitted.

Waiting for the CPU jobs to complete and the GPU jobs to start... they've been waiting for 90 minutes so far.
