Report on HPC vs Local Environments for INA Kaldi Workflow - AudiovisualMetadataPlatform/amp_documentation GitHub Wiki

Report on HPC vs Local Environments for INA/Kaldi Workflow

The following is a report on the results of a test of two different computing environments for a Kaldi-based transcript workflow in AMP: a high-performance computing (HPC) environment in IU's Carbonate computing cluster, and a local environment in AMPPD. While it was (correctly) assumed that the HPC environment would provide significant performance increases over the local environment, it was not known how large the increase would be. To determine this, the 32 content files (equaling 10 hours of content) that had been submitted to the HMGM workflow were submitted to two new workflows in Galaxy, each consisting of the INA speech segmenter and Kaldi running in the respective environment. Timing data for each workflow was collected in an Excel workbook, saved as a CSV, and imported into Python for basic statistical analysis in Pandas. The results are reported below.

Data

The data was collected by hand from Galaxy into an Excel workbook. The data consists of start and end times for each content file in each workflow, as well as running times for INA and Kaldi for each workflow. Total running time (wall time) was additionally derived for each workflow from the start and end times. All running times are presented in seconds; the INA and Kaldi times for the HPC workflow additionally carry six decimal places of fractional-second precision.
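The derivation of wall time from start and end times can be sketched as follows. This is a minimal illustration using only the Python standard library, not the procedure actually used for the workbook: the timestamp format, the function name `wall_time_seconds`, and the example timestamps are all assumptions made for the sake of the example.

```python
from datetime import datetime

# Assumed timestamp format; the format used in the workbook is not stated.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def wall_time_seconds(start: str, end: str, fmt: str = TIME_FORMAT) -> float:
    """Return the elapsed wall time in seconds between two timestamp strings."""
    started = datetime.strptime(start, fmt)
    ended = datetime.strptime(end, fmt)
    if ended < started:
        raise ValueError("end time precedes start time")
    return (ended - started).total_seconds()


# Made-up timestamps for illustration (not taken from hpc_time_data.csv).
example = wall_time_seconds("2021-03-01 09:00:00", "2021-03-01 09:45:36")
print(example)  # 2736.0
```

Because the subtraction is done on full date-and-time values, a job that runs past midnight still yields the correct duration.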

[hpc_time_data.csv](/confluence-prd/download/attachments/683706352/hpc_time_data.csv?version=1&modificationDate=1615227901000&api=v2)

Results

| Statistic | Elapsed Time (INA HPC) | Elapsed Time (Kaldi HPC) | Wall Time (HPC) | Elapsed Time (INA Local) | Elapsed Time (Kaldi Local) | Wall Time (Local) |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| Count | 32 | 32 | 32 | 32 | 32 | 32 |
| Mean | 30.511828 | 58.216046 | 2234.78125 | 651.78125 | 7196.8125 | 28750.75 |
| Std. | 6.271749 | 39.983899 | 850.536178 | 289.456636 | 4880.088593 | 47541.597546 |
| Min | 17.400371 | 17.532955 | 259 | 171 | 753 | 3223 |
| 25% | 24.9424 | 24.244938 | 2029.25 | 473 | 3347 | 6156.25 |
| 50% | 31.244485 | 43.986367 | 2736 | 605.5 | 6122 | 13356.5 |
| 75% | 35.242369 | 80.581342 | 2764 | 855.25 | 9553 | 21612 |
| Max | 44.028269 | 154.489659 | 3353 | 1263 | 17172 | 177323 |

(All times in seconds)
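The rows of the summary above (count, mean, standard deviation, minimum, quartiles, maximum) can be reproduced for any single column of timings. The following is a rough sketch using Python's standard `statistics` module rather than the Pandas analysis described earlier; the function name `summarize` and the sample values are made up for illustration. The `inclusive` quantile method interpolates linearly between data points, which should correspond to the default behavior of Pandas' `describe()`.

```python
import statistics


def summarize(values):
    """Return summary rows like those in the results table for one column."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Count": len(values),
        "Mean": statistics.mean(values),
        "Std.": statistics.stdev(values),  # sample standard deviation (n - 1)
        "Min": min(values),
        "25%": q1,
        "50%": median,
        "75%": q3,
        "Max": max(values),
    }


# Made-up running times in seconds, with one extreme value.
sample = [100, 200, 300, 400, 5000]
summary = summarize(sample)
for row, value in summary.items():
    print(f"{row}: {value}")
```

In this made-up sample the single extreme value pulls the mean (1200) to four times the median (300), while the quartiles are unaffected. This is the same kind of distortion that outliers introduce into a right-skewed column.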

The results show a per-file time reduction of 79.52% for files in the HPC environment as opposed to the local environment, based on the median total (wall) times for each workflow (2,736 seconds, or 45 minutes and 36 seconds, versus 13,356.5 seconds, or 3 hours, 42 minutes, and 36.5 seconds, respectively). The median was used instead of the mean because the data for the local workflow was right-skewed due to three extreme outliers. In other words, a typical file running through the HPC workflow can be expected to finish processing in roughly 20 percent of the time it would take for the same file to finish processing in the local workflow. This is a substantial performance increase. It is made more substantial by the fact that the HPC workflow can handle a larger number of files at once: while the exact maximum for each workflow is unknown, all 32 files were submitted simultaneously to the HPC workflow, whereas the files had to be separated into groups of 6 for the local workflow.

For Kaldi, there is a per-file time reduction of 99.28%, based on the median running times (~44 seconds in the HPC workflow versus 6,122 seconds in the local workflow), and for the INA speech segmenter, there is a reduction of 94.84% (median times of 31.24 seconds versus 605.5 seconds). This again shows an enormous performance increase in the HPC workflow.
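The percentage reductions quoted above follow directly from the median values in the results table. A short sketch of the arithmetic is given here; the function name `percent_reduction` and the dictionary layout are illustrative choices, while the numbers are the medians (50% row) reported in the table.

```python
def percent_reduction(baseline: float, improved: float) -> float:
    """Return the percentage by which `improved` is smaller than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (1 - improved / baseline) * 100


# Median times in seconds from the results table: (HPC, local).
MEDIANS = {
    "Wall time": (2736.0, 13356.5),
    "Kaldi": (43.986367, 6122.0),
    "INA": (31.244485, 605.5),
}

reductions = {
    name: percent_reduction(local, hpc) for name, (hpc, local) in MEDIANS.items()
}

for name, value in reductions.items():
    print(f"{name}: {value:.2f}% reduction")
# Wall time: 79.52% reduction
# Kaldi: 99.28% reduction
# INA: 94.84% reduction
```

The local median serves as the baseline in each case, so a reduction of 79.52% is equivalent to the HPC workflow taking about 20.48% of the local time.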

Document generated by Confluence on Feb 25, 2025 10:39