Prodigal Example - MeirellesLab/AzureCustomTasks GitHub Wiki

PRODIGAL - Parallelizing Gene Prediction in Large-Scale Metagenomic Data

The Prodigal software is a widely used tool developed to predict protein-coding genes in prokaryotic genome data. Prodigal is known to be fast and accurate, handling draft genomes and metagenomes, recognizing gaps and partial genes, and identifying translation initiation sites.

In our scenario, we had to analyze metagenomic sequencing data spread across more than 7,000 .fasta files, with heterogeneous sizes ranging from 1 MB to 30 GB and totaling over 8 TB of data.

The analysis of metagenomic data poses big challenges, especially with large datasets. It must be planned carefully, since it requires a massive amount of CPU and memory, plus good disk-space management to avoid failures such as out-of-memory errors, full disks, and other major performance and resource problems.

Solving those issues naively could lead to high costs and performance loss: either we use a high-performance machine for all inputs, or we adopt an overly complicated design, such as partitioning the inputs into sets by required resources, according to their size, and running each set on a different pool of VMs with distinct computational capacities. The latter would require creating and monitoring numerous batch pools and jobs just to execute the same command, which means more time to code and debug each instance, complicating the development of the analysis.

So, using ACT, we explored an Azure SDK feature to perform this partitioning more simply. Using the requiredSlotFormula configuration key, we calculate the required slot count for each task according to its input size. In this way, we can split a pool of high-resource compute nodes into smaller computing slots, defined by the parameter taskSlotsPerNode, and allocate tasks according to their required slots, which gauge the resources each task needs. This optimizes resource usage, guarantees that tasks run without excess resources, and maximizes the number of tasks that can be allocated at a given moment, increasing execution efficiency and, consequently, reducing the cost of running the entire job.
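To make the partitioning concrete, here is a small plain-Python sketch of the slot arithmetic. The node size (32 GB of RAM) and slot count (4) are illustrative assumptions, not values from the wiki:

```python
# Illustrative slot-partitioning arithmetic: one compute node split into
# equal-memory slots via the pool's taskSlotsPerNode setting.
vm_memory_size = 32_000_000_000   # assumed bytes of RAM per compute node
task_slots_per_node = 4           # assumed pool taskSlotsPerNode value

memory_per_slot = vm_memory_size // task_slots_per_node
print(memory_per_slot)  # 8 GB available to each slot

# A task requesting 1 slot leaves room for 3 more concurrent tasks on the
# node; a task requesting all 4 slots gets the whole node to itself.
```

With this split, four small tasks can run side by side on one node, while a single large task can still claim the node's full memory by requesting all four slots.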

To do this, we edit the requiredSlotFormula parameter in the tasks.inputs configuration. This parameter is a vector of strings, each one representing a statement written in Python. Together, the statements form the code that calculates the requiredSlots attribute of each Task.

"requiredSlotFormula": [
  "vmMemorySize = 32000000000",
  "maxTaskSlotSize = vmMemorySize / $pool.taskSlotsPerNode",
  "calculatedSlots = int(input_size/maxTaskSlotSize) + 1",
  "requiredSlots = calculatedSlots if (input_size > maxTaskSlotSize) else 1"
]
  • These statements can be as simple as an assignment:
vmMemorySize = 32000000000
  • Can refer to another configuration attribute using the characters '$' (dollar sign) and '.' (dot) to follow the JSON hierarchy down to it, as in:
maxTaskSlotSize = vmMemorySize / $pool.taskSlotsPerNode
# The **taskSlotsPerNode** from the pool configuration determines how many slots each compute node will be split into. 
# It must take into account the **vm_size** and its corresponding number of vCPUs, RAM, and disk space. 
# Here we use the VM memory size to calculate the amount of memory available to each slot. 
  • Can use some built-in Python functions, such as: abs, all, any, bin, bool, chr, float, str, hash, int, len, max, min, ord, range, sum, reversed, round, sorted, filter, format. **Any other call is restricted and will result in an error:**
calculatedSlots = int(input_size/maxTaskSlotSize) + 1
# We have access to the input_name and input_size from each input.
# So, we can use these variables in the formulas to make the result specific to the input characteristics.
  • At the end of these formulas, you should assign to the variable requiredSlots, which has a default value of 1 and designates how many slots the Task will occupy.
requiredSlots = calculatedSlots if (input_size > maxTaskSlotSize) else 1
# We can use any of the Python built-in operators and control structures, like: **+, -, /, %, &, |, ==, if, else, for, while, and so on**.
# But we cannot define classes or functions, nor use import statements.
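Putting the statements above together, the formula can be simulated in plain Python. The function below mirrors the wiki's variable names; the pool values (a 32 GB node split into 4 slots) are the same illustrative assumptions used in the example formula:

```python
# Plain-Python simulation of the requiredSlotFormula example above.
def required_slots(input_size, vm_memory_size=32_000_000_000, task_slots_per_node=4):
    # Memory available to a single slot, given the pool's taskSlotsPerNode.
    max_task_slot_size = vm_memory_size / task_slots_per_node
    # One slot per full slot's worth of input, plus one for the remainder.
    calculated_slots = int(input_size / max_task_slot_size) + 1
    # Default is 1 slot; only inputs larger than one slot claim more.
    return calculated_slots if input_size > max_task_slot_size else 1

print(required_slots(1_000_000))        # a 1 MB input fits in a single slot
print(required_slots(30_000_000_000))   # a 30 GB input claims 4 slots (the whole node)
```

This shows how the smallest .fasta files in our dataset occupy one slot each, while the 30 GB inputs claim an entire node, so both extremes run in the same pool without over- or under-provisioning.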

This setup will make our Tasks run without using excess resources.

Try running it yourself using the files from our repository.

Remember to:

  • Place the config.json in your working directory
  • Edit the config.json to your Azure Account specs
  • Upload the inputs and scripts to their Storage Containers