Simple Batch Example

Azure Batch Python Quickstart

This example is a port of the Run a Batch Job sample code provided by Microsoft, and it shows how simple it is to create a new batch job using ACT.

The major difference between this and the Hello World Example is that now we are simulating a real execution scenario where the inputs and scripts are placed in a storage container.
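
If you need to stage these files yourself, the sketch below shows one possible way to do it with the azure-storage-blob package. The AZURE_STORAGE_CONNECTION_STRING variable, the extra taskdata file name, and the scripts/ path are assumptions made for illustration; ACT does not require this exact step.

# A minimal sketch (not part of ACT) of uploading the example inputs and the
# script to the "mydata" container. Assumes the azure-storage-blob package and
# a connection string in the AZURE_STORAGE_CONNECTION_STRING variable.
import os
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"])
container = service.get_container_client("mydata")

# Place the input files under the inputs/ path with the taskdata prefix,
# matching the storage.input section of the config shown below.
for name in ["taskdata0.txt", "taskdata1.txt"]:
    with open(name, "rb") as data:
        container.upload_blob(name=f"inputs/{name}", data=data, overwrite=True)

# Place the script that each Task will execute (the scripts/ path is assumed).
with open("my_script.sh", "rb") as data:
    container.upload_blob(name="scripts/my_script.sh", data=data, overwrite=True)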

Then run ACT from the root folder using the following commands:

# To preview the execution before sending it to the cloud
python3 src/act/azure_custom_tasks.py -j examples/simplebatch/config.json -s

# To run the command
python3 src/act/azure_custom_tasks.py -j examples/simplebatch/config.json

Here you can take a look at what is new in the config.json file:

...
"tasks": {
    "addCollectionStep":10,
    "inputs": {
      "areBlobsInInputStorage":true,
      "inputFileExtension":".txt",
      "outputFileExtension":"",
      "filterOutExistingBlobInOutputStorage": false,
      "filterOutExistingTaskInCurrentJob": false,
    },
    "resources": {
      "automaticInputsUpload":true,
      "automaticScriptsUpload":true
    },
    "logs": {
      "automaticUpload":true,
      "destinationPath":"logs/simplebatch/",
      "pattern":"../std*"
    },
    "outputs": {
      "automaticUpload":false
    },
    "command":"bash -c \"./my_script.sh ",
    "commandSuffix":"\"",
    "retryCount":0,
    "retentionTimeInMinutes":1000
  }
...
"storage": {
  ...
  "input": {
      "container":"mydata",
      "path":"inputs/",
      "blobPrefix":"taskdata"
    },
...

The parameter tasks.inputs."areBlobsInInputStorage" tells ACT to get its inputs from the configured input container, which here is mydata (storage.input."container":"mydata"), inside the folder inputs/ (storage.input."path":"inputs/"). Notice that, because of the parameter storage.input."blobPrefix":"taskdata", only the files whose names start with the string taskdata are selected as inputs.
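
As a rough illustration of this selection (assuming the azure-storage-blob package; this mirrors the behavior, not ACT's actual code), the path and blobPrefix combine into a single name prefix when listing the container:

import os
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"])
container = service.get_container_client("mydata")

# storage.input."path" + storage.input."blobPrefix" act as one name prefix.
for blob in container.list_blobs(name_starts_with="inputs/taskdata"):
    print(blob.name)   # e.g. inputs/taskdata0.txt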

The parameter tasks.inputs."inputFileExtension":".txt" defines the extension of the input files, and tasks.inputs."outputFileExtension":"" defines the extension expected for their corresponding outputs. The latter is required if you want to remove from the input list the inputs that already have their corresponding output in the output storage. For that, you also need to set the parameter tasks.inputs."filterOutExistingBlobInOutputStorage" to true, but in this example it is false.
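
A conceptual sketch of that filtering idea follows; the outputs/ path and the already-existing output blob name are purely hypothetical, and this mirrors the concept rather than ACT's exact implementation.

# Drop every input whose expected output blob, obtained by swapping the
# configured extensions, already exists in the output storage.
input_ext, output_ext = ".txt", ""            # values from this config.json
inputs = ["inputs/taskdata0.txt", "inputs/taskdata1.txt"]
existing_outputs = {"outputs/taskdata1"}      # hypothetical output blob names

def expected_output(blob_name):
    # inputs/taskdata0.txt -> outputs/taskdata0
    base = blob_name.rsplit("/", 1)[-1][:-len(input_ext)]
    return "outputs/" + base + output_ext

pending = [b for b in inputs if expected_output(b) not in existing_outputs]
print(pending)   # ['inputs/taskdata0.txt']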

The parameter tasks.inputs."filterOutExistingTaskInCurrentJob": false can be used to filter out of the input list the inputs that already have a Task associated with them. This is useful if a specific input caused an error and you want to run it again: set this parameter to true, erase the failed Task, and run the command again. This avoids the need to create a new job only to re-execute failed Tasks. In simplified form (illustrative only, the names below are made up), the idea is shown in the sketch that follows.
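
# Skip any input that already has a Task associated with it in the current job.
inputs = ["taskdata0.txt", "taskdata1.txt"]
inputs_with_tasks = {"taskdata0.txt"}   # e.g. the Task for taskdata0 already exists

to_submit = [i for i in inputs if i not in inputs_with_tasks]
print(to_submit)   # ['taskdata1.txt'] -- only the input without a Task remains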

The parameters tasks.resources."automaticInputsUpload":true and tasks.resources."automaticScriptsUpload":true specify that the input and script blobs must be copied to each Task, on its respective compute node. They are placed in the current working directory of each Task and executed from there.
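
Under the hood this relies on Azure Batch resource files, which are downloaded into a Task's working directory before its command runs. Below is a hedged sketch using the azure-batch SDK; the SAS URLs and blob paths are placeholders, the exact model names can vary between SDK versions, and this is not ACT's code.

from azure.batch import models as batchmodels

# Each resource file is downloaded to the Task's working directory on its node.
resource_files = [
    batchmodels.ResourceFile(
        http_url="https://<account>.blob.core.windows.net/mydata/inputs/taskdata0.txt?<sas>",
        file_path="taskdata0.txt"),
    batchmodels.ResourceFile(
        http_url="https://<account>.blob.core.windows.net/mydata/scripts/my_script.sh?<sas>",
        file_path="my_script.sh"),
]

task = batchmodels.TaskAddParameter(
    id="task-taskdata0",
    command_line='bash -c "./my_script.sh \'taskdata0.txt\' "',
    resource_files=resource_files)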

Finally, tasks."command" and tasks."commandSuffix" specify the command that will be sent to each Task, which for the first input is:

bash -c "./my_script.sh 'taskdata0.txt' "

After the execution, check whether the outputs for each Task, placed in the logs, correspond to each input file.