ScriptHarvestTechnology: Script Harvest Component

Overview

The Script Harvester enables users to launch bash scripts on the v2 nodes.

The Script Harvester is accessible as /app/aleph2/library/script_harvester.jar

Security considerations

Currently the script harvester is not safe for non-admin users on secure clusters. It will shortly be integrated with RBAC, but in the meantime access to this harvest technology should be restricted to admin users (by restricting the read rights of the uploaded JAR).

Logging

The script harvester currently has only minimal Aleph2 logging; see the Logging section at the end of this document for details.

Bucket configuration

The script harvester has a simple configuration model. It allows for three modes of operation, explained below; exactly one of "script", "local_script_url", or "resource_name" should be specified:

{   
   "script": string,           // mode 1: inline script
   "local_script_url": string, // mode 2: script already on the node
   "resource_name": string,    // mode 3: script uploaded in a jar
   "args": { 
     "string_key_1":"string_val_1",
     "string_key_2":"string_val_2",
     //etc, etc
   },
   "required_assets": ["/path/to/asset1","/path/to/asset2"],

   "max_runtime_s": integer,
   "watchdog_enabled": boolean
}
  • max_runtime_s: if specified, the script is killed after (approximately) this many seconds.
  • watchdog_enabled: if true (the default), then every bucket.poll_frequency period the script is checked to see whether it is still running, and is restarted if not (see the sketch below).
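For example, a config that lets the watchdog restart a heartbeat-style script whenever it dies, while killing any single run after roughly an hour, might look like the following sketch (the script itself is purely illustrative):

{
   "script": "while true; do date >> /tmp/heartbeat; sleep 60; done\n",
   "max_runtime_s": 3600,
   "watchdog_enabled": true
}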

Inline script (script):

The inline script mode allows you to pass the bash script inline with the bucket. It will be copied to a local file and run on the machine. Use \n to separate lines in your script. The v1 source editor allows you to write the script in the JS window and reference it via this "script" field (see the example below).

Local File (local_script_url):

The local file mode allows you to run a script that already exists on the harvester node. This should probably be used for debugging only, as you will have to keep all the harvester nodes updated with the script file, or use node affinity to restrict the bucket to the nodes on which you have placed the file (see node affinity in the Aleph2 documentation). Reference the local file directly by its path on the machine, e.g. "local_script_url": "/tmp/myscript.sh". The harvester will just run the script where it is.
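As a sketch, a full harvest config block using this mode (mirroring the example at the end of this document) would look like:

 "harvest_configs": [{
                "config": {"local_script_url": "/tmp/myscript.sh"},
                "enabled": true,
                "name": "harvester_1"
            }],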

Uploaded File (resource_name):

Allows you to upload a script inside a jar file (you can just zip up the single script file and change the extension from .zip to .jar; see the sketch below). Upload that file as a v2 shared library and you can then reference it in your bucket. You will need to do 2 things to reference your uploaded file:

  1. Add the uploaded jar as a 'library_ids_or_names' in the harvest config
  2. Reference the filename in the "resource_name" field

See the example at the end of this document. This method ensures that your script is available on every machine attempting to run it: the harvester copies the script locally before running it.
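For reference, a minimal sketch of packaging a single script with standard command-line tools (the filenames are just examples):

# package a single script as a "jar" (really just a renamed zip)
zip my_uploaded_scripts.zip my_script.sh
mv my_uploaded_scripts.zip my_uploaded_scripts.jar
# then upload my_uploaded_scripts.jar as a v2 shared library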

Additional Args (args):

This field is optional. It allows customization between buckets that use the same script but need different input values (e.g. you could write a script that reads a specific file type from the local machine, and pass an arg telling it where to look for the file). Each arg specified in the map is sent to the bash script as an environment variable, exactly as written; it can then be accessed in the bash script as $keyname, as shown below.
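For example (the key name input_dir is purely illustrative), a bucket might specify:

"args": { "input_dir": "/data/incoming" }

and the script would then read it as an ordinary environment variable:

#!/bin/bash
# input_dir is injected by the script harvester from the "args" map above
ls "$input_dir"/*.csv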

Default Args:

A set of default arguments is passed in as environment variables with every script. They can be accessed in the same way as any additional args you specify (see above), e.g. $argument_name. The default arguments are listed below, followed by a short usage sketch:

  • $A2_MODULE_PATH - path to cached module jars
  • $A2_LIBRARY_PATH - path to the cached library jars
  • $A2_CLASS_PATH - path of the library + module jars mentioned above
  • $A2_BUCKET_HDFS_PATH - path of the bucket in hdfs
  • $A2_BUCKET_PATH - subpath to the bucket
  • $A2_BUCKET_STR - string of the bucket's JSON (so you can recreate the bucket)
  • $A2_BUCKET_SIGNATURE - The bucket's unique signature (eg used to generate Elasticsearch indexes, Kafka topics, etc)
  • $A2_TEST_NUM_OBJECTS - if this is a test run, contains how many test objects were requested
  • $A2_TEST_MAX_RUNTIME_S - if this is a test run, contains the maximum number of seconds the test should run
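A minimal sketch of a script that uses some of these (all the variables are injected by the harvester; the output paths are just examples):

#!/bin/bash
# snapshot the bucket's JSON so the run can be reproduced later
echo "$A2_BUCKET_STR" > /tmp/bucket_config.json
echo "Bucket path: $A2_BUCKET_PATH"
# behave differently during test runs
if [ -n "$A2_TEST_NUM_OBJECTS" ]; then
   echo "Test run: producing at most $A2_TEST_NUM_OBJECTS objects"
fi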

Required Assets (required_assets)

If this bucket requires additional resources that are located on some (or all) nodes, you can add the paths to those resource files here as a list. The bucket will only run on nodes that can locate all of these files. This is useful if your script calls other scripts, as in the sketch below.
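For example (the paths and the run_collection helper are hypothetical), a bucket whose inline script sources a shared helper could declare:

{
   "script": "source /opt/scripts/helpers.sh\nrun_collection\n",
   "required_assets": ["/opt/scripts/helpers.sh"]
}

so that it is only scheduled on nodes where /opt/scripts/helpers.sh exists.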

Global configuration

In most cases it will not be necessary to apply global configuration, but the following fields are configurable per shared library:

{
   "working_dir": string, // defaults to System.getProprety("java.io.tmpdir") which is typically /tmp
}

Where:

working_dir: the location the bash script is run from (e.g. as if you had cd'd to /tmp and run your bash script from there)

Example Inline-script config (using v1 JS block):

"config": {"script": "$$SCRIPT_bash$$"},
"scripting": {
                "bash": {
                    "script": "touch /tmp/r_1\necho \"hi\" > /tmp/r_1",
                    "separator_regex": "//ALEPH2_MODULE-.*"
                },
                "sub_prefix": "$$SCRIPT_",
                "sub_suffix": "$$"
            },
//...rest of v2 bucket
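Assuming the v1 substitution behaves as the sub_prefix/sub_suffix fields suggest, the config the harvester actually receives would be equivalent to:

"config": {"script": "touch /tmp/r_1\necho \"hi\" > /tmp/r_1"},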

Example Uploaded file script config:

 "harvest_configs": [{
                "config": {"resource_name": "my_script.sh"}, //step 2, point to file in jar
                "enabled": true,
                "library_ids_or_names": ["/app/aleph2/library/my_uploaded_scripts.jar"], //step 1, reference uploaded jar with script
                "name": "harvester_1"
            }],

Logging

The script harvester logs at INFO with subsystem ScriptHarvestService every time the script is started or restarted.

Currently there is no way to retrieve the output of the process into the user logs, though this is on the roadmap.