JobManager

The class uk.ac.gate.cloud.job.JobManager is the main entry point if you want to configure, control or monitor existing GATE Cloud annotation jobs that you have previously reserved.

Creating a client

To create a client instance you need your API key ID and password - if you do not have an API key you can generate one from your account page on the GATE Cloud website. You will need to enable the "job management" permission for the API key.

JobManager mgr = new JobManager("<key id>", "<password>");

Accessing your jobs

You can list all the annotation jobs using the listJobs method

List<JobSummary> allMyJobs = mgr.listJobs();

You can filter the list based on the job state, for example to check whether any jobs are currently running. Note that the objects returned from listJobs are just brief summaries of each job's state; you must call the details() method on a summary to fetch the full details (which requires another HTTP call).
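For example, a minimal sketch of checking for running jobs (the varargs state filter on listJobs and the ACTIVE constant are assumptions here - check the Javadoc for the exact names):

List<JobSummary> running = mgr.listJobs(JobState.ACTIVE); // assumed varargs state filter
for(JobSummary summary : running) {
  Job details = summary.details(); // one extra HTTP call per job
  System.out.println(details.id + ": " + details.name);
}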

If you know the specific job you are interested in you can access it directly using getJob

Job job = mgr.getJob(15);

The Job class

The Job class is the primary interface to a single annotation job. Instances of this class can be fetched from the JobManager, or returned by a shop Item when you reserve a job. The class has public fields holding various details about the job, both static information such as the job's ID and name, and attributes which update as the job runs such as the amount of data processed. The Job object is a snapshot of the job's state at the point when it was retrieved; you can use the refresh() method to update the object's state from the server.

The public methods of Job break down into several categories:

  • defining the documents to be processed (the add*Input methods)
  • defining what to do with the documents once they have been annotated (the add*Output methods)
  • controlling the job (start(), stop(), etc.)
  • monitoring the job's progress (executionLog)
  • downloading the job's results once it is complete (results())

The documents to process

Job provides several methods to specify what documents this job should process. GATE Cloud annotation jobs take their input from data bundles, which can contain ZIP or TAR archives of documents (.zip, .tar, .tar.gz or .tar.bz2), Internet Archive ARC or WARC files (for example, those produced by the Heritrix web crawler), or, in the case of social media data, JSON files in the format returned by the Twitter streaming endpoints or in the interaction format produced by DataSift. In any of these cases you can either upload local files from your own machine or point to files hosted on Amazon S3; see the DataManager page for full details of how to upload a data bundle.

The addBundleInput method configures the job to take its input from a data bundle. It takes a single parameter, the bundle identifier, and all other configuration details for the input specification are inherited from the bundle. It returns an InputDetails object representing the input you have just created, but in most cases the input will not require further configuration.
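As a sketch (the bundle identifier 42 here is hypothetical - use the ID of a bundle you have uploaded via the DataManager):

InputDetails input = job.addBundleInput(42); // hypothetical bundle ID; other settings are inherited from the bundle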

Outputs

Once you have defined what documents you want to process, you must decide what to do with the resulting annotations. There are two main options:

  • save the annotations to files in one of several different formats (the output files will be packaged into ZIP archives which you can download when the job is complete) or
  • send the documents to a Mímir server for indexing.

For files, there are several different formats available:

  • GATE XML - the standoff annotation format used in GATE Developer
  • XCES - an XML-based standoff annotation format that puts the plain text into one file and the annotations into another (expressed as character offsets into the text)
  • Inline XML with the annotations expressed as XML elements - note that this format cannot represent annotations that partially overlap (i.e. where neither annotation completely encloses the other)

You can also opt to save annotations in JSON in the form used by Twitter to represent "entities" in Tweets. Each document becomes a JSON object with a "text" property containing the text and an "entities" property representing the annotations grouped by type. If the original input files were Twitter JSON then this format attempts to preserve the original JSON as much as possible but with the annotations added. The JSON objects representing each processed document are concatenated together, separated by newlines, and the whole bundle is compressed using GZIP. JSON outputs do not go into the ZIP archives along with other output types.

You can add several independent output specifications saving different groups of annotations in different formats for the same document. The annotations to save are given using annotation selector expressions as detailed in the REST API documentation on cloud.gate.ac.uk.

Output persons = job.addFileOutput(OutputType.XCES, ".person.xml", ":Person"); // Person annotations only
Output locOrg = job.addFileOutput(OutputType.XCES, ".loc-org.xml", ":Location, :Organization"); // two types in one output
Output json = job.addJSONOutput(null); // null means default selectors from the pipeline

To index documents in Mímir you must specify the URL of the index you want to push to, and the username and password if required.

Output mimir = job.addMimirOutput("http://mimir.example.com/sample-index", "manager", "p4ssw0rd");

Controlling the job

The third category of Job methods is concerned with controlling the job's execution. Once a job is fully configured it can be started using

job.start();

A running job can be aborted using job.stop(), which will attempt to stop the job gracefully, aborting any sub-tasks that are already in progress and preventing any un-started tasks from beginning.

If your account runs out of funds while a job is executing that job will be automatically suspended by the system. In-progress tasks for a suspended job will be allowed to complete but no further tasks will start - once funds are available you can tell the job to continue using job.resume(). Once a job is complete and you have downloaded its results (see below) you can reset() it to enable you to modify its settings and run it again. Finally, if you have completely finished with a job you can delete() it to free up resources (and stop the system pestering you!).
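Each of these lifecycle calls applies in the state described above, not in sequence - a sketch, assuming a job object obtained as shown earlier:

job.stop();   // abort a running job gracefully
job.resume(); // continue a job that was suspended for lack of funds
job.reset();  // after downloading results, make the job configurable and runnable again
job.delete(); // discard the job entirely when you have finished with it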

Monitoring job execution

As your job executes it generates various logging messages, which can be fetched using the executionLog method. This method can optionally take a range of timestamps to fetch messages just within a particular time window - typically if you are polling repeatedly for log messages you would provide a from timestamp of the last time you polled, so as not to receive the same messages over and over again.

Calendar timestamp = Calendar.getInstance();
while(job.state != JobState.COMPLETED) {
  Thread.sleep(15000); // poll every 15 seconds (Thread.sleep may throw InterruptedException)
  Calendar pollTime = Calendar.getInstance(); // note the time before fetching, so no messages fall into a gap
  List<LogMessage> messages = job.executionLog(timestamp, null); // null "to" means everything after "from"
  timestamp = pollTime; // next poll starts where this one ended
  doSomethingWithLog(messages);
  job.refresh(); // update job.state from the server
}

Obtaining the results

The results of an annotation job become a data bundle, which can be downloaded using the data API. The resultBundle() method gives you a DataBundle object containing the results of the most recent execution, or null if the job has not completed.

Aside from the main result bundle, a job produces a summary report containing statistics on the data that was processed. Additional files may be added in the future. To download these supplementary files you can use the reports() method. This gives you a list of objects, each of which has a urlToDownload() method. This extra level of indirection is required because the download URLs are time-limited: if they were all generated up front and downloading them all took longer than 15 minutes, the later ones would expire before they could be used. It is therefore important to start downloading from the URL returned by urlToDownload() as soon as you have requested it.

DataBundle bundle = job.resultBundle();
// see the data API page for more details

for(Downloadable res : job.reports()) {
  URL u = res.urlToDownload();
  String filename = u.getPath();
  filename = filename.substring(filename.lastIndexOf('/') + 1); // keep just the file name, without the leading slash
  FileUtils.copyURLToFile(u, new File(outputDir, filename)); // from Apache commons-io
}

Alternatively, if you have your own Amazon S3 account you can configure the job to send its output directly there, rather than saving it in a data bundle for you to download. To do this, call job.outputToS3 before you start the job.
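The exact signature of outputToS3 may differ from this sketch - the parameters shown (bucket location, access key, secret key) are assumptions, so check the Javadoc:

// hypothetical parameters - consult the Javadoc for the real outputToS3 signature
job.outputToS3("s3://my-bucket/gate-output", "<access key id>", "<secret key>");
job.start();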
