DataManager - GateNLP/cloud-client GitHub Wiki

The class uk.ac.gate.cloud.data.DataManager is the main entry point if you want to upload or download data stored on the GATE Cloud platform in persistent "data bundles".

Creating a client

To create a client instance you need your API key ID and password - if you do not have an API key you can generate one from your account page on the GATE Cloud website. You will need to enable the "read data bundles" and/or "write data bundles" permissions for the API key: "read" permits listBundles, getBundle and the downloading of files from a bundle, while "write" permits the creation and upload of new bundles as well as operations like "rename" that modify existing bundles.

DataManager mgr = new DataManager("<key id>", "<password>");

Accessing your data bundles

You can list all the data bundles owned by your user account using the listBundles method:

List<DataBundleSummary> allMyBundles = mgr.listBundles();

Note that the objects returned from listBundles are just a summary of each bundle's essential properties; you must call the details() method to fetch the full details (which requires another HTTP call).
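
For example, you could expand each summary into its full details like this (a minimal sketch, assuming details() returns a DataBundle as the description above suggests - check the JavaDoc for the exact return type):

for (DataBundleSummary summary : allMyBundles) {
  DataBundle full = summary.details(); // one extra HTTP call per bundle
  // work with the full bundle here
}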

If you know which specific bundle you are interested in, you can access it directly by ID, or by URL if you have a detail URL returned by another API call, using getBundle:

DataBundle bundle = mgr.getBundle(27);
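
If what you have is a detail URL rather than a numeric ID, it can be passed to getBundle instead. This assumes the String-accepting overload implied above; the angle-bracket placeholder follows the same convention as the earlier examples:

DataBundle bundle = mgr.getBundle("<detail URL>");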

Creating new data bundles

The DataManager provides methods to create new data bundles, either by uploading local files to a storage location managed by GATE Cloud or by reference to existing objects stored in your own bucket on Amazon S3. Since the primary use of user-created data bundles in the GATE Cloud platform is to act as input to annotation jobs, each bundle maintains metadata describing the type of data it contains and the settings that should be applied to jobs that read from the bundle. In particular, all files within the same bundle must be of the same kind and share the same settings.

A bundle may contain any of the types of file that are valid as input for an annotation job, namely

  • archive files in the ZIP or (optionally compressed) TAR formats
  • web archive files in either the Heritrix ARC format or the standard WARC format used by various web crawlers
  • social media streams, in the JSON-based formats produced by Twitter APIs or by the DataSift platform.

ARC and WARC bundles are created using the mgr.createARCBundle* methods, and other bundles (ZIP, TAR or JSON) are created using the mgr.createArchiveBundle* methods. In both cases two variants are available: one creates the bundle by uploading local files, the other points to existing files on S3. See the JavaDoc documentation for full details.

In the case of uploads you can either pass one or more files directly to the create*FromUploads method, which will upload these files and return a fully-configured bundle, or you can create an empty bundle and then add files to it one by one using the addFile methods of the returned DataBundle, finally calling close() on the bundle yourself, as sketched below.
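
A rough sketch of the second, incremental approach follows. The method name createArchiveBundle and the parameters shown for it and for addFile are assumptions; only addFile, close() and the general workflow come from the description above, so consult the JavaDoc for the real signatures:

// create an empty archive bundle, add local files one by one, then close it
DataBundle bundle = mgr.createArchiveBundle("My tweets");  // bundle name parameter assumed
bundle.addFile(new File("tweets-part1.json"));             // addFile(File) signature assumed
bundle.addFile(new File("tweets-part2.json"));
bundle.close();  // finalize the bundle once all files have been added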

The DataBundle class

The DataBundle class is the primary interface to a single data bundle. Instances of this class can be fetched from the DataManager or returned from other APIs (e.g. the resultBundle() method of a Job). The class has public fields holding the details of the bundle; each instance is a snapshot of the bundle's state at the point when it was retrieved.

If the data bundle permits its contents to be directly downloaded, you can do this using the files field. This is a list of objects, each of which has a urlToDownload() method. This extra level of indirection is required because the download URLs are time-limited - if they were all generated up front and it took longer than 15 minutes to download them all, the later ones would expire and fail - so it is important to start downloading from the URL returned by urlToDownload() as soon as you have requested it.
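
A minimal download loop might look like the following. Only urlToDownload() is taken from the description above; the element type name, its return type of java.net.URL, and the output file names are assumptions, and exception handling is omitted for brevity:

int i = 0;
for (DownloadableFile f : bundle.files) {          // element type name assumed
  java.net.URL url = f.urlToDownload();            // fresh, time-limited URL - use it straight away
  try (java.io.InputStream in = url.openStream()) {
    java.nio.file.Files.copy(in, java.nio.file.Paths.get("file-" + (i++)));
  }
}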

Finally, the DataBundle class provides methods to rename an existing bundle and to delete it once it is no longer required. Note that uploaded data bundles incur monthly storage charges, so it is important to delete them when you no longer need their data.
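
For example (the method names rename and delete are assumptions based on the description above - see the JavaDoc for the exact signatures):

bundle.rename("tweets-2015-processed");  // give the bundle a more descriptive name
bundle.delete();                         // remove the bundle so it no longer incurs storage charges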
