EgineDVFileJson_InterchangeFormat - GateNLP/gateplugin-LearningFramework GitHub Wiki

EngineDVFileJson Interchange Format

This is the format of the JSON data exchanged during application. The Learning Framework (LF) sends instance data in JSON format to the backend and reads back the result of applying the model to that data.

Format for sending instance data, LF to backend

  • For sequence tagging and classification, a single instance (NOT a batch) is sent. (The apply.py program converts the single instance into a batch of size 1.)
  • The instance representation sent is what is returned by internal2json(instance,true)
  • For sequence tagging:
    • A list of lists (list of sequence elements, where each sequence element is a list of features)
  • For classification:
    • A list of features
  • Each line sent for processing contains exactly one instance
  • (Not actually done at the moment: to indicate the end of processing, a line with STOP is sent instead of an instance.)
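The two instance layouts described above can be illustrated with a minimal sketch. The feature values are made up, and the helper names `to_line` and `from_line` are hypothetical; the real representation is whatever internal2json(instance,true) produces.

```python
import json

# Hypothetical feature values, for illustration only.
classification_instance = ["boy", "NN", 3]                 # a list of features
sequence_instance = [["Finally"], [","], ["a"], ["boy"]]   # one feature list per sequence element


def to_line(instance):
    """Serialize one instance as a single JSON line (json.dumps emits no newlines by default)."""
    return json.dumps(instance)


def from_line(line):
    """Parse one received line back into an instance."""
    return json.loads(line)


line = to_line(sequence_instance)
print(line)
```

Because each line carries exactly one instance, the receiving side can process its input line by line without any additional framing.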

NOTE

At least for now, the convention of the LF is to send one instance per batch, but the convention of the apply function in the backend is that batches of any number of instances are accepted. The conversion from single instance to batch happens in the apply.py script.

Conversion of the instance data in the backend

In the apply.py program

  • The single instance is converted into a batch with one element (list that contains a single instance)
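As a rough sketch of this step, assuming the instance arrives as one JSON line: the function name `instance_line_to_batch` is hypothetical and not taken from apply.py.

```python
import json


def instance_line_to_batch(line):
    """Parse one JSON line holding a single instance and wrap it in a batch of size 1."""
    instance = json.loads(line)
    return [instance]


batch = instance_line_to_batch('[["a"], ["boy"]]')
print(len(batch))
```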

In the apply() method

  • Features are converted to "converted representation" (e.g. strings to vocab indices)
  • The batch is reshaped:
    • instead of a list of instances, the batch contains a list of features
    • for each feature, there is a list that contains the feature value/s for each example
    • since we only have one example, each list for a feature contains one element. That element can be a value (classification and simple feature) or a list (sequence tagging)
  • Eventually this is converted into
    • classification: tensor of shape (1, nfeatures)
    • sequence tagging: tensor of shape (1, nfeatures, seqlen)
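The first step, converting string features to vocabulary indices, might look roughly like the sketch below. The vocabulary, the unknown-word index, and the function names are all assumptions made for illustration; the backend's actual vocabulary handling may differ.

```python
# Hypothetical vocabulary; the indices mirror those used in the example on this page.
VOCAB = {"Finally": 1827, ",": 4, "a": 7, "boy": 3241}
UNKNOWN_INDEX = 0  # assumption: arbitrary index chosen here for out-of-vocabulary strings


def convert_feature(value):
    """Map a string feature to its vocabulary index; leave other values unchanged."""
    if isinstance(value, str):
        return VOCAB.get(value, UNKNOWN_INDEX)
    return value


def convert_instance(instance, is_sequence):
    """Convert every feature of one instance (a sequence of elements or a flat feature list)."""
    if is_sequence:
        return [[convert_feature(v) for v in element] for element in instance]
    return [convert_feature(v) for v in instance]


batch = [[["Finally"], [","], ["a"], ["boy"]]]
converted = [convert_instance(inst, is_sequence=True) for inst in batch]
print(converted)
```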

Example for sequence tagging:

  • original instance: [["Finally"],[","],["a"],["boy"],["in"],["the"],["back"],["raises"],["his"],["hand"],["."]]
  • converted batch of 1 instance: [[[1827], [4], [7], [3241], [10], [3], [99], [1], [66], [479], [2]]]
  • reshaped batch of 1 instance: [[[1827, 4, 7, 3241, 10, 3, 99, 1, 66, 479, 2]]]
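The reshaping from a list of instances to a list of features can be sketched as follows. This is an illustrative reimplementation, not the backend's code; the function name `reshape_batch` and its `is_sequence` flag are hypothetical.

```python
def reshape_batch(batch, is_sequence):
    """Turn a list of instances into a list of features.

    In the result, result[f][i] holds the value of feature f for example i:
    a single value for classification, a list over sequence positions
    for sequence tagging.
    """
    if is_sequence:
        nfeatures = len(batch[0][0])  # number of features per sequence element
        return [
            [[element[f] for element in instance] for instance in batch]
            for f in range(nfeatures)
        ]
    nfeatures = len(batch[0])  # number of features per instance
    return [[instance[f] for instance in batch] for f in range(nfeatures)]


converted = [[[1827], [4], [7], [3241], [10], [3], [99], [1], [66], [479], [2]]]
reshaped = reshape_batch(converted, is_sequence=True)
print(reshaped)
```

With a single feature and a single example, the outer two lists each have length 1 and the innermost list runs over the 11 sequence positions.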

Format for sending application results, backend to LF

The backend sends back a map with the following keys and values:

  • "status": "ok" on success, any other value if an error occurred
  • "error": present only if an error occurred; the value is the exception
  • "output": a single label (classification) or a list of labels (sequence tagging)
  • "conf": a single confidence value (classification) or a list of confidence values (sequence tagging)
  • "dist": a list of nclasses values (classification) or a list of lists with nclasses values (sequence tagging)
  • "labels": an array of the labels in the order of their indices (the same order as the scores in "dist")
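A minimal sketch of how a receiver might handle such a map for classification, assuming it arrives as one JSON object. All values are invented, and `parse_response` is a hypothetical helper, not part of the LF or the backend.

```python
import json


def parse_response(line):
    """Parse a result map and raise if the backend reported an error."""
    response = json.loads(line)
    if response.get("status") != "ok":
        raise RuntimeError(f"backend error: {response.get('error')}")
    return response


# Made-up classification result with two classes.
ok_line = json.dumps({
    "status": "ok",
    "output": "pos",
    "conf": 0.9,
    "dist": [0.1, 0.9],
    "labels": ["neg", "pos"],
})

result = parse_response(ok_line)
# "labels" is index-aligned with "dist", so the highest score can be mapped back to a label.
best = result["labels"][result["dist"].index(max(result["dist"]))]
print(best)
```

For sequence tagging the same keys apply, with "output" and "conf" holding one entry per sequence element and "dist" holding one list of nclasses values per element.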