
Class EngineDVFileJson

This page describes the inner workings of the class EngineDVFileJson, which implements the engine for using external algorithms with dense vectors represented in JSON format and handled out-of-memory.

Plan for how to change this later

  • decide on the best way to invoke the scripts on Linux, Windows, and Mac, and implement it
  • may need to be smarter about locating the Python interpreter

But for now, we simply call the train.sh and apply.sh scripts, which in turn set up everything needed to run the actual wrapper's Python script with the proper Python command.
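
For orientation, here is a rough sketch of where the files mentioned on this page live. Only names that actually appear on this page are shown; entries in angle brackets and the exact placement of the meta, model, and library files are placeholders and may differ in the actual implementation:

```
dataDirectory/                  exported to the scripts as GATE_LF_DATA_DIR
  <wrappername>.yaml            wrapper config: shellcmd, shellparms, PYTHON_BIN, ...
  <meta file>, <model files>    paths passed to the scripts
  FileJsonPytorch/              main wrapper directory (WRAPPER_HOME), copied here on demand
    wrapperInfo.yaml            wrapper info
    train.sh / train.cmd        training entry point
    apply.sh / apply.cmd        application entry point
    train.py                    backend training program ("library root directory", see below)
```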

Other TODOs

  • make sure the wrapper config (wrappername.yaml) in the datadir is properly used, including for finding the proper Python interpreter and for running the shell script with the proper shell (see the config-reading sketch after this list)
  • make sure the wrapper info (wrapperInfo.yaml) in the wrapper directory is properly used
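
Since the wrapper config is treated as a flat key/value map (see the training steps below), reading it could look roughly like the sketch below. It assumes SnakeYAML is on the classpath; the class and method names are illustrative, not the plugin's actual API.

```java
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Map;

import org.yaml.snakeyaml.Yaml;

public class WrapperConfigSketch {
  /**
   * Read WRAPPER_NAME.yaml from the data directory into a flat key/value map
   * (keys such as shellcmd, shellparms, PYTHON_BIN). Returns null if no config exists.
   */
  @SuppressWarnings("unchecked")
  public static Map<String, Object> readWrapperConfig(File dataDir, String wrapperName) {
    File configFile = new File(dataDir, wrapperName + ".yaml");
    if (!configFile.exists()) {
      return null;
    }
    try (Reader reader = new FileReader(configFile)) {
      // SnakeYAML parses a simple "key: value" document into a Map
      return (Map<String, Object>) new Yaml().load(reader);
    } catch (IOException ex) {
      throw new RuntimeException("Could not read wrapper config " + configFile, ex);
    }
  }
}
```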

Protocol of use

Currently (as of 2018-04-16), the invocation protocol for engines is a bit complex. The required protocol depends on whether the engine is used for training or for application.

When training (a condensed call-sequence sketch follows this list):

  • The engine class gets selected in the PR based on the trainingAlgorithm runtime parameter
  • Engine.createEngine(trainingAlgorithm, algorithmParameters, featureInfo, TargetType, dataDirectory) is called
    • this executes the non-static initializeAlgorithm(algorithm, parms) method (overridden but empty for EngineDVFileJson)
    • then runs the method initWhenCreating(directory, algorithm, parms, featureInfo, targetType): for EngineDVFileJson, this essentially creates the instance of the appropriate corpus representation and sets the mode to "adding".
    • creates and initializes the Info instance
    • returns the Engine instance
  • document processing uses the corpus representation retrieved from the engine to add new instances
  • After all documents have been processed, the engine's info gets updated
  • Then engine.trainModel(dataDir, instanceAnnotationType, algoParms) gets called:
    • turns off adding for the corpus representation
    • updates the info
    • copies the whole wrapper software unless already there (based on WRAPPER_NAME)
    • creates the command to invoke the training script, also using the settings in the config file WRAPPER_NAME.yaml which is treated as a key/value map
    • See section "Script Invocation" below
    • this optionally uses the settings shellcmd and shellparms for running the shell script
    • TODO: this should also allow configuring the Python path and the Python interpreter location
    • before running the command, sets the environment variable WRAPPER_HOME, which points to a subdirectory of the data directory
    • runs the command
    • updates the info and saves it
    • saves the featureInfo (NOTE: this is currently done again later in the saveEngine method)
  • Finally engine.saveEngine(dataDir) gets called (from base class Engine) which:
    • saves the feature info using featureInfo.save(dir)
    • invokes the engine-specific saveModel(dir) method; in this case this does nothing, since the model gets saved by the scripts we call
    • invokes the engine-specific saveCorpusRepresentation(dir) method, which in this case does nothing, since the corpus representation is already out-of-memory and stored in a file
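
Condensed, the training-time sequence on the calling side looks roughly like the following sketch. Only calls named above are shown as code; the parts whose API is not described on this page (adding instances via the corpus representation, updating the Info) are left as comments, and the variable names are illustrative.

```java
// Sketch of the training-time call sequence (variable names are illustrative;
// the actual values come from the PR's runtime parameters).
Engine engine = Engine.createEngine(
    trainingAlgorithm, algorithmParameters, featureInfo, targetType, dataDir);

// ... for each document: add its instances through the corpus representation
//     retrieved from the engine, then update the engine's Info ...

engine.trainModel(dataDir, instanceAnnotationType, algoParms);

// saveEngine() (base class) saves the feature info and calls saveModel()
// and saveCorpusRepresentation(), which are no-ops for this engine.
engine.saveEngine(dataDir);
```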

When applying a model (a sketch of the per-instance JSON exchange follows this list):

  • call the static method Engine.loadEngine(datadir, parms) -- this in turn runs:
    • load the Info
    • load the FeatureInfo
    • create a new instance of the Engine class (which is stored in the Info)
    • set the info in the new instance
    • call the engine's initWhenLoading(dir, parms) method. This is NOT overridden by the EngineDVFileJson class and calls:
      • the engine-specific loadModel(dir, parms) method, which is overridden:
        • runs loadAndSetCorpusRepresentation(dir) (NOTE: this duplicates the call further below, but is harmless)
        • if not already there, copies the wrapper software
        • builds the command for running the application script (like the training command, just with a different script name)
        • See section "Script Invocation" below
        • starts the script to communicate with
      • runs the engine-specific loadAndSetCorpusRepresentation(dir) method
      • creates the algorithm instance
      • calls the engine's initializeAlgorithm(algorithm, parms) method -- overridden but does nothing
  • processes all the documents, calling engine.applyModel(...). This is overridden to, for each instance in the document:
    • convert the annotation to json
    • send the json to the process
    • get back the json from the process
    • convert what we get back to model application instances and collect them
    • return all the model application instances
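
At application time the engine thus keeps the wrapper process running and exchanges one JSON message per instance over the process's standard input and output. Below is a minimal, standard-Java sketch of that round trip, assuming a line-oriented exchange (one JSON document per line); the real class does its own JSON conversion and may manage the streams differently.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class JsonProcessBridgeSketch {
  private final BufferedWriter toProcess;
  private final BufferedReader fromProcess;

  public JsonProcessBridgeSketch(Process process) {
    toProcess = new BufferedWriter(
        new OutputStreamWriter(process.getOutputStream(), StandardCharsets.UTF_8));
    fromProcess = new BufferedReader(
        new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8));
  }

  /** Send one instance as a JSON line and read back the JSON response line. */
  public String exchange(String instanceJson) throws IOException {
    toProcess.write(instanceJson);
    toProcess.newLine();
    toProcess.flush();               // assumption: the wrapper answers one line per instance
    String responseJson = fromProcess.readLine();
    if (responseJson == null) {
      throw new IOException("Wrapper process closed its output stream");
    }
    return responseJson;
  }
}
```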

Script Invocation

Both the training and application scripts get invoked in the following way (a ProcessBuilder-style sketch follows this list):

  • depending on the OS, the train.sh/train.cmd or apply.sh/apply.cmd script gets run
  • if there is a wrapper config file that defines "shellcmd" and optionally "shellparms", that program and those options are used to invoke the script
  • the environment variable WRAPPER_HOME is set to the wrapper-specific directory
  • the environment variable GATE_LF_DATA_DIR is set to the data directory that contains the data and model (this is the parent of WRAPPER_HOME)
  • the environment variable PYTHON_BIN is set to whatever is defined as PYTHON_BIN in the wrapper config file, or to some default determined by the engine (currently just "python" on Linux and C:\Users\johann\Miniconda3\python.exe on Windows)
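
A plain-Java way to assemble such an invocation is sketched below using ProcessBuilder: an optional shellcmd/shellparms prefix, the OS-specific script, the script arguments (meta file, model prefix, algorithm parameters), and the three environment variables listed above. The class, method, and parameter names are illustrative, and the fallback Python values are placeholders rather than the plugin's actual defaults.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ScriptInvocationSketch {
  /**
   * Build and start train.sh/train.cmd or apply.sh/apply.cmd for the wrapper.
   * shellCmd, shellParms and pythonBin come from the wrapper config and may be null.
   */
  public static Process startScript(File dataDir, File wrapperHome, String scriptBaseName,
      String shellCmd, String shellParms, String pythonBin,
      List<String> extraArgs) throws IOException {
    boolean isWindows = System.getProperty("os.name").toLowerCase().contains("windows");
    File script = new File(wrapperHome, scriptBaseName + (isWindows ? ".cmd" : ".sh"));

    List<String> command = new ArrayList<>();
    if (shellCmd != null) {                          // optional shell to run the script with
      command.add(shellCmd);
      if (shellParms != null && !shellParms.trim().isEmpty()) {
        command.addAll(Arrays.asList(shellParms.trim().split("\\s+")));
      }
    }
    command.add(script.getAbsolutePath());
    command.addAll(extraArgs);                       // meta file, model prefix, algorithm parameters

    ProcessBuilder builder = new ProcessBuilder(command);
    builder.environment().put("WRAPPER_HOME", wrapperHome.getAbsolutePath());
    builder.environment().put("GATE_LF_DATA_DIR", dataDir.getAbsolutePath());
    builder.environment().put("PYTHON_BIN",
        pythonBin != null ? pythonBin : (isWindows ? "python.exe" : "python"));
    builder.redirectErrorStream(true);               // merge stderr into stdout for easier logging
    return builder.start();
  }
}
```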

Training

For training, the command we currently build consists of the following components:

  • the name of the script within the main wrapper directory (e.g. FileJsonPytorch), which is "train." plus the OS-specific extension, e.g. "train.sh"
  • the absolute path of the meta file
  • the absolute path of the model file (prefix)
  • plus any optional algorithm parameters

The script is then supposed to do the following:

  • find the proper Python command
  • find the location of the Python file to use for training (train.py in the library root directory)
  • set up the environment variable PYTHONPATH to include the libraries
  • invoke the training script using the Python interpreter found, passing the following parameters:
    • the full path of the meta file
    • the model file (prefix)
    • plus any optional algorithm parameters

Application

For application, the apply.sh script is invoked, which in turn runs the backend library's append.py program. That program reads instances from standard input and writes back the result of applying the model.

The JSON interchange format used by EngineDVFileJson is documented separately.