Class_EngineDVFileJson - GateNLP/gateplugin-LearningFramework GitHub Wiki
# Class EngineDVFileJson
This page describes the inner workings of the class EngineDVFileJson, which implements the engine for using external algorithms with dense vectors represented in JSON format (and handled out-of-memory).
## Plan for how to change this later
- decide on the best way to invoke the scripts on Linux, Windows, and Mac, and implement it
- may need to be intelligent about finding Python
But for now, we simply call the train.sh and apply.sh scripts, which in turn set everything up to run the actual wrapper's Python script using the proper Python command.
## Other TODOs
- make sure the wrapper config (`wrappername.yaml`) in the datadir is properly used, including for finding the proper Python and for running the shell script using the proper shell
- make sure the wrapper info (`wrapperInfo.yaml`) in the wrapper directory is properly used
## Protocol of use
Currently (as of 2018-04-16), the invocation protocol for engines is a bit complex. The required protocol depends on whether the engine is used for training or for application.
When training:
- The engine class gets selected in the PR based on the trainingAlgorithm runtime parameter.
- `Engine.createEngine(trainingAlgorithm, algorithmParameters, featureInfo, targetType, dataDirectory)` is called; this:
  - executes the non-static `initializeAlgorithm(algorithm, parms)` method (overridden but empty for EngineDVFileJson)
  - then runs the method `initWhenCreating(directory, algorithm, parms, featureInfo, targetType)`: for EngineDVFileJson, this essentially creates the instance of the appropriate corpus representation and sets the mode to "adding"
  - creates and initializes the Info instance
  - returns the Engine instance
- Document processing uses the corpus representation retrieved from the engine to add new instances.
- After all documents have been processed, the engine's info gets updated.
- Then `engine.trainModel(dataDir, instanceAnnotationType, algoParms)` gets called, which:
  - turns off adding for the corpus representation
  - updates the info
  - copies the whole wrapper software unless it is already there (based on `WRAPPER_NAME`)
  - creates the command to invoke the training script, also using the settings in the config file `WRAPPER_NAME.yaml`, which is treated as a key/value map; see section "Script Invocation" below. This optionally uses the settings `shellcmd` and `shellparms` for running the shell script. TODO: this should also allow configuring the Python path and Python location.
  - before running the command, sets the environment variable `WRAPPER_HOME`, which is a subdirectory of the data directory
  - runs the command
  - updates the info and saves it
  - saves the featureInfo (NOTE: this is currently done again later in the saveEngine method)
- Finally, `engine.saveEngine(dataDir)` gets called (from the base class Engine), which:
  - saves the feature info using `featureInfo.save(dir)`
  - invokes the engine-specific `saveModel(dir)` method; in this case, this does nothing since the model gets saved by the scripts we call
  - invokes the engine-specific `saveCorpusRepresentation(dir)` method, which in this case does nothing, since the corpus representation is already out-of-memory and stored to a file
When applying a model:
- `engine.loadEngine(datadir, parms)` gets called; this static method in turn:
  - loads the Info
  - loads the FeatureInfo
  - creates a new instance of the Engine class (which is stored in the Info)
  - sets the info in the new instance
  - calls the engine's `initWhenLoading(dir, parms)` method. This is NOT overridden by the EngineDVFileJson class and calls:
    - the engine-specific `loadModel(dir, parms)` method, which is overridden and:
      - runs `loadAndSetCorpusRepresentation(dir)` (NOTE: this is a duplicate invocation, but a harmless one; see below)
      - copies the wrapper software if it is not already there
      - builds the command for running the application script (similar to the training script, just a different name); see section "Script Invocation" below
      - starts the script, which the engine then communicates with
    - the engine-specific `loadAndSetCorpusRepresentation(dir)` method
    - creates the algorithm instance
    - calls the engine's `initializeAlgorithm(algorithm, parms)` method (overridden, but it does nothing)
- Document processing then calls `engine.applyModel(...)`, which is overridden to, for each instance in the document:
  - convert the annotation to JSON
  - send the JSON to the process
  - get the result JSON back from the process
  - convert what we get back to model application instances and collect them
  - return all the model application instances
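The per-instance exchange described above can be sketched from the wrapper's side. This is a hypothetical illustration only: the function names, the `"values"`/`"prediction"` JSON fields, and the dummy model are made up here, and the actual interchange format is documented separately.

```python
import io
import json

def handle_instance(line, model):
    # NOTE: the field names "values" and "prediction" are illustrative
    # assumptions, not the real interchange format.
    instance = json.loads(line)
    return json.dumps({"prediction": model(instance["values"])})

def serve(stream_in, stream_out, model):
    # One JSON instance per input line, one JSON prediction per output line.
    # Flushing after every line matters: the Java side blocks until it has
    # read the reply for the instance it just sent.
    for line in stream_in:
        line = line.strip()
        if not line:
            continue
        stream_out.write(handle_instance(line, model) + "\n")
        stream_out.flush()

# Demo with in-memory streams and a dummy "model" that returns the
# index of the largest value:
inp = io.StringIO('{"values": [0.1, 0.9]}\n{"values": [0.7, 0.2]}\n')
out = io.StringIO()
serve(inp, out, lambda values: values.index(max(values)))
print(out.getvalue())  # {"prediction": 1}\n{"prediction": 0}\n
```

In the real setup, `stream_in`/`stream_out` would be the process's stdin/stdout connected to the Java engine via pipes.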
## Script Invocation
Both the training and application scripts get invoked in the following way:
- depending on the OS, the train.sh/train.cmd or apply.sh/apply.cmd script gets run
- if there is a wrapper config file that defines `shellcmd` and optionally `shellparms`, that program and those options are used to invoke the script
- the environment variable `WRAPPER_HOME` is set to the wrapper-specific directory
- the environment variable `GATE_LF_DATA_DIR` is set to the data directory that contains the data and model (this is the parent of `WRAPPER_HOME`)
- the environment variable `PYTHON_BIN` is set to whatever is defined in the wrapper config file as `PYTHON_BIN`, or to some default determined by the engine (currently just `python` on Linux and `C:\Users\johann\Miniconda3\python.exe` on Windows)
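The steps above could be assembled roughly as follows (a minimal sketch in Python for illustration; the real code lives in the Java engine class). The config keys `shellcmd`, `shellparms`, and `PYTHON_BIN` are the settings named on this page; the concrete paths are made up:

```python
import os

def build_invocation(script_path, config, wrapper_home, data_dir):
    """Build (command_list, environment) for running a wrapper script.
    `config` is the key/value map read from the wrapper config file."""
    cmd = []
    if "shellcmd" in config:
        # Invoke the script through the configured shell, with optional options.
        cmd.append(config["shellcmd"])
        cmd.extend(config.get("shellparms", "").split())
    cmd.append(script_path)
    env = dict(os.environ)
    env["WRAPPER_HOME"] = wrapper_home
    env["GATE_LF_DATA_DIR"] = data_dir  # the parent of WRAPPER_HOME
    env["PYTHON_BIN"] = config.get("PYTHON_BIN", "python")
    return cmd, env

# Example (paths are hypothetical):
cmd, env = build_invocation(
    "/data/FileJsonPytorch/train.sh",
    {"shellcmd": "/bin/bash", "PYTHON_BIN": "/usr/bin/python3"},
    wrapper_home="/data/FileJsonPytorch",
    data_dir="/data",
)
print(cmd)  # ['/bin/bash', '/data/FileJsonPytorch/train.sh']
```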
## Training
For training, the command we currently build consists of the following components:
- the script name within the main wrapper directory (e.g. FileJsonPytorch), which is "train." plus the OS-specific extension, e.g. "train.sh"
- the absolute path of the meta file
- the absolute path of the model file (prefix)
- any optional algorithm parameters
The script is then supposed to do the following:
- find the proper Python command
- find the location of the Python file to use for training (train.py in the library root directory)
- set up the environment variable PYTHONPATH to include the libraries
- invoke the training script using the Python interpreter found, passing the following parameters:
  - the full path of the meta file
  - the model file (prefix)
  - any optional algorithm parameters
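What the training script effectively does might be sketched like this (a hypothetical illustration; the directory layout of the wrapper library, the `lib` subdirectory name, and the example file names are assumptions, not the actual layout):

```python
import os
import shutil

def build_train_command(wrapper_home, meta_file, model_prefix, algo_parms=()):
    """Sketch of what train.sh effectively does: locate Python, extend
    PYTHONPATH with the wrapper's libraries, and build the argv for train.py."""
    # Prefer the PYTHON_BIN set by the engine, else look for python on PATH.
    python_bin = os.environ.get("PYTHON_BIN") or shutil.which("python") or "python"
    env = dict(os.environ)
    libdir = os.path.join(wrapper_home, "lib")  # assumed library location
    env["PYTHONPATH"] = libdir + os.pathsep + env.get("PYTHONPATH", "")
    argv = [python_bin, os.path.join(wrapper_home, "train.py"),
            meta_file, model_prefix, *algo_parms]
    return argv, env

# Example invocation (all paths and parameters are made up):
argv, env = build_train_command(
    "/data/FileJsonPytorch", "/data/crvd.meta.json",
    "/data/FileJsonPytorch.model", algo_parms=["--epochs", "20"],
)
print(argv[2:])  # ['/data/crvd.meta.json', '/data/FileJsonPytorch.model', '--epochs', '20']
```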
## Application
For application, the apply.sh script is invoked, which in turn runs the backend library's append.py program. The program reads instances from standard input and writes back the result of applying the model.
The JSON interchange format used by EngineDVFileJson is documented separately.