DevGuide Cells - VizierDB/vizier-scala GitHub Wiki
Note: This page describes the v2.0 API
Overview
A cell command defines the logic for one type of 'module' in a Vizier workflow. This includes everything from language-specific cells to data-visualization and file-loading cells. Every cell has a corresponding command. See the info.vizierdb.commands package for examples.
Notation
Cells are grouped into a three-tiered hierarchy:
- category: A collection of packages that are grouped together for display purposes
- package: A collection of related commands.
- command: A specific operation in a cell.
A specific command is identified by a two-part identifier: `package.command`
Installing Commands
Commands are managed by info.vizierdb.commands.Commands. Since Vizier does not yet support plugin modules, all commands are hard-coded into this package. To make a command visible in Vizier's new-cell interface, add your command object to one of the existing calls to `register` in Commands, or add a new `register` call to create a new package. The format is:
"uniqueCommandId" -> info.vizierdb.commands.command.object.classpath.here
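As an illustrative sketch (the exact `register` signature may differ between versions; check the existing calls in Commands.scala — `mypackage`, `mycmd`, and `MyCommand` are hypothetical names):

```scala
// Hypothetical sketch: the register signature and all names here are
// placeholders -- mirror one of the existing calls in
// info.vizierdb.commands.Commands in your checkout.
register(id = "mypackage", name = "My Package", category = "code")(
  "mycmd" -> info.vizierdb.commands.mypackage.MyCommand
)
```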
Implementing Commands
A command is defined by an object that mixes in the info.vizierdb.commands.Command trait. This trait requires implementations of the following methods:
- `name`: The user-facing name of the command.
- `parameters`: A sequence of info.vizierdb.commands.Parameters that describes the arguments to the command. The parameter list is used to generate a default interface for the cell in the UI, and to serialize/deserialize the actual arguments.
- `format`: Generate a short string representation of the command that will be displayed in the notebook.
- `title`: Generate a shorter string representation of the command that will be displayed in the table of contents.
- `process`: This is the 'main' method. This method is invoked when the cell is run.
- `predictProvenance`: Currently unused. Return None.
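Putting these together, a minimal command might look like the following sketch. This is illustrative only: `ReverseCommand` is hypothetical, and the exact method signatures (especially `predictProvenance`) should be checked against the `Command` trait in your checkout.

```scala
import play.api.libs.json.JsObject
import info.vizierdb.commands._

// Hypothetical sketch: ReverseCommand is not part of the codebase, and the
// precise signatures should be verified against the Command trait.
object ReverseCommand extends Command {
  // Customary: define parameter ids as constants on the object
  val PAR_TEXT = "text"

  def name: String = "Reverse Text"
  def parameters: Seq[Parameter] = Seq(
    StringParameter(id = PAR_TEXT, name = "Text to reverse")
  )
  // Short description shown in the notebook
  def format(arguments: Arguments): String =
    s"REVERSE ${arguments.get[String](PAR_TEXT)}"
  // Even shorter description shown in the table of contents
  def title(arguments: Arguments): String = "Reverse"
  // The 'main' method, invoked when the cell is run
  def process(arguments: Arguments, context: ExecutionContext): Unit =
    context.message(arguments.get[String](PAR_TEXT).reverse)
  // Currently unused; just return None
  def predictProvenance(arguments: Arguments, properties: JsObject) = None
}
```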
Parameters
See info.vizierdb.commands.Parameter for specifics. Common parameter options include:
- `id`: A unique identifier for the parameter. This is how the parameter is retrieved. It is customary to define ids as constants in the Command object.
- `name`: A human-readable name for the parameter. This will be displayed to the user in the default-generated user interface.
- `required` (default `true`): True if the user should be required to enter a value for the parameter (note: this option is ignored if a default value is given).
- `hidden` (default `false`): True if the parameter should be stored in the background. This is primarily useful for caching state across invocations of a command.
- `default`: Specifies a default value for the parameter (note: not available on all parameter types).
Special parameter types include:
- BooleanParameter: Simple yes/no. Implemented by default as a checkbox.
- DecimalParameter: Any floating-point number. Implemented by default as a numeric Input box.
- IntParameter: Any integer. Implemented by default as a numeric Input box.
- StringParameter: Any (short) string (e.g., the name of an output dataset). Implemented by default as an Input box.
- DataTypeParameter: A data type. Implemented by default as a drop-down menu with standard options (e.g., String, Integer, etc...).
- CodeParameter: Source code. The `language` option must be one of `python`, `scala`, `sql`, or `markdown`. Implemented by default as a CodeMirror editor.
- EnumerableParameter: A list of possible options. Each option may be provided with human-readable text and a string 'value' for the backend. Implemented by default as a drop-down menu.
- DatasetParameter: A dataset. Implemented by default as a drop-down menu with a list of all Datasets available at that point in the notebook.
- ColIdParameter: A column in a dataset (note: Requires that there be an accompanying DatasetParameter). Implemented by default as a drop-down menu with a list of columns in the most recently specified dataset.
- ArtifactParameter: Any artifact. Implemented by default as a drop-down menu with a list of all Artifacts available at that point in the notebook. (The `artifactType` option may be used to restrict the class of artifacts that may be selected.)
- FileParameter: A user-uploaded file. Implemented by default as a file drop area.
- ListParameter: A table of parameter values. Each Parameter provided in the `components` option forms one column of the table. Users may define as many rows as desired. Implemented by default as a 2-D grid with [+] and [-] annotations to allow insertion and deletion of rows.
- RecordParameter: Like ListParameter, but limited to a single row. Useful for grouping related parameters together. Implemented by default as an option group.
- EnvironmentParameter: An execution environment. `language` must be one of `python`. Implemented by default as a drop-down menu.
- RowIdParameter: Deprecated.
- CachedStateParameter: A workaround allowing cells to preserve "cached" state in between executions. Generally, you should not use this.
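As an illustration, a file-loading command might declare parameters along these lines. This is a hypothetical sketch: the ids, names, and constructor-argument spellings follow the option descriptions above, but should be checked against info.vizierdb.commands.Parameter.

```scala
import info.vizierdb.commands._

// Hypothetical sketch: ids, names, and option spellings are examples only;
// verify the actual constructors in info.vizierdb.commands.Parameter.
def parameters: Seq[Parameter] = Seq(
  FileParameter(id = "file", name = "Source File"),
  StringParameter(id = "name", name = "Dataset Name"),
  EnumerableParameter(id = "format", name = "File Format",
    values = EnumerableValue.withNames(
      "CSV"  -> "csv",
      "JSON" -> "json"
    )),
  BooleanParameter(id = "header", name = "Has Header Row", required = false)
)
```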
Arguments
`format`, `title`, and `process` each receive an info.vizierdb.commands.Arguments object that provides access to the values of the parameters declared in `parameters`. Retrieve an argument value according to its Parameter type as follows:
- BooleanParameter: `args.get[Boolean](parameterId)` (returns true/false according to the parameter)
- DecimalParameter: `args.get[Double](parameterId)` (returns the double value of the parameter)
- IntParameter: `args.get[Int](parameterId)` (returns the integer value of the parameter)
- StringParameter: `args.get[String](parameterId)` (returns the string value of the parameter)
- DataTypeParameter: `args.get[DataType](parameterId)` (returns an Apache Spark DataType)
- CodeParameter: `args.get[String](parameterId)` (returns the code as a String -- note, this may be quite large)
- EnumerableParameter: `args.get[String](parameterId)` (returns the `value` field of the selected option)
- DatasetParameter: `args.get[String](parameterId)` (returns the name of the dataset artifact; use the ExecutionContext to get the artifact itself)
- ColIdParameter: `args.get[Int](parameterId)` (returns the integer index of the column)
- ArtifactParameter: `args.get[String](parameterId)` (returns the name of the artifact; use the ExecutionContext to get the artifact itself)
- FileParameter: `args.get[FileArgument](parameterId)` (returns a FileArgument referencing the uploaded/provided file)
- ListParameter: `args.getList(parameterId)` (returns a Seq of Arguments, one per row)
- RecordParameter: `args.getRecord(parameterId)` (returns an Arguments)
Opt variants of the above methods (e.g., `args.getOpt`) return an Option, yielding None if the argument is not provided (e.g., if `required = false` for the corresponding Parameter).
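Inside `process`, argument retrieval for a few of these types might look like the following sketch ("name", "limit", "columns", and "column" are hypothetical parameter ids):

```scala
import info.vizierdb.commands.{ Arguments, ExecutionContext }

// Hypothetical sketch: the parameter ids here are examples,
// not part of the Vizier API.
def process(arguments: Arguments, context: ExecutionContext): Unit = {
  // Required parameter: get throws if the value is missing
  val datasetName: String = arguments.get[String]("name")
  // Optional parameter: getOpt returns None when no value was provided
  val limit: Option[Int] = arguments.getOpt[Int]("limit")
  // A ListParameter yields one Arguments object per row
  for (row <- arguments.getList("columns")) {
    val columnName = row.get[String]("column")
    context.message(s"column: $columnName")
  }
}
```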
Artifacts
Artifacts are used to pass state between cells. Anything that needs to go from one cell to the next must be wrapped in an Artifact.
An artifact is, in general, defined by three items:
- `(id, projectId)`: A unique artifact identifier, coupled with the id of the project that created it.
- `t`: An Artifact Type; see below.
- `data`: An opaque byte array.
An artifact is immutable. Once the artifact is created, it can not be modified or destroyed (although see below for discussion). Artifact versions are identified by the identifier of the project that created them and a globally unique artifact ID. Additionally, the execution context (see below) maintains a list of mappings from friendly 'names' to artifact identifiers.
Vizier assigns artifacts types to streamline interoperability between languages and to make it possible to display artifacts inline. Supported types include:
- DATASET: An Apache Spark dataframe. The `data` parameter must be a json-encoded Dataset object; see ExecutionContext below for helper functions to create these.
- FUNCTION: A snippet of code defining a function. The `mimeType` parameter defines the type of function, and must be one of `application/python`.
- BLOB: An opaque blob of medium-sized data. The `mimeType` parameter may be used to distinguish between different datatypes, and may be anything.
- FILE: A file stored in the filesystem (preferred for large data). The `mimeType` parameter is used to store the type of the file.
- PARAMETER: A small, configurable value (currently used to pass Strings, Integers, etc. between python cells). The `data` parameter must be a json-encoded ParameterArtifact.
- VEGALITE: A Vega-Lite chart. The `data` parameter must be a json object conforming to the Vega-Lite spec (e.g., see vizier-vega).
The Artifact class defines several useful helper methods (see the ScalaDoc for the full description):
- `artifact.file`: A java File object holding the path to the file for this artifact. This method is usually only helpful for FILE-typed artifacts; however, any artifact may be defined with an associated file if on-disk storage is required.
- `artifact.parameter` (PARAMETER only): The ParameterArtifact value of the artifact.
- `artifact.data`: The raw data bytes of the artifact (note: if the artifact stores its data in a file, you must use `file` instead).
- `artifact.string`: The raw data of the artifact as a string (shortcut for `new String(artifact.data)`).
- `artifact.json`: The json value of the artifact (shortcut for `Json.parse(artifact.string)`). If the `data` field is empty, an empty object will be returned.
- `artifact.dataframe` (DATASET only): Obtain the Spark dataframe for the artifact. Note: You must have an active database connection to call this method (see DevGuide-Gotchas).
- `artifact.datasetSchema` (DATASET only): Obtain the schema of the specified dataset (as a sequence of Spark StructFields).
- `artifact.datasetPropertyOpt(name)` (DATASET only): Obtain the specified dataset property, or None if the property is not set.
- `artifact.datasetProperty(name)` (DATASET only): Obtain the specified dataset property.
- `artifact.updateDatasetProperty(name, value)` (DATASET only): Update a dataset property.
- `artifact.filePropertyOpt(name)` (FILE only): Obtain the specified file property, or None if the property is not set.
- `artifact.fileProperty(name)` (FILE only): Obtain the specified file property.
- `artifact.updateFileProperty(name, value)` (FILE only): Update a file property.
DATASET and FILE artifacts may be associated with properties, allowing assorted metadata to be attached to the dataset or file. Although these property sets are mutable, they are intended as a cache for lazy computations: expensive computations over the data that are delayed until their results are actually needed. For example, a common use is to store profiler metadata like the number of rows in a dataset. In short, property fields should be treated as append-only.
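The memoization pattern described above might look like the following sketch. The property name "count" and the JsValue-based property types are assumptions; check the Artifact ScalaDoc for the actual signatures.

```scala
import play.api.libs.json.{ JsNumber, JsValue }

// Hypothetical sketch: memoize an expensive row count in a dataset property.
// Artifact here is Vizier's artifact class; the property API types are assumed.
def rowCount(artifact: Artifact): Long =
  artifact.datasetPropertyOpt("count") match {
    case Some(cached: JsValue) => cached.as[Long]   // already computed
    case None =>
      val count = artifact.dataframe.count()        // the expensive step
      artifact.updateDatasetProperty("count", JsNumber(count))  // cache it
      count
  }
```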
ExecutionContext
process also receives an info.vizierdb.commands.ExecutionContext that describes the notebook state at the point of the cell. The context can be used to retrieve artifacts, create artifacts, or output messages.
With respect to artifacts, an ExecutionContext stores a mapping from user-friendly names to specific artifact versions (as noted above, an artifact version is an immutable object identified by a project and artifact id pair). When a cell is run, the execution context it receives is the accumulation of all artifacts created by preceding cells. To emphasize this point: unlike Jupyter, the state a cell sees is based on the order in which cells appear in the notebook and not the order in which cells are executed.
In addition to artifacts, an ExecutionContext may also be used to send messages to the user. These are displayed below the cell in the notebook, but are not visible to any subsequent cells.
For full documentation, see the ScalaDoc for the class.
Reading Artifacts
- `context.artifact(name)`: Obtain the Artifact with the specified `name`.
- `context.dataframe(name)`: Obtain a Spark DataFrame for the artifact with the specified `name`. Triggers an error if the artifact does not exist or is not a DATASET (equivalent to calling `context.artifact(name).dataframe`, but also creates a database session).
- `context.parameter[T](name)`: Obtain the value of a parameter artifact, assuming the parameter has a type that decodes to `T`. Throws an error if `name` does not exist, is not a parameter artifact, or decodes to a type other than `T`.
- `context.file(name){ source => ... }`: Read the contents of the FILE or BLOB artifact with the specified `name`. The provided block takes a scala Source object, and the method returns the value returned by the block. (e.g., `context.file("foo"){ source => Json.parse(source.getLines().mkString) }` would return the json contents of the file)
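A `process` body that consumes upstream state might look like the following sketch (the artifact names "sales" and "threshold" are hypothetical examples):

```scala
import info.vizierdb.commands.{ Arguments, ExecutionContext }

// Hypothetical sketch: "sales" and "threshold" are example artifact names.
def process(arguments: Arguments, context: ExecutionContext): Unit = {
  val df = context.dataframe("sales")                  // a DATASET artifact
  val threshold = context.parameter[Int]("threshold")  // a PARAMETER artifact
  val hits = df.filter(df("total") > threshold).count()
  context.message(s"$hits rows over $threshold")
}
```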
Messaging
- `context.message(message)`: Display the provided message formatted in a fixed-width font.
- `context.error(message)`: Display the provided message and flag the cell execution as having triggered an error (the cell will be highlighted, and subsequent cells that depend on this one will not be executed).
- `context.displayHTML(html[, javascript[, javascriptDependencies]][, cssDependencies])`: Display the provided `html`, rendered as HTML. See below for a discussion of the remaining parameters.
- `context.vega(chart, identifier)`: Output a Vega chart with the specified `identifier` (TODO: this parameter should be called `name` for consistency) as both a message and an artifact. The optional `withMessage` or `withArtifact` parameters can be set to false to suppress either.
- `context.vegalite(chart, identifier)`: (Deprecated) Like `context.vega`, but for a Vega-Lite chart.
- `context.displayDataset(name)`: Display the dataset with the provided artifact `name`.
Writing Artifacts
- `context.output(name, t, data)`: Allocate and output a new generic artifact of the specified type. This method is only encouraged for BLOB artifacts; use one of the helper methods below if one exists.
- `context.setParameter(name, value, dataType)`: Output a new PARAMETER artifact with the specified `name` and `value`. `value` is not type-checked, but must be of a type that Spark will accept for `dataType`.
- `context.outputDataset(name, constructor)`: Output a new DATASET artifact with the specified `name`. `constructor` must be a subclass of DataFrameConstructor; see the info.vizierdb.spark package for existing instances like the SQL ViewConstructor.
- `context.outputFile(name, mimeType) { stream => ... }`: Output a new FILE artifact with the specified `name` and `mimeType`. The provided block should write the file's contents to the provided java OutputStream.
- `context.outputFilePlaceholder(name, mimeType)`: Output a new FILE artifact with the specified `name` and `mimeType`. This method allocates an artifact placeholder, but does not actually create a file; the caller is responsible for creating the file by writing to the path identified by the artifact's `file` method. (`outputFile` is preferred, as it automatically closes the file.)
- `context.outputDatasetWithFile(name, gen)`: Like `outputDataset`, but `gen` is a function that takes an artifact and returns a DataFrameConstructor. This is helpful when the dataframe needs to read from a file, since otherwise there is a chicken-and-egg problem: the DataFrameConstructor needs to know the ID of the artifact that stores it.
- `context.createPipeline(input, [output])(stage, [stage, [...]])`: Output a new DATASET artifact from the result of applying a Spark Pipeline to an input dataset. The pipeline will be trained on the specified `input` dataset, and the `output` dataset will be defined by applying the pipeline to the `input` dataset. If `output` is omitted, the `input` dataset will be replaced.
- `context.outputDataframe(name, dataframe)`: Output a new DATASET artifact consisting of the contents of the specified dataframe. Note that any provenance for the output data will be lost, so `outputDataset` is generally preferred.
- `context.delete(name)`: Delete the specified artifact from the context. The artifact will not actually be deleted; rather, this makes it so that later cells will be unable to access the artifact.
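For example, writing a small CSV file as a FILE artifact might look like the following sketch (the artifact name "report" and its contents are hypothetical):

```scala
import java.io.OutputStream
import info.vizierdb.commands.{ Arguments, ExecutionContext }

// Hypothetical sketch: writes a small CSV as a FILE artifact named "report".
def process(arguments: Arguments, context: ExecutionContext): Unit = {
  context.outputFile("report", "text/csv") { stream: OutputStream =>
    // outputFile closes the stream when this block returns
    stream.write("id,total\n1,42\n".getBytes("UTF-8"))
  }
}
```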
HTML Output
The context.displayHTML method allows cells to display messages with arbitrary formatting and interactivity. In addition to the html itself, the cell may provide javascript. The provided javascript will run after the html is mounted in the DOM.
When it is necessary to refer to DOM nodes from the javascript, do not use hard-coded node ids: if the same workflow module appears twice in a notebook, the same node id will appear twice. Even if your javascript replaces the node id, there may be race conditions when the notebook is re-opened. Instead, assign nodes ids based on the return value of context.executionIdentifier. DOM node ids prefixed with this value are guaranteed to be reserved for use by the context's module.
The javascriptDependencies and cssDependencies parameters allow external dependencies to be loaded dynamically (for example, javascript and css tied to a specific version of Bokeh). These should be provided as references to a CDN or similar (if Vizier has the referenced file locally, the local copy will be used). Dependencies will only be loaded once: the first time they appear in a notebook. The provided javascript will not be executed until all javascript dependencies have loaded.
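Putting this guidance together, a displayHTML call that avoids hard-coded node ids might look like the following sketch. `drawMyChart`, the dependency URL, and the assumption that `javascriptDependencies` takes a Seq of URLs are all hypothetical.

```scala
import info.vizierdb.commands.{ Arguments, ExecutionContext }

// Hypothetical sketch: drawMyChart and the dependency URL are placeholders,
// and javascriptDependencies is assumed to take a Seq of URLs.
def process(arguments: Arguments, context: ExecutionContext): Unit = {
  // Derive a DOM id unique to this module's execution
  val nodeId = s"${context.executionIdentifier}_plot"
  context.displayHTML(
    html = s"""<div id="$nodeId"></div>""",
    // Runs after the html above is mounted in the DOM
    javascript = s"""drawMyChart(document.getElementById("$nodeId"));""",
    javascriptDependencies = Seq("https://example.com/my-chart-lib.js")
  )
}
```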