DevGuide Cells - VizierDB/vizier-scala GitHub Wiki
Note: This page describes the v2.0 API
Overview
A cell command defines the logic for one type of 'module' in a Vizier workflow. This includes everything from language-specific cells to data-visualization and file-loading cells. Every cell has a corresponding command. See the info.vizierdb.commands package for examples.
Notation
Cells are grouped into a three-tiered hierarchy:
- category: A collection of packages that are grouped together for display purposes
- package: A collection of related commands.
- command: A specific operation in a cell.
A specific command is identified by a two-part identifier: `package.command`
Installing Commands
Commands are managed by info.vizierdb.commands.Commands. Since Vizier does not yet support plugin modules, all commands are hard-coded into this package. To make a command visible in Vizier's new-cell interface, add your command object to one of the existing calls to `register` in Commands, or add a new `register` call to create a new package. The format is:
"uniqueCommandId" -> info.vizierdb.commands.command.object.classpath.here
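As an illustrative sketch (the exact `register` signature may differ between versions; check the existing calls in Commands.scala — `mypackage`, `mycmd`, and `MyCommand` are hypothetical names):

```scala
// Hypothetical sketch: the register signature and all names here are
// placeholders -- mirror one of the existing calls in
// info.vizierdb.commands.Commands in your checkout.
register(id = "mypackage", name = "My Package", category = "code")(
  "mycmd" -> info.vizierdb.commands.mypackage.MyCommand
)
```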
Implementing Commands
A command is defined by an object that mixes in the info.vizierdb.commands.Command trait. This trait requires implementations of the following methods:
- `name`: The user-facing name of the command.
- `parameters`: A sequence of info.vizierdb.commands.Parameters that describes the arguments to the command. The parameter list is used to generate a default interface for the cell in the UI, and to serialize/deserialize the actual arguments.
- `format`: Generate a short string representation of the command that will be displayed in the notebook.
- `title`: Generate a shorter string representation of the command that will be displayed in the table of contents.
- `process`: This is the 'main' method. This method is invoked when the cell is run.
- `predictProvenance`: Currently unused. Return None.
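Putting these together, a minimal command might look like the following sketch. This is illustrative only: `ReverseCommand` is hypothetical, and the exact method signatures (especially `predictProvenance`) should be checked against the `Command` trait in your checkout.

```scala
import play.api.libs.json.JsObject
import info.vizierdb.commands._

// Hypothetical sketch: ReverseCommand is not part of the codebase, and the
// precise signatures should be verified against the Command trait.
object ReverseCommand extends Command {
  // Customary: define parameter ids as constants on the object
  val PAR_TEXT = "text"

  def name: String = "Reverse Text"
  def parameters: Seq[Parameter] = Seq(
    StringParameter(id = PAR_TEXT, name = "Text to reverse")
  )
  // Short description shown in the notebook
  def format(arguments: Arguments): String =
    s"REVERSE ${arguments.get[String](PAR_TEXT)}"
  // Even shorter description shown in the table of contents
  def title(arguments: Arguments): String = "Reverse"
  // The 'main' method, invoked when the cell is run
  def process(arguments: Arguments, context: ExecutionContext): Unit =
    context.message(arguments.get[String](PAR_TEXT).reverse)
  // Currently unused; just return None
  def predictProvenance(arguments: Arguments, properties: JsObject) = None
}
```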
Parameters
See info.vizierdb.commands.Parameter for specifics. Common parameter options include:
- `id`: A unique identifier for the parameter. This is how the parameter is retrieved. It is customary to define ids as constants in the Command object.
- `name`: A human-readable name for the parameter. This will be displayed to the user in the default-generated user interface.
- `required` (default `true`): True if the user should be required to enter a value for the parameter (note: this option is ignored if a default value is given).
- `hidden` (default `false`): True if the parameter should be stored in the background. This is primarily useful for caching state across invocations of a command.
- `default`: Specifies a default value for the parameter (note: not available on all parameter types).
Special parameter types include:
- BooleanParameter: Simple yes/no. Implemented by default as a checkbox.
- DecimalParameter: Any floating-point number. Implemented by default as a numeric Input box.
- IntParameter: Any integer. Implemented by default as a numeric Input box.
- StringParameter: Any (short) string (e.g., the name of an output dataset). Implemented by default as an Input box.
- DataTypeParameter: A data type. Implemented by default as a drop-down menu with standard options (e.g., String, Integer, etc...).
- CodeParameter: Source code. The `language` option must be one of `python`, `scala`, `sql`, or `markdown`. Implemented by default as a CodeMirror editor.
- EnumerableParameter: A list of possible options. Each option may be provided with human-readable text and a string 'value' for the backend. Implemented by default as a drop-down menu.
- DatasetParameter: A dataset. Implemented by default as a drop-down menu with a list of all Datasets available at that point in the notebook.
- ColIdParameter: A column in a dataset (note: Requires that there be an accompanying DatasetParameter). Implemented by default as a drop-down menu with a list of columns in the most recently specified dataset.
- ArtifactParameter: Any artifact. Implemented by default as a drop-down menu with a list of all Artifacts available at that point in the notebook. (The `artifactType` option may be used to restrict the class of artifacts that may be selected.)
- FileParameter: A user-uploaded file. Implemented by default as a file drop area.
- ListParameter: A table of parameter values. Each Parameter provided in the `components` option forms one column of the table. Users may define as many rows as desired. Implemented by default as a 2-D grid with [+] and [-] annotations to allow insertion and deletion of rows.
- RecordParameter: Like ListParameter, but limited to a single row. Useful for grouping related parameters together. Implemented by default as an option group.
- EnvironmentParameter: An execution environment. `language` must be one of `python`. Implemented by default as a drop-down menu.
- RowIdParameter: Deprecated.
- CachedStateParameter: A workaround allowing cells to preserve "cached" state in between executions. Generally, you should not use this.
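As an illustration, a file-loading command might declare parameters along these lines. This is a hypothetical sketch: the ids, names, and constructor-argument spellings follow the option descriptions above, but should be checked against info.vizierdb.commands.Parameter.

```scala
import info.vizierdb.commands._

// Hypothetical sketch: ids, names, and option spellings are examples only;
// verify the actual constructors in info.vizierdb.commands.Parameter.
def parameters: Seq[Parameter] = Seq(
  FileParameter(id = "file", name = "Source File"),
  StringParameter(id = "name", name = "Dataset Name"),
  EnumerableParameter(id = "format", name = "File Format",
    values = EnumerableValue.withNames(
      "CSV"  -> "csv",
      "JSON" -> "json"
    )),
  BooleanParameter(id = "header", name = "Has Header Row", required = false)
)
```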
Arguments
`format`, `title`, and `process` each receive an info.vizierdb.commands.Arguments object that provides access to the values of the parameters declared in `parameters`. Retrieve an argument value according to its Parameter type as follows:
- BooleanParameter: `args.get[Boolean](parameterId)` (returns true/false according to the parameter)
- DecimalParameter: `args.get[Double](parameterId)` (returns the double value of the parameter)
- IntParameter: `args.get[Int](parameterId)` (returns the integer value of the parameter)
- StringParameter: `args.get[String](parameterId)` (returns the string value of the parameter)
- DataTypeParameter: `args.get[DataType](parameterId)` (returns an Apache Spark DataType)
- CodeParameter: `args.get[String](parameterId)` (returns the code as a String -- note, this may be quite large)
- EnumerableParameter: `args.get[String](parameterId)` (returns the `value` field of the selected option)
- DatasetParameter: `args.get[String](parameterId)` (returns the name of the dataset artifact; use the ExecutionContext to get the artifact itself)
- ColIdParameter: `args.get[Int](parameterId)` (returns the integer index of the column)
- ArtifactParameter: `args.get[String](parameterId)` (returns the name of the artifact; use the ExecutionContext to get the artifact itself)
- FileParameter: `args.get[FileArgument](parameterId)` (returns a FileArgument referencing the uploaded/provided file)
- ListParameter: `args.getList(parameterId)` (returns a Seq of Arguments, one per row)
- RecordParameter: `args.getRecord(parameterId)` (returns an Arguments)
Opt variants of the above methods (e.g., `args.getOpt`) return an Option, yielding None if the argument is not provided (e.g., if `required = false` for the corresponding Parameter).
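Inside `process`, argument retrieval for a few of these types might look like the following sketch ("name", "limit", "columns", and "column" are hypothetical parameter ids):

```scala
import info.vizierdb.commands.{ Arguments, ExecutionContext }

// Hypothetical sketch: the parameter ids here are examples,
// not part of the Vizier API.
def process(arguments: Arguments, context: ExecutionContext): Unit = {
  // Required parameter: get throws if the value is missing
  val datasetName: String = arguments.get[String]("name")
  // Optional parameter: getOpt returns None when no value was provided
  val limit: Option[Int] = arguments.getOpt[Int]("limit")
  // A ListParameter yields one Arguments object per row
  for (row <- arguments.getList("columns")) {
    val columnName = row.get[String]("column")
    context.message(s"column: $columnName")
  }
}
```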
Artifacts
Artifacts are used to pass state between cells. Anything that needs to go from one cell to the next must be wrapped in an Artifact.
An artifact is, in general, defined by three items:
- `(id, projectId)`: A unique artifact identifier, coupled with the id of the project that created it.
- `t`: An Artifact Type; see below.
- `data`: An opaque byte array.
An artifact is immutable. Once the artifact is created, it can not be modified or destroyed (although see below for discussion). Artifact versions are identified by the identifier of the project that created them and a globally unique artifact ID. Additionally, the execution context (see below) maintains a list of mappings from friendly 'names' to artifact identifiers.
Vizier assigns artifacts types to streamline interoperability between languages and to make it possible to display artifacts inline. Supported types include:
- DATASET: An Apache Spark dataframe. The `data` parameter must be a json-encoded Dataset object; see ExecutionContext below for helper functions to create these.
- FUNCTION: A snippet of code defining a function. The `mimeType` parameter defines the type of function, and must be one of `application/python`.
- BLOB: An opaque blob of medium-sized data. The `mimeType` parameter may be used to distinguish between different datatypes, and may be anything.
- FILE: A file stored in the filesystem (preferred for large data). The `mimeType` parameter is used to store the type of the file.
- PARAMETER: A small, configurable value (currently used to pass Strings, Integers, etc. between python cells). The `data` parameter must be a json-encoded ParameterArtifact.
- VEGALITE: A Vega-Lite chart. The `data` parameter must be a json object conforming to the Vega-Lite spec (e.g., see vizier-vega).
The Artifact class defines several useful helper methods (see the ScalaDoc for the full description):
- `artifact.file`: A java File object holding the path to the file for this artifact. This method is usually only helpful for FILE-typed artifacts; however, any artifact may be defined with an associated file if on-disk storage is required.
- `artifact.parameter` (PARAMETER only): The ParameterArtifact value of the artifact.
- `artifact.data`: The raw data bytes of the artifact (note: if the artifact stores its data in a file, you must use `file` instead).
- `artifact.string`: The raw data of the artifact as a string (shortcut for `new String(artifact.data)`).
- `artifact.json`: The json value of the artifact (shortcut for `Json.parse(artifact.string)`). If the `data` field is empty, an empty object will be returned.
- `artifact.dataframe` (DATASET only): Obtain the Spark dataframe for the artifact. Note: You must have an active database connection to call this method (see DevGuide-Gotchas).
- `artifact.datasetSchema` (DATASET only): Obtain the schema of the specified dataset (as a sequence of Spark StructFields).
- `artifact.datasetPropertyOpt(name)` (DATASET only): Obtain the specified dataset property, or None if the property is not set.
- `artifact.datasetProperty(name)` (DATASET only): Obtain the specified dataset property.
- `artifact.updateDatasetProperty(name, value)` (DATASET only): Update a dataset property.
- `artifact.filePropertyOpt(name)` (FILE only): Obtain the specified file property, or None if the property is not set.
- `artifact.fileProperty(name)` (FILE only): Obtain the specified file property.
- `artifact.updateFileProperty(name, value)` (FILE only): Update a file property.
DATASET and FILE artifacts may be associated with properties, allowing assorted metadata to be attached to the dataset or file. Although these property sets are mutable, they are intended as a cache for lazy computations: expensive computations over the data that are delayed until their results are actually needed. For example, a common use is to store profiler metadata like the number of rows in a dataset. In short, property fields should be treated as append-only.
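The memoization pattern described above might look like the following sketch. The property name "count" and the JsValue-based property types are assumptions; check the Artifact ScalaDoc for the actual signatures.

```scala
import play.api.libs.json.{ JsNumber, JsValue }

// Hypothetical sketch: memoize an expensive row count in a dataset property.
// Artifact here is Vizier's artifact class; the property API types are assumed.
def rowCount(artifact: Artifact): Long =
  artifact.datasetPropertyOpt("count") match {
    case Some(cached: JsValue) => cached.as[Long]   // already computed
    case None =>
      val count = artifact.dataframe.count()        // the expensive step
      artifact.updateDatasetProperty("count", JsNumber(count))  // cache it
      count
  }
```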
ExecutionContext
process also receives an info.vizierdb.commands.ExecutionContext that describes the notebook state at the point of the cell. The context can be used to retrieve artifacts, create artifacts, or output messages.
With respect to artifacts, an ExecutionContext stores a mapping from user-friendly names to specific artifact versions (as noted above, an artifact version is an immutable object identified by a project and artifact id pair). When a cell is run, the execution context it receives is the accumulation of all artifacts created by preceding cells. To emphasize this point: unlike Jupyter, the state a cell sees is based on the order in which cells appear in the notebook and not the order in which cells are executed.
In addition to artifacts, an ExecutionContext may also be used to send messages to the user. These are displayed below the cell in the notebook, but are not visible to any subsequent cells.
For full documentation, see the ScalaDoc for the class.
Reading Artifacts
- `context.artifact(name)`: Obtain the Artifact with the specified `name`.
- `context.dataframe(name)`: Obtain a Spark DataFrame for the artifact with the specified `name`. Triggers an error if the artifact does not exist or is not a DATASET (equivalent to calling `context.artifact(name).dataframe`, but also creates a database session).
- `context.parameter[T](name)`: Obtain the value of a parameter artifact, assuming the parameter has a type that decodes to `T`. Throws an error if `name` does not exist, is not a parameter artifact, or decodes to a type other than `T`.
- `context.file(name){ source => ... }`: Read the contents of the FILE or BLOB artifact with the specified `name`. The provided block takes a scala Source object, and the method returns the value returned by the block. (e.g., `context.file("foo"){ source => Json.parse(source.getLines().mkString) }` would return the json contents of the file)
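A `process` body that consumes upstream state might look like the following sketch (the artifact names "sales" and "threshold" are hypothetical examples):

```scala
import info.vizierdb.commands.{ Arguments, ExecutionContext }

// Hypothetical sketch: "sales" and "threshold" are example artifact names.
def process(arguments: Arguments, context: ExecutionContext): Unit = {
  val df = context.dataframe("sales")                  // a DATASET artifact
  val threshold = context.parameter[Int]("threshold")  // a PARAMETER artifact
  val hits = df.filter(df("total") > threshold).count()
  context.message(s"$hits rows over $threshold")
}
```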
Messaging
- `context.message(message)`: Display the provided message formatted in a fixed-width font.
- `context.error(message)`: Display the provided message and flag the cell execution as having triggered an error (the cell will be highlighted, and subsequent cells that depend on this one will not be executed).
- `context.displayHTML(html[, javascript[, javascriptDependencies]][, cssDependencies])`: Display the provided `html`, rendered as HTML. See below for a discussion of the remaining parameters.
- `context.vega(chart, identifier)`: Output a Vega chart with the specified `identifier` (TODO: this parameter should be called `name` for consistency) as both a message and an artifact. The optional `withMessage` or `withArtifact` parameters can be set to false to suppress either.
- `context.vegalite(chart, identifier)`: (Deprecated) Like `context.vega`, but for a Vega-Lite chart.
- `context.displayDataset(name)`: Display the dataset with the provided artifact `name`.
Writing Artifacts
- `context.output(name, t, data)`: Allocate and output a new generic artifact of the specified type. This method is only encouraged for BLOB artifacts; use one of the helper methods below if one exists.
- `context.setParameter(name, value, dataType)`: Output a new PARAMETER artifact with the specified `name` and `value`. `value` is not type-checked, but must be of a type that Spark will accept for `dataType`.
- `context.outputDataset(name, constructor)`: Output a new DATASET artifact with the specified `name`. `constructor` must be a subclass of DataFrameConstructor; see the info.vizierdb.spark package for existing instances like the SQL ViewConstructor.
- `context.outputFile(name, mimeType) { stream => ... }`: Output a new FILE artifact with the specified `name` and `mimeType`. The provided block should write the file's contents to the provided java OutputStream.
- `context.outputFilePlaceholder(name, mimeType)`: Output a new FILE artifact with the specified `name` and `mimeType`. This method allocates an artifact placeholder, but does not actually create a file; the caller is responsible for creating the file by writing to the path identified by the artifact's `file` method. (`outputFile` is preferred, as it automatically closes the file.)
- `context.outputDatasetWithFile(name, gen)`: Like `outputDataset`, but `gen` is a function that takes an artifact and returns a DataFrameConstructor. This is helpful when the dataframe needs to read from a file, since otherwise there is a chicken-and-egg problem: the DataFrameConstructor needs to know the ID of the artifact that stores it.
- `context.createPipeline(input, [output])(stage, [stage, [...]])`: Output a new DATASET artifact from the result of applying a Spark Pipeline to an input dataset. The pipeline will be trained on the specified `input` dataset, and the `output` dataset will be defined by applying the pipeline to the `input` dataset. If `output` is omitted, the `input` dataset will be replaced.
- `context.outputDataframe(name, dataframe)`: Output a new DATASET artifact consisting of the contents of the specified dataframe. Note that any provenance for the output data will be lost, so `outputDataset` is generally preferred.
- `context.delete(name)`: Delete the specified artifact from the context. The artifact will not actually be deleted; rather, this makes it so that later cells will be unable to access the artifact.
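For example, writing a small CSV file as a FILE artifact might look like the following sketch (the artifact name "report" and its contents are hypothetical):

```scala
import java.io.OutputStream
import info.vizierdb.commands.{ Arguments, ExecutionContext }

// Hypothetical sketch: writes a small CSV as a FILE artifact named "report".
def process(arguments: Arguments, context: ExecutionContext): Unit = {
  context.outputFile("report", "text/csv") { stream: OutputStream =>
    // outputFile closes the stream when this block returns
    stream.write("id,total\n1,42\n".getBytes("UTF-8"))
  }
}
```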
HTML Output
The context.displayHTML method allows cells to display messages with arbitrary formatting and interactivity. In addition to the html itself, the cell may provide javascript. The provided javascript will run after the html is mounted in the DOM.
When it is necessary to refer to DOM nodes from the javascript, do not use hard-coded node ids: if the same workflow module appears twice in a notebook, the same node id will appear twice. Even if your javascript replaces the node id, there may be race conditions when the notebook is re-opened. Instead, assign nodes ids based on the return value of context.executionIdentifier. DOM node ids prefixed with this value are guaranteed to be reserved for use by the context's module.
The javascriptDependencies and cssDependencies parameters allow external dependencies to be loaded dynamically (for example, javascript and css tied to a specific version of Bokeh). These should be provided as references to a CDN or similar (if Vizier has the referenced file locally, the local copy will be used). Dependencies will only be loaded once: the first time they appear in a notebook. The provided javascript will not be executed until all javascript dependencies have loaded.
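Putting this guidance together, a displayHTML call that avoids hard-coded node ids might look like the following sketch. `drawMyChart`, the dependency URL, and the assumption that `javascriptDependencies` takes a Seq of URLs are all hypothetical.

```scala
import info.vizierdb.commands.{ Arguments, ExecutionContext }

// Hypothetical sketch: drawMyChart and the dependency URL are placeholders,
// and javascriptDependencies is assumed to take a Seq of URLs.
def process(arguments: Arguments, context: ExecutionContext): Unit = {
  // Derive a DOM id unique to this module's execution
  val nodeId = s"${context.executionIdentifier}_plot"
  context.displayHTML(
    html = s"""<div id="$nodeId"></div>""",
    // Runs after the html above is mounted in the DOM
    javascript = s"""drawMyChart(document.getElementById("$nodeId"));""",
    javascriptDependencies = Seq("https://example.com/my-chart-lib.js")
  )
}
```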