Sossity Overview

See Sossity Readme for command-line args.

Sossity allows everyone across an organization to access and transform data in real-time, without worrying about code or scale.

Sossity is the link between application code, stream processing, and cloud resources.

Given a set of data sources, sinks, and data pipelines in a config file, Sossity determines the dependencies and cloud resources necessary to execute the pipelines. It then creates a Terraform resource file that describes the individual cloud services to create.

Sossity is written in Clojure, because the language has succinct ways to transform and apply data structures -- especially graphs.

Sossity follows several steps:

  1. Check config file for errors
  2. Build dependency graph
  3. Verify dependency graph
  4. Output Terraform file

Check Config File

Sossity uses Plumatic Schema to define the data structures for the config files (config.clj and test_config.clj). It applies the schema to the file and notifies the user of any missing, misspelled, or otherwise unexpected fields. There are also some "sanity checks", such as making sure that no two resources share the same name. Further checks run at each stage of the planning process.
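
As a rough illustration of these checks (written in Python, not Sossity's Clojure and Schema code), the sketch below reports missing or unrecognized top-level sections and duplicate resource names. The section names and the check_config helper are assumptions made for this example and may not match Sossity's actual schema.

```python
# Hypothetical top-level sections; Sossity's real schema may differ.
EXPECTED_SECTIONS = ("sources", "pipelines", "sinks", "edges")


def check_config(config):
    """Return a list of problems; an empty list means the config passed."""
    problems = []
    for section in EXPECTED_SECTIONS:
        if section not in config:
            problems.append(f"missing section: {section}")
    for key in config:
        if key not in EXPECTED_SECTIONS:
            problems.append(f"unrecognized section: {key}")
    # Sanity check: no two resources may share a name.
    owner = {}
    for section in ("sources", "pipelines", "sinks"):
        for name in config.get(section, {}):
            if name in owner:
                problems.append(
                    f"duplicate name {name!r} in {owner[name]} and {section}"
                )
            else:
                owner[name] = section
    return problems


example = {
    "sources": {"stream1": {}},
    "pipelines": {"stream1": {}},  # clashes with the source name
    "sinkz": {"sink1": {}},        # misspelled section
    "edges": [],
}
for problem in check_config(example):
    print(problem)
```

Collecting every problem into a list, rather than stopping at the first one, lets the user fix a config file in a single pass.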

Build Dependency Graph

A dependency graph is a collection of nodes and edges describing the dependencies between resources and the order in which they must be built. Unlike the graphs in many other workflow systems, Sossity's graph is not a Directed Acyclic Graph (DAG): error pipelines can return messages to their originator, which creates cycles.

Example dependency graph:

[Figure: Graph Example]

There are 4 main resources in Sossity:

  1. Sources
  2. Pipelines
  3. Sinks
  4. Edges

Sources are data ingestion points -- currently, only REST endpoints managed by App Engine. Pipelines are Cloud Dataflow jobs. Sinks are Kubernetes images which consume PubSub messages and write to an external destination (files, external APIs, or BigQuery). Edges are PubSub queues.

Graph of Terraform resources and their dependencies:

[Figure: Graph Resources]

Sossity uses Loom as its graph library.

The steps that Sossity takes to create an entire Flow:

  1. Create graph nodes from sources, sinks, and pipelines. A node is an operation on data.
  2. Create graph edges from the targets listed in the edges entry of the config file. An edge is a communication channel for data between nodes.
  3. Annotate nodes with metadata from the config file, such as name, bucket, and transform-jar.
  4. Calculate the dependencies of each node based on its ancestor edges (incoming data) and descendant edges (outgoing data).
  5. Name resources based on dependencies. For example, an edge might be named pipeline1-to-pipeline2.
  6. Using metadata and connectivity, create the Terraform resources in the output file.

Create Graph Nodes

Sossity looks at all sources, sinks, and pipelines and creates a node data structure for each one.

Create Edges

From the edges config file entry, Sossity connects every node to its relatives using an edge data structure.

Annotate Nodes

Sossity then applies all the metadata from the config file to every node and edge -- this includes information like output buckets, executable jars, etc.
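
The three steps so far (create nodes, create edges, annotate) can be sketched in a few lines. This is a Python illustration, not Sossity's Loom-based Clojure code. The config layout, the "origin" key, and the build_graph helper are assumptions; only "targets", "bucket", and "transform-jar" are names taken from this document. For brevity the sketch annotates nodes only, not edges.

```python
def build_graph(config):
    """Return (nodes, edges): annotated nodes by name and (origin, target) pairs."""
    nodes = {}
    for kind in ("sources", "pipelines", "sinks"):
        for name, metadata in config.get(kind, {}).items():
            # Create the node, then annotate it with its config metadata.
            nodes[name] = {"kind": kind, **metadata}
    edges = []
    for entry in config.get("edges", []):
        # One graph edge per target listed in the config entry.
        for target in entry["targets"]:
            edges.append((entry["origin"], target))
    return nodes, edges


# Hypothetical config, shaped only for this example.
config = {
    "sources": {"stream1": {}},
    "pipelines": {"pipeline1": {"transform-jar": "pipeline1.jar"}},
    "sinks": {"sink1": {"bucket": "sink1-output"}},
    "edges": [
        {"origin": "stream1", "targets": ["pipeline1"]},
        {"origin": "pipeline1", "targets": ["sink1"]},
    ],
}
nodes, edges = build_graph(config)
print(nodes["pipeline1"])
print(edges)
```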

Calculate Dependencies

Because dependencies are transitive, a resource only needs to know its own direct dependencies: an in-order traversal of the graph ensures that its relatives' dependencies are built as well.

Each resource has different dependencies. All resources need at least one PubSub; some have other needs (Subscriptions, Cloud Storage buckets, etc.).
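
A minimal Python sketch of the idea, assuming edges are stored as (origin, target) pairs: each node looks only at the edges that touch it. The direct_dependencies helper and the node names are hypothetical.

```python
def direct_dependencies(node, edges):
    """Return the edges a node depends on: those feeding it and those it feeds."""
    return {
        "incoming": [(o, t) for o, t in edges if t == node],
        "outgoing": [(o, t) for o, t in edges if o == node],
    }


edges = [
    ("stream1", "pipeline1"),
    ("pipeline1", "pipeline2"),
    ("pipeline2", "sink1"),
]
names = ["stream1", "pipeline1", "pipeline2", "sink1"]
deps = {name: direct_dependencies(name, edges) for name in names}

# Taken together, the per-node views cover every edge in the graph.
covered = {e for d in deps.values() for side in d.values() for e in side}
print(covered == set(edges))
```

No node needs a global view: visiting every node and building only its direct dependencies still accounts for every edge.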

Name Resources

Sossity tries to follow a "Convention-over-Configuration" method to name resources. Naming is important, because it allows resources to be easily identifiable and unique.

Names of nodes and pipelines are determined in the config file. PubSubs are named after their inputs and outputs (e.g., pipeline1-to-pipeline2). Error PubSubs are named after their parent (pipeline1-to-pipeline1-error). Error sinks are also named after their parent (pipeline1-error).
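
The naming conventions above are simple enough to express as string templates. The Python function names below are hypothetical; the resulting names follow the examples given in this section.

```python
def edge_name(origin, target):
    """PubSubs are named after their input and output."""
    return f"{origin}-to-{target}"


def error_sink_name(parent):
    """Error sinks are named after their parent."""
    return f"{parent}-error"


def error_edge_name(parent):
    """Error PubSubs connect a parent to its own error sink."""
    return edge_name(parent, error_sink_name(parent))


print(edge_name("pipeline1", "pipeline2"))
print(error_sink_name("pipeline1"))
print(error_edge_name("pipeline1"))
```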

Output Terraform File

A large part of the Sossity codebase is dedicated to translating the dependency graph into a Terraform file. To do this, the metadata for each node and edge is turned into a Terraform resource. For example, a Pipeline is turned into a Cloud Dataflow job, a Sink is turned into a Kubernetes Container, and a Source is turned into an App Engine Module. See the Sossity Output for more details.
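
To give a feel for the translation, the Python sketch below turns edges into PubSub topic resources using Terraform's JSON syntax and the google_pubsub_topic resource type from the Google provider. Whether Sossity itself emits JSON or HCL, and which resource types it uses, is not stated here, so treat the output shape and the pubsub_topics helper as assumptions.

```python
import json


def pubsub_topics(edges):
    """Build a Terraform JSON document with one PubSub topic per edge."""
    resources = {}
    for origin, target in edges:
        name = f"{origin}-to-{target}"
        # The Terraform local name and the topic name are kept identical.
        resources[name] = {"name": name}
    return {"resource": {"google_pubsub_topic": resources}}


document = pubsub_topics([("pipeline1", "pipeline2")])
print(json.dumps(document, indent=2))
```

Pipelines, sinks, and sources would each need their own resource type and arguments, filled in from the node metadata.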

Error Handling

Since every pipeline's errors are output to a queue, you can also write a pipeline jar that reads the queue and re-inserts the data into the main workflow. Simply specify the :repair-jar in the pipeline's config file entry, and Sossity will attach it with the correct inputs and outputs.
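
One way to picture the wiring, as a Python sketch under stated assumptions: when a pipeline declares a repair jar, a repair node is added that reads from the pipeline and writes back to it, which is why the graph can contain cycles. The attach_repair helper and the "-repair" node name are hypothetical; how Sossity actually names and connects the repair pipeline is not described here.

```python
def attach_repair(pipeline, metadata, edges):
    """Return the edges plus an error loop if the pipeline declares a repair jar."""
    if "repair-jar" not in metadata:
        return list(edges)
    repair = f"{pipeline}-repair"  # hypothetical node name
    # Errors flow out to the repair node, and repaired data flows back in.
    return list(edges) + [(pipeline, repair), (repair, pipeline)]


edges = [("stream1", "pipeline1"), ("pipeline1", "sink1")]
wired = attach_repair("pipeline1", {"repair-jar": "repair.jar"}, edges)
print(wired)
```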

Error output files can also be read by a batch Cloud Dataflow job and re-inserted into the main workflow. See this example project (soon) for more details.