Sossity Config File Formats - 22Acacia/sossity GitHub Wiki

File Format

The Sossity file format is the heart of the pipelines: it describes how data gets into the system, which pipelines transform it, how pipelines communicate with each other, and where the data is output.

There are two files: config.clj and test_config.clj (for simulation only).

Each file is a Clojure map literal, which keeps it easy to read and edit.

Multiple config.clj files can be given on the Sossity command line, usually consisting of the :pipelines, :sources, :sinks, and :edges sections. Sossity merges these sections. See the Sossity README for more information.
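As a rough illustration of what merging sections could look like, the sketch below combines two config maps. It is an assumption about the behavior, not Sossity's actual code: merge-section and merge-configs are hypothetical names, map sections are merged key by key, and vector sections such as :edges are concatenated.

```clojure
;; Hypothetical sketch: NOT Sossity's real merge code.
(defn merge-section
  "Combine one section that appears in two config files."
  [a b]
  (cond
    (and (map? a) (map? b))       (merge a b)  ; e.g. :pipelines, :sources, :sinks
    (and (vector? a) (vector? b)) (into a b)   ; e.g. :edges
    :else                         b))          ; otherwise the later file wins

(defn merge-configs
  "Merge any number of config maps, section by section."
  [& configs]
  (apply merge-with merge-section configs))

;; Two trimmed-down config files, reusing names from the example on this page.
(def base  {:pipelines {"testpipeline" {:key "identitypipeline"}}
            :edges     [{:origin "testendpoint" :targets ["testpipeline"]}]})
(def extra {:pipelines {"orionbqfilter" {:key "orion-transform"}}
            :edges     [{:origin "orion" :targets ["orionbqfilter"]}]})

(merge-configs base extra)
```

Under these assumptions the result holds both pipelines and both edges. How Sossity resolves a key that is defined in two files is not covered here.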

Note: a bug in the Cloud Dataflow SDK prevents us from using the word bucket to describe where jar files are stored, so the config uses the word pail instead.

Example

config.clj

{:config    {:remote-composer-classpath "/usr/local/lib/angleddream-bundled-0.1-ALPHA.jar"
             :remote-libs-path          "/usr/local/lib"
             :sink-resource-version "1"
             :source-resource-version "1"
             :appengine-gstoragekey "hxtest-1.0-SNAPSHOT"
             :default-sink-docker-image "gcr.io/hx-test/store-sink"
             :error-buckets             false
             :system-jar-info {:angleddream {:name "angleddream-bundled-0.1-ALPHA.jar"
                                             :pail "build-artifacts-public-eu"
                                             :key  "angleddream"}
                               :sossity     {:name "sossity-0.1.0-SNAPSHOT-standalone.jar"
                                             :pail "build-artifacts-public-eu"
                                             :key  "sossity"}}}
 :cluster   {:name        "hxhstack" :initial_node_count 4 
             :master_auth {:username "hx" :password "hstack"}
             :node_config {:oauth_scopes ["https://www.googleapis.com/auth/compute"
                                          "https://www.googleapis.com/auth/devstorage.read_only"
                                          "https://www.googleapis.com/auth/logging.write"
                                          "https://www.googleapis.com/auth/monitoring"
                                          "https://www.googleapis.com/auth/cloud-platform"]
                           :machine_type "n1-standard-1"}}
 :opts      {:maxNumWorkers   "1" :numWorkers "1" :zone "europe-west1-c" 
             :workerMachineType "n1-standard-1"   :stagingLocation "gs://hx-test/staging-eu"}
 :provider  {:credentials "${file(\"/home/ubuntu/demo-config/account.json\")}"
             :project "hx-test"}
 :containers {"riid" {:image "gcr.io/hx-trial/responsys-resource:latest" :resource-version "v4"} }
 :pipelines {"testpipeline"
             {:transform-jar "identitypipeline-0.1-ALPHA.jar"
              :pail "build-artifacts-public-eu"
              :key "identitypipeline"}
             "orionidentitypipe"
             {:transform-jar  "identitypipeline-0.1-ALPHA.jar"
              :pail "build-artifacts-public-eu"
              :key "identitypipeline"}
             "orionbqfilter"
             {:transform-jar "orion-transform-0.1-ALPHA.jar"
              :pail "build-artifacts-public-eu"
              :key "orion-transform"}}
 :sources   {"testendpoint" {:type "gae"}
             "orion"        {:type "gae"}}
 :sinks     {"testsink"     {:type "gcs" :bucket "testsink-bucket"}
             "orionsink"    {:type "gcs" :bucket "orionsinkbucket"}
             "orionbq"      {:type "bq" 
                             :bigQueryDataset "hx_orion_staging" 
                             :bigQueryTable "hx_orion"
                             :bigQuerySchema "/home/ubuntu/demo-config/orion.json"}}
 :edges     [{:origin "testendpoint" :targets ["testpipeline"]}
             {:origin "testpipeline" :targets ["testsink"]}
             {:origin "orion" :targets ["orionbqfilter" "orionidentitypipe"]}
             {:origin "orionidentitypipe" :targets ["orionsink"]}
             {:origin "orionbqfilter" :targets ["orionbq"]}]}

:config section

General Sossity configuration.

:remote-composer-classpath - path to angleddream on the Sossity CircleCI node. Rarely changes.

:remote-libs-path - path on the Sossity CircleCI node where pipeline jars are compiled to. Rarely changes.

:sink-resource-version - Rarely needed. Increment this to force a redeploy of all sink containers in Terraform. This is a workaround for the fact that we can't easily detect when a new Docker image (the file sinks are Docker images) is written to the repository.

:source-resource-version - same as :sink-resource-version, but for the App Engine REST endpoints used for ingestion.

:error-buckets - automatically create sinks and file buckets to capture all data from error pipelines. Each sink is a small GCE node.

:system-jar-info - name, bucket (written as :pail), and key for angleddream and sossity in Google Cloud Storage

:appengine-gstoragekey - Google Storage key where the source (App Engine) jar resides

:default-sink-docker-image - default docker image to use for the sinks

:cluster section

Kubernetes cluster configuration. Currently used only for sinks.

:name - Kubernetes cluster name.

:initial_node_count - initial number of machines (nodes) in the Kubernetes cluster

:master_auth {:username :password} - Kubernetes master authentication info. Rarely used.

:node_config {:oauth_scopes} - OAuth scopes for cluster nodes. Should never change.

:node_config {:machine_type} - GCE instance type for nodes.

:opts section

Command line options passed to every Angled-Dream/Cloud Dataflow job. Capitalization matters.

:maxNumWorkers - maximum number of workers per pipeline.

:numWorkers - initial number of workers per pipeline.

:zone - GCE zone for workers

:workerMachineType - GCE instance type

:stagingLocation - where intermediate jars and files are stored for every Cloud Dataflow job. Must be in the same region as the compute nodes.
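To show why capitalization matters, here is a minimal sketch of how an :opts map could be turned into command line flags. The helper name opts->args is hypothetical and the real conversion inside Sossity may differ. The point is only that each key name is copied into the flag unchanged.

```clojure
;; Hypothetical helper: illustrates the idea, not the actual Sossity code.
(defn opts->args
  "Turn an :opts map into a sorted vector of --name=value strings."
  [opts]
  (->> opts
       (map (fn [[k v]] (str "--" (name k) "=" v))) ; key case is preserved
       sort
       vec))

(opts->args {:maxNumWorkers "1" :numWorkers "1" :zone "europe-west1-c"})
;; => ["--maxNumWorkers=1" "--numWorkers=1" "--zone=europe-west1-c"]
```

A key written as :maxnumworkers would produce --maxnumworkers=1, which is a different flag from --maxNumWorkers=1.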

:containers section

"name" - container name

:image - Docker image path

:resource-version - hardcoded container version. Change this to trigger a resource redeploy in Terraform.

:provider section

Provides Google account information.

:credentials - path to credentials on CircleCI node. Should rarely change.

:project - Google Cloud project where everything resides. VERY IMPORTANT.

:pipelines section

Pipelines are Cloud Dataflow jobs, with one PubSub input and multiple PubSub outputs.

{"pipeline_name" {...}} - pipeline name. Used throughout the config. Can only contain [a-z, -, A-Z, 0-9]. Don't actually use the word "pipeline_name".

:transform-jar - name of jar file compiled using the angled-dream SDK.

:pail - Google Cloud Storage bucket of compiled jar

:key - key to GCS path containing jar file

:container-deps - containers this pipeline job can request data from. The container name and IP address are passed to the pipeline via angled-dream as a CLI argument, e.g. --containerDeps=riid|192.158.30.2,hapi|129.33.2.1
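The --containerDeps value is a comma-separated list of name|ip pairs. The sketch below builds that string from a list of pairs. The function name container-deps-arg is hypothetical and only shows the format. The IP addresses are the ones from the example above and would be supplied at deploy time in practice.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: shows the argument format only.
(defn container-deps-arg
  "Build a --containerDeps flag from a sequence of [name ip] pairs."
  [deps]
  (str "--containerDeps="
       (str/join "," (map (fn [[container-name ip]]
                            (str container-name "|" ip))
                          deps))))

(container-deps-arg [["riid" "192.158.30.2"] ["hapi" "129.33.2.1"]])
;; => "--containerDeps=riid|192.158.30.2,hapi|129.33.2.1"
```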

:sources section

Sources are at the head of a Pipeline flow -- currently they are autoscaling REST endpoints backed by Google App Engine.

A source can only have one output, for reliability and performance at the REST endpoint. Use an identity pipeline to "fan-out" to multiple pipelines if needed.

{"source_name" {...}} - source name. Used throughout the config. Can only contain [a-z, -, A-Z, 0-9]. Don't actually use the word "source_name".

:type - source type. Currently only gae (Google App Engine) is supported.
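A fan-out through an identity pipeline could be written as in the fragment below. The names "events", "events-fanout", "cleaner", and "archiver" are made up for illustration. "cleaner" and "archiver" are assumed to be declared in :pipelines elsewhere, and the jar details are copied from the identity pipeline in the example config.

```clojure
;; Hypothetical fragment: all node names here are illustrative.
{:sources   {"events" {:type "gae"}}
 :pipelines {"events-fanout" {:transform-jar "identitypipeline-0.1-ALPHA.jar"
                              :pail          "build-artifacts-public-eu"
                              :key           "identitypipeline"}}
 ;; The source has exactly one target; the identity pipeline does the fan-out.
 :edges     [{:origin "events"        :targets ["events-fanout"]}
             {:origin "events-fanout" :targets ["cleaner" "archiver"]}]}
```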

:sinks section

Sinks are at the tail of a Pipeline flow -- usually powered by a program on a Kubernetes container which consumes the PubSub and outputs somewhere.

{"sink_name" {...}} - sink name. Used throughout the config. Can only contain [a-z, -, A-Z, 0-9]. Don't actually use the word "sink_name".

:type - sink type. Can only be gcs (Google Cloud Storage) or bq (BigQuery) currently.

:replicas - number of workers that read from a single PubSub

:bucket - Google Cloud Storage bucket to output JSON to.

:bigQueryDataset - BigQuery Dataset destination

:bigQueryTable - BigQuery table destination

:bigQuerySchema - CircleCI path and file to BigQuery schema. Should be in same GitHub repo as config.clj.
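Pipeline, source, and sink names share the same character rule, so it can be checked before deploying. The sketch below is one way to do that. The function names valid-name? and invalid-names are hypothetical, and whether Sossity performs a similar check itself is not stated here.

```clojure
;; Hypothetical check for the [a-z, -, A-Z, 0-9] naming rule.
(defn valid-name?
  "True when s is a non-empty string of letters, digits, and hyphens."
  [s]
  (boolean (and (string? s) (re-matches #"[A-Za-z0-9-]+" s))))

(defn invalid-names
  "Return the set of names in :pipelines, :sources, and :sinks that break the rule."
  [config]
  (->> [:pipelines :sources :sinks]
       (mapcat #(keys (get config %)))
       (remove valid-name?)
       set))

(invalid-names {:pipelines {"orionbqfilter" {} "orion_bq" {}}
                :sinks     {"test sink" {}}})
;; => #{"orion_bq" "test sink"}
```

The underscore and the space are rejected, while "orionbqfilter" passes.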

:edges section

Edges describe how sources, pipelines, and sinks move data to each other. Sossity takes care of the dependencies and resources.

:origin - data origin -- can be a name of a pipeline or source. Only one origin per pipeline/source.

:targets[] - vector of target pipelines or sinks. One origin can have many targets.
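Because edges refer to other sections by name, a typo in an :origin or :targets entry is easy to make. The sketch below lists every name used in :edges that is not declared as a source, pipeline, or sink. undeclared-nodes is a hypothetical helper, not part of Sossity.

```clojure
;; Hypothetical helper: finds edge endpoints that are not declared anywhere.
(defn undeclared-nodes
  "Return the set of names used in :edges but missing from
  :sources, :pipelines, and :sinks."
  [config]
  (let [declared (set (mapcat #(keys (get config %))
                              [:sources :pipelines :sinks]))
        used     (mapcat (fn [{:keys [origin targets]}] (cons origin targets))
                         (:edges config))]
    (set (remove declared used))))

(undeclared-nodes
  {:sources   {"testendpoint" {:type "gae"}}
   :pipelines {"testpipeline" {}}
   :sinks     {"testsink" {:type "gcs" :bucket "testsink-bucket"}}
   :edges     [{:origin "testendpoint" :targets ["testpipeline"]}
               {:origin "testpipeline" :targets ["testsnk"]}]})
;; => #{"testsnk"}
```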

testconfig.clj

This file is for the Sossity Simulator only. It provides local information about pipeline jars and other test data.


{:config    {:test-output "testoutput/"}
 :pipelines {"testpipeline"      {:local-jar-path "target/"
                                  :composer-class "com.acacia.identitypipeline"}
             "orionidentitypipe" {:local-jar-path "target/"
                                  :composer-class "com.acacia.identitypipeline"}
             "orionbqfilter"     {:local-jar-path "target/"
                                  :composer-class "com.acacia.samples.oriontransform"}}
 :sources   {"testendpoint" {:test-input "test-data/testendpoint.json"}
             "orion"        {:test-input "test-data/orion.json"}}}

:config

:test-output - directory (absolute or relative) where test result JSON is written

:pipelines

:local-jar-path - local location of a pipeline's jar file

:composer-class - name of class in jar file which implements AbstractTransformComposer. There should be only one.

:sources

:test-input - path to a file of well-formed JSON (all objects contained within a single [] array) that simulates input to the named source