Guide for developers - snowplow-archive/sauna GitHub Wiki
We welcome contributions to Sauna! This guide is for developers that wish to contribute to the core platform or add a new integration.
If you are a data engineer looking to integrate Sauna into ETL pipelines, please check out the Guide for analysts first.
This guide covers Sauna v.0.1.0.
Sauna 0.1.0 is built on top of the Akka akka [Actor model] actor-model, representing most of its' core entities as actors - independent units of computation, storing a state and connection to parents and (optionally) children.
Sauna's entities are configured using self-describing Avro entities. The SaunaOptions
object contains a parser which parses command-line arguments and extracts SaunaSettings
, which in turn are immutable snapshots of configurations.
Sauna starts with a root "supervisor" actor, called Mediator
. This is the most stateful actor, mediating other entities and routing messages between them. It stores a state with information about which events were sent to which entities to maintain consistency and warn the user if some events are lost (instead of just awaiting for response from the responder, which can take quite a while to process.)
Unlike other actors, which can immediately recover from a crash, a crash of Mediator
is effectively a crash of the entire application. (This also means Mediator
is essentially a singleton. So far Sauna can be run only on a single machine, but in the future it will be a cluster handled by a similar mediator.)
Mediator
holds references to all core entities: observers, responders and loggers. These core actors can communicate with Mediator
simply by sending a message to context.parent
. The references are constructed depending on the SaunaSettings
object, which is passed as an argument of Mediator
's constructor.
Observers are actors which are a source of lowest-level events. Observer events are usually raw ("file created", "HTTP request received") and not necessary awaited by responders. Currently only two "batch" events - LocalFilePublished
and S3FilePublished
- are supported, both of which can be intercepted and processed by any of Responder
s. In the future, we're planning to support a wide variety of different events coming from various SaaS vendors and applications, which can have streaming/batch/singleton/etc. nature.
Usually when an Observer
emits an event, it "bubbles" up to Mediator
and then gets broadcasted to all Responder
s, which in turn decide whether this event should be processed and acted on or skipped. Observer events must contain a path
to an underlying file or S3 object, streamContent
to iterate over the file's/object's content and observer
to help Mediator
keep track of sources that are currently processing events.
Observers can use some auxiliary constructs, such as dedicated threads to poll services.
Responders are the bread and butter of Sauna. These entities contain processing logic and carry out various actions depending on events they receive. Each responder extends the Responder[RE]
trait, where RE
is a specific responder event type.
A responder event is an entity which can be extracted by a specific responder from an observer event (described above) using the extractEvent
method. If the extraction is successful, the enriched event is passed to the extracted event will be passed to the process
method which contains all custom logic. process
should be non-blocking and must send a message inherited from ResponderResult
when processing is finished.
Objects in this package are designed to communicate with third-party APIs. Optimizely
and Sendgrid
have implemented several necessary features.
Loggers provide feedback to end users based on messages received from observers and/or responders. There are two types of messages:
- Structured
Manifestation
(conforms to a specified schema; can have several required fields). - Unstructured
Notification
(a simple wrapper overString
).
DynamodbLogger
can be used for structured messages, HipchatLogger
- for unstructured, and StdoutLogger
for both of them at once.
Shares common variables (TSVFormat
, WSClient
) in single instance.