Discussions around Serialization Systems and Schema Registry - co-jo/pravega GitHub Wiki

In context of discussing how schema registry can be beneficial for different user scenarios we classified applications into different categories and then looked at different serialization systems and how a schema registry could be enabler for these different classes of applications.

We had a valuable discussion around classifying different serialization systems and their encoding schemes and we discussed how we could make the structure of the data available to pravega reader applications that can benefit from it. Capturing the discussion in a wiki for benefit of others who want to have a deeper view of design choices we made for schema registry.

Application classification:

  1. Class 1 applications: The class of applications that work with specific structure of data and deserialize all events into this specific structure or generated objects.
  2. Class 2 applications: These are generic applications that in principle can interpret streaming data by dynamically reacting to the structure of the data at write time (or choose/receive a compatible schema dynamically at run time and apply schema on read). Examples of such applications are SQL CLI, UI sample data etc.

The schema registry is particularly useful for class 2 applications by providing a mechanism of bringing in knowledge of structure of data in a stream.

For class 1 applications that use formats like avro which need writer schema at deserialization, warrant presence of a registry.

But it is important to discuss about these applications in the context of different serialization systems and how schema registry can be relevant for a subset of these.

Following is an excerpt from the classification @Tom Kaitchuck defined (with slight restructuring).

A) Formats that assume structure being written (schema is code)
- Bincode
- Most custom formats

Event formats in group A lack a formal schema even if there is a spec for the format it is being written in because the 
structure of the individual message is dependent on the data being written. As such there is generally no way to read data
 written in these formats in a "class 2" app. A schema registry does not help here because there is no schema to register.

B) Formats that use external keys (but not structure). (Schema is used to do codegen)
- Protobufs
- Thrift
- Ion (in compact mode)

Events in format group B have a format specification. It's implicitly used by their readers and writers as they have code
 generated from the schema or vice versa. 
These application's data can be read by a "class 2" app. However this requires special format specific translation logic to
 download the schema for the data and preform translation into a more generic format. (This is separate from the code that
 the applications is using to read and write the data) For this to work the "class 2" application would need to be able to 
obtain some schema that can decode the data (It is ok if it is an older or newer version). So this requires a way to 
download schemas at runtime. 

C) Formats that include keys and structure (No schema needed)
- Json
- Bson
- Cbor
- MessagePack

Events in format group C do not necessarily require an external schema and are their own format specification. "Class 1" 
apps can add and remove fields at will, and will only break other applications if it changes a semantic another app depends 
on. (This cannot be programmatically enforced). "Class 2" apps love these formats because they can just grab any old data 
off the wire and decode it. No schema is needed as it can be inferred from the data itself. 

D) Language specific formats. (Serialized versions of classes)
- Java serialization
- Pickle

I don't know that we need to support Language specific formats like Pickle because these aren't generally used in Streams 
because it limits compatability. To the extent that we might support them, they rarely are used in conjunction with "class 
2" applications, but if they were it would be via a "SerDe".

Av) Avro - Assumes structure being written by explicitly obtaining the write time structure.
Avro strongly depends on having a registry, and is why most of the registries that exist were created. It needs one to:
Register writer schema and add a reference to it with the serialized data.
Check that all schema changes are compatable at registration time.
Register all schema changes to the repo before any code can use them.

G) Graphql - Graphql servers provide explicit explorations and interactive definition of the schema by the client at runtime.
Graphql totally different from the above, but was designed to avoid the need for a schema registry by allowing the caller 
to define the schema they wish to receive data in. 
  1. Group A:

A "SerDe" repository that provides applications with a SerDe (short for SerializerDeserializer) implementation for data in a stream. This can be supported for languages like java or python where we deserialize the object into either a queryable structure like a Map or use reflection on a Java object to get the structure.

This can be a future direction that can support schemaless serialization systems.

  1. Group B:

These are formats that have an external schema that describes the structure of the data. As such Class 1 applications do not require the writer schema. However they can benefit from a central service that validates and ensures schemas evolve in conformance with chosen compatibility strategy for a stream, whereby safeguarding downstream applications.

The registry is particularly beneficial for class 2 applications for format group B.

  1. Group C:

Even though these formats are typically self describing, there is still merit in using and sharing a schema to describe the structure and validation constraints. These can be useful for Class 2 writer apps that read data from some data source ingest them into a stream as JSON document. A JSON schema can also describe field type and assign semantic meaning like whether a string is Date or an email.

  1. Group D:

Group D is similar to group A in its characteristic that there may not be an external schema that describes the structure of the data. This makes it a candidate for "SerDe".

  1. Group Av:

Both class 1 and class 2 class 1 and class 2 applications require a registry to work with avro serialization format. An alternative to the registry would have been to include schema with the serialized data, which would be a costly proposition. Avro also provides very good out of box libraries to compare schemas for compatibility and hence a registry can easily provide strong support for evolution of avro schemas that satisfy compatibility strategy for a stream.