1.4 Encoding and Evolution

from Designing-Data-Intensive-Applications

Formats for Encoding Data

Data has to be encoded as a self-contained sequence of bytes before it can be sent over a network or written to disk.

When data is in memory, it uses pointers - addresses the OS happened to pick when it stored the data. Those addresses are meaningless to any other process, so unless the file is a memory map or a specialized format like a columnar storage file, encoding means removing the pointers and associating keys directly with their values.

JSON, XML, CSV

  • easy to learn
  • widely adopted
  • fast to develop with (less stable)
  • bigger files than needed
  • schemas can be cumbersome
  • unstable without a schema

Flexible (sort of) and usually good enough. They can be schematized, but doing so tends to be costly, and byte streams/binary data are only supported via hacks (e.g. Base64-encoding inside JSON).

Binary Encodings

  • schematized out of the box (very stable)
  • small files
  • fast to develop with once fully adopted
  • schematized out of the box (picky, inflexible)
  • learning curve
  • not as widely adopted

ASN.1 is a predecessor

MessagePack, Thrift, Avro, and Protocol Buffers are the major players in the binary encoding world (MessagePack is a schemaless binary encoding of JSON; Thrift, Avro, and Protocol Buffers are schema-driven). The schema-driven formats have a much higher learning curve upfront, but make for stable, statically typed message transfers that are extremely compact. Schemas are limited to a handful of primitive data types.

https://developers.google.com/protocol-buffers
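As a rough illustration of the size difference, here is a sketch comparing a textual JSON encoding with a binary MessagePack encoding (assuming the third-party msgpack package is installed; the record and field names are just an example):

```python
# Compare a textual JSON encoding with a binary MessagePack encoding of the
# same record. Schema-driven formats (Thrift, Avro, Protocol Buffers) shrink
# the payload further by omitting field names entirely.
import json
import msgpack  # pip install msgpack

record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

as_json = json.dumps(record).encode("utf-8")
as_msgpack = msgpack.packb(record)

print(len(as_json), len(as_msgpack))  # the binary encoding is noticeably smaller
```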

Modes of Dataflow

Thru DBs

DB records are just messages to future users.

Backward compatibility is important - if we wrote in schema v.1, v.2 should be able to read it without error.

Forward compatibility is also helpful - if we wrote with v2, v1 should be able to read it. Think of an app in a canary deployment, with different versions running at the same time: v2 might write something v1 needs to read. Also consider the scenario of a downgrade/rollback.
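A minimal sketch of what tolerant reading looks like in practice (the field names and versions are hypothetical):

```python
# Backward compatibility: v2 code reads a record written by v1 and fills in a
# default for the field that v1 never wrote.
# Forward compatibility: v1 code reads a record written by v2 and simply
# ignores the field it does not know about.

V2_DEFAULTS = {"favorite_color": None}  # optional field added in schema v2

def read_as_v2(record: dict) -> dict:
    return {**V2_DEFAULTS, **record}

def read_as_v1(record: dict) -> dict:
    v1_fields = {"user_name", "favorite_number"}
    return {k: v for k, v in record.items() if k in v1_fields}

v1_record = {"user_name": "Martin", "favorite_number": 1337}
v2_record = {**v1_record, "favorite_color": "blue"}

print(read_as_v2(v1_record))  # old data readable by new code
print(read_as_v1(v2_record))  # new data readable by old code
```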

Data outlives code - remember to account for any old data already in your system when transitioning schemas.

Proper migrations help with a lot of this, but should be written with all this in mind. https://benchling.engineering/move-fast-and-migrate-things-how-we-automated-migrations-in-postgres-d60aba0fc3d4

Services (REST, RPC)

When building a Service Oriented Architecture (SOA, more recently rebranded as microservices), the goal is to build services that can be deployed and evolved frequently with minimal or no coordination between teams. Thinking about backward/forward compatibility is essential. https://www.fastly.com/blog/microservices-war-stories - has a video talk too

REST vs SOAP

REST (a philosophy)

  • Highly flexible (it's just a general design philosophy)
  • stateless, fault tolerant
  • easy to learn
  • FAST to set up
  • JSON, CSV, XML, etc
  • Overtaking SOAP
  • Simpler, easier to learn, supports any language by default
  • Doesn't try too hard to achieve "location transparency" or mask the fact that the call is performed remotely
  • Flexibility means less stability: more humans reading docs, guessing at messages to write back and forth
  • less tooling
  • no code gen
  • error handling is less standardized (often just HTTP status codes)

SOAP (a protocol)

  • WSDL (a machine-readable description of the API) enables code generation and is stable
  • Standardized with explicit details for implementation via frameworks
  • Still common in large enterprise systems
  • WSDL is not easily human readable
  • Heavy reliance on tools that read WSDL, can rarely be interacted with from scratch
  • If using a language that isn't supported in your framework, integration is difficult

The problems with RPCs

The goal of RPC is to make remote procedure calls work like local function calls ("location transparency" - which is a terrible name for this).

This is generally really really hard.

  • Local functions fail or succeed based on things within the local context. What happens remotely is not as clear - is it the network? The remote machine? Some unexpected inner working of the remote procedure that was changed? Did our request change remote state even though we never got a reply? Communicating data remotely is also harder (see the encoding discussion above, and the sketch after this list).

  • Local function calls take roughly the same time from call to call; network latency and remote machine delays vary wildly, and a remote call can even hang indefinitely.
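A minimal sketch of the problem, assuming a hypothetical billing endpoint and the third-party requests package - the point is that questions like "did the remote side already do the work?" simply don't exist for local function calls:

```python
import requests

def charge_customer(customer_id: str, amount_cents: int) -> dict:
    # A local function either returns or raises for purely local reasons.
    # This remote call can also time out, hit a dead network, or reach a
    # server that performed the charge but failed to send the reply...
    try:
        resp = requests.post(
            "https://billing.example.com/charges",  # hypothetical service
            json={"customer_id": customer_id, "amount_cents": amount_cents},
            timeout=2.0,  # latency is unpredictable, so bound the wait
        )
        resp.raise_for_status()
        return resp.json()
    except requests.Timeout:
        # ...so blindly retrying here could charge the customer twice unless
        # the endpoint is idempotent.
        raise
```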

Current directions for RPCs

Most modern RPC frameworks admit that the call is remote and design around that fact rather than hiding it.

Message Passing

A message broker passes messages between processes asynchronously (so it's not quite traditional RPC)

Brokers are often flexible about what they carry; senders usually don't receive a reply, they just "trust the broker" to do the work.

Examples - RabbitMQ, ActiveMQ, Kafka. The space was historically dominated by proprietary products, but OSS is taking ground.

Producer -> queue (~= topic) -> subscriber (~= consumer); a consumer can become a producer by publishing a new message to another topic.

Largely un-schematized bytes of data (just version the topics for compatibility, I guess?)
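A minimal sketch of that flow, assuming a local Kafka broker and the third-party kafka-python package; the broker only sees opaque bytes, so the choice of encoding (plain JSON here) and the versioned topic name are up to us:

```python
import json
from kafka import KafkaConsumer, KafkaProducer

# Producer: publish a message to a versioned topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("user-events-v1", {"user_name": "Martin", "event": "login"})
producer.flush()

# Consumer: subscribe to the topic and process messages as they arrive.
consumer = KafkaConsumer(
    "user-events-v1",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)  # a consumer could publish a new message in turn
    break
```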

Distributed Actor Frameworks

Actors are used instead of dealing directly with threads - each actor encapsulates some private state (it might represent one client or entity) and communicates with other actors only via asynchronous messages.

Message delivery is not guaranteed - messages may be lost in certain error scenarios (wha? yes, the sender may just have to try again)

Can be for local OR remote message passing (remotes use a broker)

Examples - Akka, Orleans (Microsoft), Erlang OTP
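A rough local-only sketch of the actor idea in plain Python (not the actual API of any of these frameworks): one thread drains one mailbox, and the state stays private to the actor, so no locks are needed:

```python
import queue
import threading
import time

class CounterActor:
    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0  # private state; only the actor's own thread touches it
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, message):
        # Fire-and-forget: the sender does not wait for a reply.
        self._mailbox.put(message)

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message == "increment":
                self._count += 1
            elif message == "report":
                print("count =", self._count)

actor = CounterActor()
actor.send("increment")
actor.send("increment")
actor.send("report")
time.sleep(0.1)  # give the actor thread a moment to drain its mailbox
```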

Summary

THINK about backward and forward compatibility in all of your encoding and dataflow.

There are lots of options; take time to understand the pros and cons and your needs. Get ahead by understanding the anti-patterns in each encoding and dataflow option.