Avro - ghdrako/doc_snipets GitHub Wiki

Avro is row-based: it stores all the fields of each record together. This makes it a good choice when all the fields of a record need to be accessed together.

Avro stores data in a binary format and describes it with a schema written in JSON. An Avro data file carries the schema along with the data, so everything a program needs to process the file is in one place. Unlike Protocol Buffers, Avro does not require generated code to read messages, which makes it a good fit for scripting languages.

One of Avro’s key benefits is robust support for schema changes through a mechanism called schema evolution. It lets Avro handle missing, new, and altered fields gracefully, so data can remain both backward and forward compatible as long as changes follow the resolution rules (for example, added fields carry default values).

When Avro decodes data, it can use two different schemas. The writer's schema comes with the encoded data and reflects what the publisher encoded; the reader's schema describes what the consumer expects. Avro resolves the differences between the two so the reader gets a usable result.
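The resolution step can be sketched in a few lines of plain Python. This is a simplified illustration for flat records only, not the Avro library: the function name resolve_record, the cars record, and the Color and Year fields are all hypothetical, and real Avro resolution also covers type promotion, aliases, unions, and nested types.

```python
# Simplified illustration of Avro-style schema resolution for flat records.
# This is NOT the Avro library; it only mimics the field-matching rules.

_MISSING = object()  # sentinel: distinguishes "no default" from a default of None


def resolve_record(datum, writer_schema, reader_schema):
    """Project a record written with writer_schema onto reader_schema."""
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            # Field known to both sides: take the written value.
            result[name] = datum[name]
        else:
            # Field only the reader knows: fall back to the reader's default.
            default = field.get("default", _MISSING)
            if default is _MISSING:
                raise ValueError(f"no value and no default for field {name!r}")
            result[name] = default
    # Fields only the writer knows are never copied, so they are dropped.
    return result


writer_schema = {
    "type": "record", "name": "cars",
    "fields": [
        {"name": "Make", "type": "string"},
        {"name": "ID", "type": "int"},
        {"name": "Color", "type": "string"},  # hypothetical: unknown to the reader
    ],
}
reader_schema = {
    "type": "record", "name": "cars",
    "fields": [
        {"name": "Make", "type": "string"},
        {"name": "ID", "type": "int"},
        {"name": "Year", "type": "int", "default": 0},  # hypothetical: unknown to the writer
    ],
}

written = {"Make": "Tesla", "ID": 42, "Color": "red"}
resolved = resolve_record(written, writer_schema, reader_schema)
print(resolved)  # {'Make': 'Tesla', 'ID': 42, 'Year': 0}
```

The writer-only field is ignored and the reader-only field is filled from its default, which is why adding a field without a default breaks readers of older data.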

Avro supports primitive types (null, boolean, int, long, float, double, bytes, and string), complex types (records, enums, arrays, maps, unions, and fixed), and nesting of these types.

The two primary uses for Avro are data serialization and remote procedure calls.

Here is an example schema file for the Avro format. Notice how we have two “columns” of data called Make and ID in the awesome_startup namespace:

{
  "type" : "record",
  "namespace" : "awesome_startup",
  "name" : "cars",
  "fields" : [
     { "name" : "Make" , "type" : "string" },
     { "name" : "ID" , "type" : "int" }
  ]
}

Avro is a popular open data format often found in data lakes. It serializes data to binary, which makes it small compared to text formats such as JSON. Avro handles both structured and semi-structured data. The schema that defines the data types and structure is itself written in JSON and is commonly kept in a separate schema file (by convention with the .avsc extension). With this schema, you can evolve the structure over time while keeping backward compatibility.
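To make the size difference concrete, the sketch below hand-encodes one record for the cars schema above using only the standard library. It follows Avro's binary encoding rules (int and long as zig-zag variable-length integers, string as a length followed by UTF-8 bytes, record fields written in schema order with no names). The helper names and the sample values are assumptions for illustration; in practice an Avro library does this for you, and a data file adds a header with the schema on top of the record bodies.

```python
import json


def zigzag_varint(n):
    """Encode a signed integer the way Avro encodes int and long values."""
    # Zig-zag maps small magnitudes to small unsigned numbers:
    # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(s):
    """Avro string: byte length as a long, then the UTF-8 bytes."""
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data


def encode_car(car):
    """Encode one cars record: fields in schema order, no names or separators."""
    return encode_string(car["Make"]) + zigzag_varint(car["ID"])


car = {"Make": "Tesla", "ID": 42}  # sample values, not from the original page
avro_body = encode_car(car)
json_body = json.dumps(car).encode("utf-8")

print(avro_body.hex())                  # 0a5465736c6154
print(len(avro_body), len(json_body))   # 7 bytes versus the JSON text length
```

The binary body holds no field names at all, which is where the savings come from and also why the bytes cannot be interpreted without the schema.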

Avro uses row storage: data is written and read record by record. Row storage is ideal when you look up a record and read all of its fields.

Disadvantages

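What happens when the schema is lost? Avro's binary encoding carries no field names or type tags, so data written without an embedded schema is typically unusable once the schema is gone. Row storage is a good match for CRUD-style workflows, but many data-intensive workloads read a whole column at a time, and that is costly in Avro because every record has to be read to reach a single field.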
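The column-access cost can be illustrated with a toy comparison in plain Python. This is only a sketch of the access pattern, with made-up records and hypothetical function names; it does not read real Avro or columnar files.

```python
# Three made-up records with the same two fields as the cars example.
rows = [
    {"Make": "Tesla", "ID": 1},
    {"Make": "Fiat", "ID": 2},
    {"Make": "Volvo", "ID": 3},
]


def sum_ids_row_layout(rows):
    """Row layout: every field of every record is visited to reach ID."""
    values_touched = 0
    total = 0
    for record in rows:
        for name, value in record.items():
            values_touched += 1
            if name == "ID":
                total += value
    return total, values_touched


def to_columns(rows):
    """Pivot records into one list per field, as a columnar format stores them."""
    return {name: [record[name] for record in rows] for name in rows[0]}


def sum_ids_column_layout(columns):
    """Column layout: only the ID column is visited."""
    ids = columns["ID"]
    return sum(ids), len(ids)


print(sum_ids_row_layout(rows))                  # (6, 6)
print(sum_ids_column_layout(to_columns(rows)))   # (6, 3)
```

Both layouts give the same total, but the row layout touches every value in every record, and the gap grows with the number of fields per record.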

A larger schema example that uses doc strings, default values, and logical types (decimal and date). Defaults for bytes fields are JSON strings whose code points 0-255 stand for byte values, so a decimal zero is written as "\u0000" rather than "0":

{
  "type": "record",
  "name": "CollectingFeeApproaching",
  "namespace": "events.deadline.notification.v1",
  "fields": [
    {"name": "currency", "type": "string", "doc": "Calculation currency. ISO 4217",
      "default": "PLN"
    },
    {"name": "amount", "type": {"type": "bytes", "logicalType": "decimal","precision": 15,"scale": 2}, "doc": "Fee amount",
      "default": "\u0000"
    },
    {"name": "missingTransactionAmount", "type": {"type": "bytes", "logicalType": "decimal","precision": 15,"scale": 2}, "doc": "Missing transaction value",
      "default": "\u0000"
    },
    {"name": "missingTransactions","type": "int", "doc": "Missing number of transactions",
      "default": 0
    },
    {"name": "chargeCycleEndDate", "type": {"type": "int", "logicalType": "date"}, "doc": "End date of the fee cycle"},
    {"name": "withMobileTransactions","type": "boolean", "doc": "Mobile transaction flag indicating whether the card's transaction activity tied to the fee includes mobile transactions",
      "default": false
    },
    {"name": "pid","type": "string", "doc": "Customer identifier"},
    {"name": "cardNumber","type": "string", "doc": "Card number for which the notification was generated"}
  ]
}
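The logical types in this schema are thin conventions on top of primitive types: a decimal is stored as the big-endian two's-complement bytes of its unscaled integer, and a date is an int counting days since 1970-01-01. The sketch below shows both conversions with the standard library. The function names are hypothetical, the sample values are made up, and the precision limit (15 digits here) is not checked.

```python
from datetime import date
from decimal import Decimal


def encode_decimal(value, scale):
    """Bytes for an Avro decimal: big-endian two's complement of the unscaled integer."""
    scaled = value.scaleb(scale)  # e.g. 12.34 with scale 2 -> 1234
    if scaled != scaled.to_integral_value():
        raise ValueError("value has more fractional digits than the scale allows")
    unscaled = int(scaled)
    # A byte count large enough to hold the signed value (at least one byte).
    length = max(1, (unscaled.bit_length() + 8) // 8)
    return unscaled.to_bytes(length, byteorder="big", signed=True)


def decode_decimal(data, scale):
    """Inverse of encode_decimal: bytes back to a Decimal with the given scale."""
    unscaled = int.from_bytes(data, byteorder="big", signed=True)
    return Decimal(unscaled).scaleb(-scale)


def encode_date(d):
    """Avro date: number of days since the Unix epoch, 1970-01-01."""
    return (d - date(1970, 1, 1)).days


amount = Decimal("12.34")            # sample fee amount
encoded = encode_decimal(amount, 2)
print(encoded.hex())                 # 04d2
print(decode_decimal(encoded, 2))    # 12.34

# A bytes default is a JSON string whose code points are the byte values.
print(decode_decimal("0".encode("latin-1"), 2))        # 0.48, since "0" is byte 0x30
print(decode_decimal("\u0000".encode("latin-1"), 2))   # 0.00

print(encode_date(date(1970, 1, 2)))  # 1
```

This is why a default of "0" on a decimal field does not mean zero: the character is read as the byte 0x30, which is the unscaled value 48.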