binary file format Avro - ghdrako/doc_snipets GitHub Wiki

Avro is an open-source, row-based data serialization format. It is widely used in the Hadoop ecosystem, and organizations use it in modern data platforms to persist large volumes of raw data for later ETL processing.

Because Avro is row-based, it stores all the fields of each record together. This makes it a good fit for workloads that read or write whole records at a time.

Avro stores data in a compact binary format and describes that data with a schema written in JSON. An Avro data file carries both the schema and the data, so everything a program needs to process the file is in one place. Unlike similar systems such as Protocol Buffers, Avro clients don’t need generated code to read messages. This makes Avro a good choice for scripting languages.

An Avro file consists of a header and one or more data blocks. The header contains file metadata, including the schema and the name of the compression codec, followed by a randomly generated sync marker that helps in splitting the file. Each data block consists of the following:

  • Count of objects in that block
  • Size of that block (post-compression)
  • Data stored as serialized objects in its compressed form
  • File’s 16-byte sync marker

This layout helps to efficiently extract or skip data while also detecting any corrupt blocks.
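The layout described above can be sketched with nothing but the Python standard library. This is a hand-rolled illustration, not the official Avro library: the helper names (`write_container`, `read_container`, and so on) are made up for this example, only the uncompressed `null` codec is handled, and the block payload is passed in as already-serialized bytes.

```python
import io
import json
import os

MAGIC = b"Obj\x01"  # every Avro container file starts with these 4 bytes


def write_long(out, n):
    """Write an Avro int/long: zigzag, then 7 bits per byte, low bits first."""
    n = (n << 1) ^ (n >> 63)
    while n > 0x7F:
        out.write(bytes([(n & 0x7F) | 0x80]))
        n >>= 7
    out.write(bytes([n]))


def read_long(inp):
    shift = result = 0
    while True:
        byte = inp.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return (result >> 1) ^ -(result & 1)  # undo zigzag
        shift += 7


def write_bytes(out, data):
    write_long(out, len(data))  # length prefix, then the raw bytes
    out.write(data)


def read_bytes(inp):
    return inp.read(read_long(inp))


def write_container(out, schema, blocks):
    """blocks: list of (object_count, serialized_bytes). Returns the sync marker."""
    sync = os.urandom(16)  # randomly generated 16-byte sync marker
    meta = {"avro.schema": json.dumps(schema).encode(), "avro.codec": b"null"}
    out.write(MAGIC)
    write_long(out, len(meta))  # metadata is an Avro map: entry count...
    for key, value in meta.items():
        write_bytes(out, key.encode())
        write_bytes(out, value)
    write_long(out, 0)  # ...terminated by a zero count
    out.write(sync)
    for count, data in blocks:
        write_long(out, count)  # count of objects in the block
        write_bytes(out, data)  # size of the block, then the data itself
        out.write(sync)  # sync marker closes every block
    return sync


def read_container(inp):
    if inp.read(4) != MAGIC:
        raise ValueError("not an Avro container file")
    meta = {}
    while (n := read_long(inp)) != 0:
        if n < 0:  # a negative count is followed by the map block's byte size
            n = -n
            read_long(inp)
        for _ in range(n):
            key = read_bytes(inp).decode()
            meta[key] = read_bytes(inp)
    sync = inp.read(16)
    blocks = []
    while inp.read(1):  # peek: is there another block?
        inp.seek(-1, io.SEEK_CUR)
        count = read_long(inp)
        data = read_bytes(inp)
        if inp.read(16) != sync:
            raise ValueError("corrupt block: sync marker mismatch")
        blocks.append((count, data))
    return json.loads(meta["avro.schema"]), blocks


# One block holding one record: string "Ford" followed by int 1.
schema = {"type": "record", "name": "cars", "fields": [
    {"name": "Make", "type": "string"}, {"name": "ID", "type": "int"}]}
payload = io.BytesIO()
write_bytes(payload, b"Ford")
write_long(payload, 1)

buf = io.BytesIO()
write_container(buf, schema, [(1, payload.getvalue())])
buf.seek(0)
read_schema, blocks = read_container(buf)
```

Because each block states its own size, a reader can skip a block without decoding it, and a sync marker that does not match the one in the header signals corruption.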

Features:

  • The schema is written in JSON (it can also be authored in Avro IDL, which is translated to JSON), which keeps it human-readable, while the actual data is stored in a compact binary encoding.
  • Avro supports schema evolution, which suits ETL-style workloads where schemas change over time.
  • Avro files are splittable, even in compressed form, so they can be processed in parallel by distributed frameworks.
  • Libraries exist for multiple languages, including Java, C, C++, C#, and Python.
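To illustrate why the sync marker makes a file splittable, here is a minimal sketch. It assumes a simplified layout in which opaque block payloads are each followed by the sync marker (real blocks also carry a count and a size); `first_block_boundary` is a hypothetical helper, not part of any Avro library.

```python
import os

sync = os.urandom(16)  # stands in for the file's 16-byte sync marker
payloads = [b"block-0 payload", b"block-1 payload", b"block-2 payload"]
body = b"".join(p + sync for p in payloads)  # each block ends with the marker


def first_block_boundary(data, sync, offset):
    """Return the position just past the first sync marker found at or
    after `offset`, i.e. where the next block starts, or None."""
    pos = data.find(sync, offset)
    return None if pos < 0 else pos + len(sync)


# A worker handed an arbitrary byte offset (here 5, in the middle of
# block 0) scans forward to the marker and starts cleanly at block 1.
start = first_block_boundary(body, sync, 5)
next_block = body[start:start + len(payloads[1])]
```

Since compression is applied per block, the same scan works on compressed files: a worker never needs the bytes before its starting boundary.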

One of Avro’s key benefits is robust support for schema changes via a mechanism called schema evolution. This feature gives Avro the ability to gracefully handle missing, new, and altered fields, making Avro data both backward and forward compatible.

When Avro decodes data, it can use two different schemas. One is the writer's schema, which comes with the encoded data and reflects what the publisher encoded; the other is the reader's schema, which indicates what the consumer expects. Avro works out the differences so the reader has a usable result.
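The idea of resolving a writer's schema against a reader's schema can be modeled in a few lines. This is a deliberately simplified sketch, not Avro's actual resolution algorithm: it operates on an already-decoded record, matches fields by name only, and ignores aliases and type promotion. The `resolve` function and the `Color` field are assumptions made for this example.

```python
def resolve(record, writer_schema, reader_schema):
    """Project a record decoded with the writer's schema onto the reader's."""
    written = {f["name"] for f in writer_schema["fields"]}
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written:
            result[name] = record[name]       # field known to both sides
        elif "default" in field:
            result[name] = field["default"]   # new field: use reader default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return result                             # writer-only fields are dropped


writer_schema = {"type": "record", "name": "cars", "fields": [
    {"name": "Make", "type": "string"},
    {"name": "ID", "type": "int"}]}

# The reader no longer cares about ID and expects a new Color field.
reader_schema = {"type": "record", "name": "cars", "fields": [
    {"name": "Make", "type": "string"},
    {"name": "Color", "type": "string", "default": "unknown"}]}

resolved = resolve({"Make": "Ford", "ID": 7}, writer_schema, reader_schema)
print(resolved)  # {'Make': 'Ford', 'Color': 'unknown'}
```

The sketch also shows why defaults matter for evolution: a field added to the reader's schema without a default cannot be filled in for older data.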

Avro supports primitive types (null, boolean, int, long, float, double, bytes, and string) and complex types (records, enums, arrays, maps, unions, and fixed), which can be nested.

The two primary uses for Avro are data serialization and remote procedure calls.

Here is an example schema file for the Avro format. Notice how we have two “columns” of data called Make and ID in the awesome_startup namespace:

{
  "type" : "record",
  "namespace" : "awesome_startup",
  "name" : "cars",
  "fields" : [
     { "name" : "Make" , "type" : "string" },
     { "name" : "ID" , "type" : "int" }
  ]
}
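To show what a record for this schema looks like on the wire, the following sketch encodes and decodes one by hand. It is an illustration only: `encode` and `decode` are hypothetical helpers covering just the `string`, `int`, and `long` types, and the sample record is invented. Strings are written as a length followed by UTF-8 bytes, and integers use Avro's variable-length zigzag encoding.

```python
import json


def zigzag_varint(n):
    """Avro int/long: zigzag, then 7 bits per byte, low bits first."""
    n = (n << 1) ^ (n >> 63)
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def read_varint(data, pos):
    shift = result = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return (result >> 1) ^ -(result & 1), pos
        shift += 7


def encode(schema, record):
    out = bytearray()
    for field in schema["fields"]:  # fields are written in schema order
        value = record[field["name"]]
        if field["type"] == "string":
            raw = value.encode("utf-8")
            out += zigzag_varint(len(raw)) + raw
        elif field["type"] in ("int", "long"):
            out += zigzag_varint(value)
        else:
            raise NotImplementedError(field["type"])
    return bytes(out)


def decode(schema, data):
    record, pos = {}, 0
    for field in schema["fields"]:  # the schema tells us what comes next
        if field["type"] == "string":
            length, pos = read_varint(data, pos)
            record[field["name"]] = data[pos:pos + length].decode("utf-8")
            pos += length
        elif field["type"] in ("int", "long"):
            record[field["name"]], pos = read_varint(data, pos)
        else:
            raise NotImplementedError(field["type"])
    return record


schema = {"type": "record", "namespace": "awesome_startup", "name": "cars",
          "fields": [{"name": "Make", "type": "string"},
                     {"name": "ID", "type": "int"}]}
car = {"Make": "Toyota", "ID": 42}

encoded = encode(schema, car)
print(encoded)                             # b'\x0cToyotaT'
print(len(encoded), len(json.dumps(car)))  # 8 28
```

The encoded bytes contain no field names or type tags, which is why they are so much smaller than the JSON text and also why they cannot be decoded without the schema.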

Avro is a popular open format often found in data lakes. Its binary encoding makes data small and compact compared to text formats such as JSON. Avro handles both structured and semi-structured data. The schema, written in JSON, defines the data types and structure of your data; it is stored in the header of a data file and is often also kept as a separate schema file. With this schema, you can evolve your data's structure by making changes while keeping backward compatibility.

Avro is designed for row storage, that is, to be accessed row by row. Row storage is ideal for cases when you look up a row and read the whole row.

Disadvantages

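What happens when the schema is lost? Typically, the data is unusable, because the binary encoding carries no field names or type tags. This mainly affects data written without an embedded schema; a data file keeps its schema in the header. Row storage is perfect for CRUD-style workflows, but many data-intensive workflows read a whole column at a time, which can be costly in Avro.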

A larger example schema, using logical types (decimal and date), doc strings, and default values. The default of a bytes-backed decimal is a string whose characters stand for byte values, so zero is written as "\u0000":

{
  "type": "record",
  "name": "CollectingFeeApproaching",
  "namespace": "events.deadline.notification.v1",
  "fields": [
    {"name": "currency", "type": "string", "doc": "Calculation currency. ISO 4217",
      "default": "PLN"
    },
    {"name": "amount", "type": {"type": "bytes", "logicalType": "decimal","precision": 15,"scale": 2}, "doc": "Fee amount",
      "default": "\u0000"
    },
    {"name": "missingTransactionAmount", "type": {"type": "bytes", "logicalType": "decimal","precision": 15,"scale": 2}, "doc": "Missing transaction value",
      "default": "\u0000"
    },
    {"name": "missingTransactions","type": "int", "doc": "Missing number of transactions",
      "default": 0
    },
    {"name": "chargeCycleEndDate", "type": {"type": "int", "logicalType": "date"}, "doc": "End date of the fee cycle"},
    {"name": "withMobileTransactions","type": "boolean", "doc": "Mobile transaction flag indicating whether the card's transaction activity associated with the fee includes mobile transactions",
      "default": false
    },
    {"name": "pid","type": "string", "doc": "Customer identifier"},
    {"name": "cardNumber","type": "string", "doc": "Card number for which the notification was generated"}
  ]
}
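The two logical types used in this schema map onto plain Avro types: a `decimal` over `bytes` holds the unscaled integer as big-endian two's-complement bytes, and a `date` is an `int` counting days since 1970-01-01. The sketch below shows the conversions with the standard library only; the helper names are made up for this example, and the byte strings it produces are valid but not guaranteed to be the shortest possible.

```python
import datetime
from decimal import Decimal

EPOCH = datetime.date(1970, 1, 1)


def decimal_to_bytes(value, scale):
    """Encode a Decimal as the two's-complement bytes of its unscaled value."""
    scaled = value.scaleb(scale)  # 12.34 with scale 2 becomes 1234
    if scaled != scaled.to_integral_value():
        raise ValueError("value has more fractional digits than the scale allows")
    unscaled = int(scaled)
    length = (unscaled.bit_length() + 8) // 8  # leaves room for the sign bit
    return unscaled.to_bytes(length, "big", signed=True)


def bytes_to_decimal(data, scale):
    return Decimal(int.from_bytes(data, "big", signed=True)).scaleb(-scale)


def date_to_int(day):
    return (day - EPOCH).days


def int_to_date(days):
    return EPOCH + datetime.timedelta(days=days)


amount = decimal_to_bytes(Decimal("12.34"), scale=2)
print(amount)                             # b'\x04\xd2'
print(bytes_to_decimal(amount, scale=2))  # 12.34
print(date_to_int(datetime.date(2024, 1, 1)))  # 19723

# A JSON default for a bytes field maps each character to one byte, so
# "\u0000" is the unscaled value 0, whereas "0" is byte 0x30 (48).
zero_default = "\u0000".encode("latin-1")
digit_default = "0".encode("latin-1")
print(bytes_to_decimal(zero_default, scale=2))   # 0.00
print(bytes_to_decimal(digit_default, scale=2))  # 0.48
```

The last two lines are worth noting when writing defaults for decimal fields: the character "0" does not mean zero.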