Protobuf

Protocol Buffers (protobuf) is an Interface Description Language (IDL). protoc is the protobuf source-to-source compiler: it takes a .proto file and generates code in the specified language.

The schema itself is not serialized; only metadata (field numbers) and types are written alongside the values.

Schema example:

syntax = "proto3";

message Book {
  string title = 1;
  string subtitle = 2;
  uint32 year = 3;
  repeated string authors = 4;
  string isbn = 5;
}
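To make the point about field numbers and types concrete, here is a minimal Python sketch of how a single string field could be laid out on the wire. It is a hand-rolled illustration, not code generated by protoc; the helper names encode_varint and encode_string_field are made up for this example.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Encode one string field as key, payload length, UTF-8 payload."""
    wire_type = 2  # length-delimited
    key = (field_number << 3) | wire_type
    payload = text.encode("utf-8")
    return encode_varint(key) + encode_varint(len(payload)) + payload


# Field 1 of Book is "title"; only the number 1 and the wire type are written.
encoded_title = encode_string_field(1, "Dune")
print(encoded_title.hex())  # 0a0444756e65
```

The field name "title" does not appear anywhere in the output, which is why a reader needs the schema to interpret the bytes.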

Message structure

A message is a collection of fields and can be compared to an object or struct.

Message definition:

syntax = "proto3";

message Book {
  string title = 1;
  string subtitle = 2;
  uint32 year = 3;
  repeated string authors = 4;
  string isbn = 5;
}

A field consists of a type, a name, and a tag (unique identifier), e.g. string title = 1.

Types include integers (int32, sint64, etc.), floating point (float, double), boolean (bool), and length-delimited types (string, bytes).

The name is not serialized and, thus, isn't important in the data itself, but it is used in the language of choice to access a particular field.

The smallest tag is 1 and the largest is 536,870,911; tags 19000 to 19999 are reserved.
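The limits above can be expressed as a small check. This is a sketch only; is_usable_field_number is a hypothetical helper, not part of any protobuf library. The upper bound equals 2^29 - 1, which fits the layout where the field number shares a 32-bit key with a 3-bit wire type.

```python
MAX_FIELD_NUMBER = 2**29 - 1          # 536,870,911
RESERVED_RANGE = range(19000, 20000)  # 19000 to 19999 inclusive


def is_usable_field_number(number: int) -> bool:
    """Return True if the number may be used as a field tag in a .proto file."""
    in_bounds = 1 <= number <= MAX_FIELD_NUMBER
    return in_bounds and number not in RESERVED_RANGE


for candidate in (0, 1, 18999, 19000, 19999, 20000, MAX_FIELD_NUMBER):
    print(candidate, is_usable_field_number(candidate))
```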

Every field is optional (at least in proto3), which means every field type has a default value: 0 for integers and floats, false for booleans, and an empty string or empty byte array for length-delimited types. For a plain scalar field there is no way to distinguish an unset field (default applied) from a field explicitly set to a value equal to the default. Thus, do not attach business logic to default values; where the distinction matters, implement nullable types.
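The ambiguity comes from the encoder skipping fields that hold their default value. The sketch below models that behaviour for a plain uint32 field in Python; encode_uint32_field and encode_year are illustrative names, and None is used here only as a stand-in for "never set".

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_uint32_field(field_number: int, value: int) -> bytes:
    """Proto3-style encoding of a plain uint32 field: the default (0) is omitted."""
    if value == 0:
        return b""
    key = (field_number << 3) | 0  # wire type 0: varint
    return encode_varint(key) + encode_varint(value)


def encode_year(year=None) -> bytes:
    """Encode Book.year (field 3); None models a field that was never set."""
    return encode_uint32_field(3, 0 if year is None else year)


unset = encode_year()
zero = encode_year(0)
real = encode_year(1965)
print(unset == zero, real.hex())  # True 18ad0f
```

Because the unset and zero cases produce identical bytes, a reader of the message cannot tell them apart.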

Data types

Enumeration

Enums are used for cases with an exhaustive list of states. In proto3 the first value must be 0; it is often named UNSPECIFIED and acts as the default for the enumeration.

enum FileType {
  UNSPECIFIED = 0;
  MP3 = 1;
  MP4 = 2;
  JPEG = 3;
}
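A rough Python mirror of the enum above shows how the zero value acts as the fallback. This is a hand-written sketch, not the class protoc would generate, and file_type_from_wire is a hypothetical helper.

```python
from enum import IntEnum


class FileType(IntEnum):
    UNSPECIFIED = 0
    MP3 = 1
    MP4 = 2
    JPEG = 3


def file_type_from_wire(raw=None) -> FileType:
    """Map a decoded integer to FileType; None models an absent field.

    Values outside the enum raise ValueError in this sketch; real protobuf
    runtimes have their own handling of unknown values.
    """
    return FileType(0 if raw is None else raw)


print(file_type_from_wire().name)   # UNSPECIFIED
print(file_type_from_wire(2).name)  # MP4
```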

List

There are cases where a field can have multiple values, 0 or more. The repeated keyword marks the field as a list of the given type. Deserializing such a field yields an iterable. The default value is an empty list.

message User {
  repeated string middle_name = 1;
}
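On the wire, a repeated string field is simply the same field number written once per element. The Python sketch below illustrates this for User.middle_name; the helper names are made up, and the values "Ann" and "Lee" are sample data.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_repeated_strings(field_number: int, values) -> bytes:
    """Write one key/length/payload record per element, in list order."""
    key = encode_varint((field_number << 3) | 2)  # wire type 2: length-delimited
    out = bytearray()
    for value in values:
        payload = value.encode("utf-8")
        out += key + encode_varint(len(payload)) + payload
    return bytes(out)


names = encode_repeated_strings(1, ["Ann", "Lee"])
empty = encode_repeated_strings(1, [])
print(names.hex())  # 0a03416e6e0a034c6565
print(empty)        # b''
```

An empty list produces no bytes at all, which matches the empty-list default.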

Maps

Messages, enums, and other types can be used as values. Only simple types can be used as keys, e.g. strings and integers, but not floats or bytes. The repeated keyword cannot be used with maps. The default value is an empty map.

message PhoneBook {
  map<string, string> contacts = 1;
}
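A map field behaves on the wire like a repeated nested entry message whose key is field 1 and whose value is field 2. The Python sketch below encodes one entry of PhoneBook.contacts under that assumption; encode_map_entry is an illustrative name and "bob"/"555" are sample data.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Encode one string field as key, payload length, UTF-8 payload."""
    payload = text.encode("utf-8")
    key = (field_number << 3) | 2  # wire type 2: length-delimited
    return encode_varint(key) + encode_varint(len(payload)) + payload


def encode_map_entry(field_number: int, key: str, value: str) -> bytes:
    """Wrap key (field 1) and value (field 2) in a length-delimited entry."""
    entry = encode_string_field(1, key) + encode_string_field(2, value)
    outer_key = encode_varint((field_number << 3) | 2)
    return outer_key + encode_varint(len(entry)) + entry


contact = encode_map_entry(1, "bob", "555")
print(contact.hex())  # 0a0a0a03626f621203353535
```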

OneOf

oneof is useful for mutually exclusive data where each option may also hold additional information. A choice between plain boolean options can simply be replaced with an enum; oneof options, however, can be of different and even complex types. By default no option is set.

message Cat {}
message Dog {}

message CatOrDog {
  oneof result {
    Cat cat = 1;
    Dog dog = 2;
  }
}

  • Can only be defined inside messages.
  • Maps or lists cannot be used inside oneof.
  • oneof itself cannot be repeated.
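The mutually exclusive behaviour can be modelled in a few lines of Python. This is a toy model, not the class protoc generates; the method names set_result, which_result, and get are invented for the illustration.

```python
class CatOrDog:
    """Toy model of a message with `oneof result { Cat cat = 1; Dog dog = 2; }`."""

    _RESULT_FIELDS = ("cat", "dog")

    def __init__(self):
        self._which = None   # name of the option currently set, if any
        self._value = None

    def set_result(self, field: str, value) -> None:
        """Set one option; whatever was set before is discarded."""
        if field not in self._RESULT_FIELDS:
            raise ValueError(f"unknown oneof option: {field}")
        self._which, self._value = field, value

    def which_result(self):
        """Return the name of the option that is set, or None."""
        return self._which

    def get(self, field: str):
        """Return the value if this option is the one set, else None."""
        return self._value if field == self._which else None


pet = CatOrDog()
print(pet.which_result())  # None
pet.set_result("cat", {"name": "Tom"})
pet.set_result("dog", {"name": "Rex"})
print(pet.which_result(), pet.get("cat"))  # dog None
```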

Code structure

Import

Messages can be organized into multiple files. To access a message from another file simply import it. Best practice is to list import statements in alphabetical order.

import "filename.proto";

Package

Packages group messages/files together and give a broader context. In the example below, a is the parent package and b is a child package (more children can be defined as well). Generally the package is declared before import statements.

package a.b;

A message defined in a package is scoped to that package. For example, another message can refer to it by its short name if it is defined in the same package.

file.proto:

package example.fs;

message File { /*...*/ }

folder.proto:

package example.fs;

import "file.proto";

message Folder {
  repeated File files = 1;
}

To use a message from a different package, its fully qualified name must be used. If both packages share a leading part of the package name, the matching part can be dropped.

package example.fs;

import "google/protobuf/timestamp.proto";

message File {
  google.protobuf.Timestamp created_at = 1;
}
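The rule about dropping the shared part of a package name can be pictured as a search from the innermost scope outward. The Python sketch below is a simplified model of that idea, not protoc's actual resolver; resolve_type and the example.net.Host type are hypothetical.

```python
def resolve_type(reference: str, current_package: str, known_types):
    """Try the reference in each enclosing scope, innermost first."""
    scope = current_package.split(".") if current_package else []
    while True:
        candidate = ".".join(scope + [reference])
        if candidate in known_types:
            return candidate
        if not scope:
            return None  # not found in any scope
        scope.pop()      # widen the search by one package level


known = {
    "example.fs.File",
    "example.net.Host",           # hypothetical sibling package
    "google.protobuf.Timestamp",
}

print(resolve_type("File", "example.fs", known))
print(resolve_type("net.Host", "example.fs", known))
print(resolve_type("google.protobuf.Timestamp", "example.fs", known))
```

From inside example.fs, the short name File and the partial name net.Host both resolve, while a type from an unrelated package needs its full name.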

Nested message

Nested types (messages or enums) are scoped to the parent message and are only relevant in its context.

message Cat {
  enum Breed {
    UNSPECIFIED = 0;
    BENGAL = 1;
    BURMESE = 2;
  }

  Breed breed = 1;
}

message Dog {
  enum Breed {
    UNSPECIFIED = 0;
    DALMATIAN = 1;
    DOBERMANN = 2;
    //…
  }

  Breed breed = 1;
}

Nested types can still be referenced through the parent's name, e.g. Cat.Breed; thus, nested type names do not need to be additionally prefixed with the parent name.

Compiler

protoc is the protobuf compiler. Not all languages are supported out of the box; for unsupported languages a plugin may be installed.

Popular CLI options:

# --$(language)_out
# Specify the language to generate code in. Accepts the directory in which to
# place the generated code.
$ protoc --python_out=. example.proto

# -I or --proto_path
# Specify a directory in which to search for imports. Can be specified multiple
# times, in which case directories are searched in order.
$ protoc -I. -Iimports --python_out=. example.proto
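When protoc is driven from a build script, the same options can be assembled programmatically. The sketch below only builds the argument list and prints it; build_protoc_command is a hypothetical helper, and actually running the command (for example with subprocess.run) assumes protoc is installed and on the PATH.

```python
import shlex


def build_protoc_command(proto_files, python_out=".", include_dirs=()):
    """Assemble a protoc invocation that generates Python code."""
    command = ["protoc"]
    for directory in include_dirs:
        command.append(f"-I{directory}")          # searched in the order given
    command.append(f"--python_out={python_out}")  # output directory
    command.extend(proto_files)
    return command


command = build_protoc_command(["example.proto"], include_dirs=[".", "imports"])
print(shlex.join(command))  # protoc -I. -Iimports --python_out=. example.proto
```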

Encode and decode data

The --encode parameter takes the message type used to serialize the input, which is passed via stdin in text format. If the message is defined in a package, a fully qualified name should be specified. Produces binary output.

cat course.txt | protoc --encode=Course course.proto > course.bin

The --decode option is the reverse of --encode: it takes binary input and produces text output. The order of map entries is not guaranteed, while elements of a repeated field keep their order.

cat course.bin | protoc --decode=Course course.proto > course-decoded.txt

--decode_raw is similar to --decode, but doesn't require a message type. It produces text output with tags in place of field names. It can be used to reverse engineer the message type that was used to produce the data.
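The idea behind schema-less decoding can be sketched in Python: walk the bytes, split each key into a tag and a wire type, and read the value accordingly. This is an illustration of the principle, not protoc's implementation; read_varint and decode_raw are made-up names and the sample bytes are hand-built.

```python
def read_varint(data: bytes, pos: int):
    """Read a base-128 varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_raw(data: bytes):
    """Return a list of (tag, wire type, value) without knowing the schema."""
    fields = []
    pos = 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        tag, wire_type = key >> 3, key & 0x07
        if wire_type == 0:    # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 1:  # fixed 64-bit, kept as raw bytes
            value, pos = data[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited: string, bytes, or nested message
            length, pos = read_varint(data, pos)
            value, pos = data[pos:pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit, kept as raw bytes
            value, pos = data[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields.append((tag, wire_type, value))
    return fields


sample = bytes.fromhex("0a0444756e6518ad0f")  # tag 1 = "Dune", tag 3 = 1965
decoded = decode_raw(sample)
print(decoded)  # [(1, 2, b'Dune'), (3, 0, 1965)]
```

Only tags and wire types are recoverable this way; whether a length-delimited value is a string, raw bytes, or a nested message has to be inferred by the reader.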
