Protobuf - kamialie/knowledge_corner GitHub Wiki
Protobuf
Protocol Buffers (protobuf) is an Interface Description Language (IDL). protoc is the protobuf source-to-source compiler: it takes a .proto file and generates code in the specified language.
The schema itself is not serialized; only field numbers (tags), wire types, and values are.
Schema example:
syntax = "proto3";
message Book {
string title = 1;
string subtitle = 2;
uint32 year = 3;
repeated string authors = 4;
string isbn = 5;
}
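To make the point about serialization concrete, here is a minimal pure-Python sketch of how two Book fields could be laid out on the wire. The helper names (encode_varint, encode_key, and so on) and the sample values "Dune" and 1965 are assumptions made for illustration; real code would use the classes generated by protoc.

```python
WIRE_VARINT = 0  # int32, uint32, bool, enum, ...
WIRE_LEN = 2     # string, bytes, nested messages


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_key(field_number: int, wire_type: int) -> bytes:
    """A field key is the tag shifted left by 3, OR-ed with the wire type."""
    return encode_varint((field_number << 3) | wire_type)


def encode_string_field(field_number: int, text: str) -> bytes:
    data = text.encode("utf-8")
    return encode_key(field_number, WIRE_LEN) + encode_varint(len(data)) + data


def encode_uint_field(field_number: int, value: int) -> bytes:
    return encode_key(field_number, WIRE_VARINT) + encode_varint(value)


# Book { title = "Dune" (tag 1), year = 1965 (tag 3) }; sample values only.
payload = encode_string_field(1, "Dune") + encode_uint_field(3, 1965)
print(payload.hex())  # 0a0444756e6518ad0f
```

The field names title and year never appear in the payload; only the tags 1 and 3, the wire types, and the values do.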
Message structure
A message is a collection of fields, comparable to an object or a struct.
Message definition:
syntax = "proto3";
message Book {
string title = 1;
string subtitle = 2;
uint32 year = 3;
repeated string authors = 4;
string isbn = 5;
}
A field consists of a type, a name, and a tag (a unique identifier), e.g. string title = 1.
Types include integers (int32, sint64, etc.), floating-point numbers (float, double), booleans (bool), and length-delimited types (string, bytes).
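The difference between the plain and the signed (sint) integer types shows up in the encoding. The following is a sketch of ZigZag encoding, which the sint types use so that small negative numbers stay small on the wire. The function names are assumptions for illustration.

```python
def zigzag_encode(n: int, bits: int = 64) -> int:
    """Map signed to unsigned: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ..."""
    return ((n << 1) ^ (n >> (bits - 1))) & ((1 << bits) - 1)


def zigzag_decode(z: int) -> int:
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


def varint_size(value: int) -> int:
    """Number of bytes a non-negative integer occupies as a varint."""
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size


# int32/int64 encode -1 as a 64-bit two's complement value.
plain_minus_one = -1 & 0xFFFFFFFFFFFFFFFF
print(varint_size(plain_minus_one))    # 10 bytes
print(varint_size(zigzag_encode(-1)))  # 1 byte
```

This is why sint32 or sint64 tends to be the better choice for fields that frequently hold negative values.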
The name is not serialized, so it does not matter to the binary data itself; it is used in the generated code of the chosen language to access a particular field.
The smallest tag is 1 and the largest is 536,870,911 (2^29 - 1). Tags 19000 to 19999 are reserved for the protobuf implementation and cannot be used.
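The tag rules can be expressed as a small check. This is only a sketch: is_valid_tag and key_size are hypothetical helpers, not part of any protobuf library. key_size also illustrates a detail not stated above: the tag shares its varint with a 3-bit wire type, so tags 1 to 15 fit in a single byte.

```python
MIN_TAG = 1
MAX_TAG = 2**29 - 1             # 536,870,911
RESERVED = range(19000, 20000)  # 19000..19999 inclusive


def is_valid_tag(tag: int) -> bool:
    """True if the tag may be used for a field in a .proto file."""
    return MIN_TAG <= tag <= MAX_TAG and tag not in RESERVED


def key_size(tag: int) -> int:
    """Bytes needed for the field key (tag plus 3 wire-type bits) as a varint."""
    key = (tag << 3) | 0x07  # worst case: all wire-type bits set
    size = 1
    while key >= 0x80:
        key >>= 7
        size += 1
    return size


print(is_valid_tag(19500))         # False
print(key_size(15), key_size(16))  # 1 2
```

Under this reading it makes sense to give the most frequently set fields the tags 1 to 15.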
Every field is optional (at least in proto3), which means every field type has a default value: 0 for integers and floats, false for booleans, and an empty string or empty byte sequence for length-delimited types. For a plain proto3 field (one declared without the optional keyword) there is no way to distinguish an unset field, where the default is applied, from a field explicitly set to the default value. Thus, do not attach business logic to default values, and/or implement nullable types where absence matters.
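The ambiguity can be sketched in a few lines. The code below assumes a uint32 year = 3 field without presence tracking; encode_year and decode_year are hypothetical helpers that mimic how such a field is skipped when it holds the default.

```python
def encode_varint(value: int) -> bytes:
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0):
    """Return (value, next_position)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_year(year):
    """year is an int, or None when the caller never set it."""
    if not year:  # None (unset) and 0 (default) are both skipped
        return b""
    return encode_varint((3 << 3) | 0) + encode_varint(year)


def decode_year(payload: bytes) -> int:
    if not payload:
        return 0  # field absent: the default is applied
    _key, pos = decode_varint(payload)
    value, _ = decode_varint(payload, pos)
    return value


print(encode_year(None) == encode_year(0))  # True
print(decode_year(encode_year(None)))       # 0
```

Both "never set" and "set to 0" produce the same empty payload, so the receiver cannot tell them apart.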
Data types
Enumeration
Enums are used for cases with an exhaustive list of states. In proto3 the first value must have tag 0; it is often named UNSPECIFIED and acts as the default for the enumeration.
enum FileType {
UNSPECIFIED = 0;
MP3 = 1;
MP4 = 2;
JPEG = 3;
}
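A rough Python analogue of FileType shows why reserving 0 for UNSPECIFIED is useful. file_type_from_wire is a hypothetical helper, not generated code: it stands in for a decoder that applies the default when the field is absent.

```python
from enum import IntEnum


class FileType(IntEnum):
    UNSPECIFIED = 0
    MP3 = 1
    MP4 = 2
    JPEG = 3


def file_type_from_wire(raw=None) -> FileType:
    """raw is the decoded integer, or None when the field was absent."""
    return FileType(0 if raw is None else raw)


print(file_type_from_wire().name)   # UNSPECIFIED
print(file_type_from_wire(2).name)  # MP4
```

If MP3 had tag 0 instead, every message that omits the field would silently claim to be an MP3.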
List
Some fields can hold multiple values, zero or more. The repeated keyword marks a field as a list of the given type. Deserializing such a field yields an iterable. The default value is an empty list.
message User {
repeated string middle_name = 1;
}
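On the wire, a repeated string field is simply the same key emitted once per element. The sketch below assumes the middle_name field above (tag 1); encode_repeated_string is a hypothetical helper and the names "Ann" and "Lee" are sample data.

```python
def encode_varint(value: int) -> bytes:
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_repeated_string(field_number: int, values) -> bytes:
    """Emit one length-delimited record per element, all with the same key."""
    out = bytearray()
    for text in values:
        data = text.encode("utf-8")
        out += encode_varint((field_number << 3) | 2)  # wire type 2: length-delimited
        out += encode_varint(len(data))
        out += data
    return bytes(out)


print(encode_repeated_string(1, ["Ann", "Lee"]).hex())  # 0a03416e6e0a034c6565
print(encode_repeated_string(1, []))                    # b''
```

An empty list produces no bytes at all, which is why the default for a repeated field is the empty list.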
Maps
Messages, enums, and other types can be used as values. Only simple types can be used as keys, e.g. strings and integers, but not floats or bytes. A map field cannot be repeated. The default value is an empty map.
message PhoneBook {
map<string, string> contacts = 1;
}
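A map field is encoded like a repeated nested message whose entries carry the key in field 1 and the value in field 2. The sketch below applies that to the contacts field; the helper names and the sample entry are assumptions for illustration.

```python
def encode_varint(value: int) -> bytes:
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_len_field(field_number: int, data: bytes) -> bytes:
    """Key with wire type 2, then the length, then the raw bytes."""
    return encode_varint((field_number << 3) | 2) + encode_varint(len(data)) + data


def encode_contacts(contacts: dict) -> bytes:
    """Encode map<string, string> contacts = 1 as repeated key/value entries."""
    out = bytearray()
    for name, phone in contacts.items():
        entry = encode_len_field(1, name.encode("utf-8"))    # key   = 1
        entry += encode_len_field(2, phone.encode("utf-8"))  # value = 2
        out += encode_len_field(1, entry)                    # contacts = 1
    return bytes(out)


print(encode_contacts({"a": "1"}).hex())  # 0a060a0161120131
print(encode_contacts({}))                # b''
```

Because each entry is already one element of a repeated structure, this also suggests why a map cannot itself be marked repeated.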
OneOf
oneof is useful for mutually exclusive data that may also carry additional information. A set of boolean options can simply be replaced with an enum; oneof options, however, can be of different and even complex types. The default is that no option is set.
message Cat {}
message Dog {}
message CatOrDog {
oneof result {
Cat cat = 1;
Dog dog = 2;
}
}
- Can only be defined inside messages.
- Maps or lists cannot be used inside oneof.
- oneof itself cannot be repeated.
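The mutually exclusive behaviour can be sketched with a small Python class. This is not the API that protoc generates; CatOrDog, set_cat, set_dog and which_result here are hypothetical names chosen to mirror the schema above.

```python
class Cat:
    pass


class Dog:
    pass


class CatOrDog:
    """Holds at most one of cat or dog, like the oneof named result."""

    def __init__(self):
        self._case = None   # name of the option currently set, or None
        self._value = None

    def set_cat(self, cat: Cat) -> None:
        self._case, self._value = "cat", cat  # replaces any dog

    def set_dog(self, dog: Dog) -> None:
        self._case, self._value = "dog", dog  # replaces any cat

    @property
    def cat(self):
        return self._value if self._case == "cat" else None

    @property
    def dog(self):
        return self._value if self._case == "dog" else None

    def which_result(self):
        """Return "cat", "dog", or None when nothing is set."""
        return self._case


pet = CatOrDog()
print(pet.which_result())  # None
pet.set_cat(Cat())
pet.set_dog(Dog())         # setting dog clears cat
print(pet.which_result())  # dog
print(pet.cat)             # None
```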
Code structure
Import
Messages can be organized into multiple files. To access a message from another file simply import it. Best practice is to list import statements in alphabetical order.
import "filename.proto";
Package
Packages group messages/files together and give a broader context. In the example below a is a parent package and b is a child package (more children can be defined as well). Generally a package is defined before import statements.
package a.b;
A message defined in a package is scoped to that package. Another message can refer to it by its short name if it is defined in the same package.
file.proto:
package example.fs;
message File { /*...*/ }
folder.proto:
package example.fs;
import "file.proto";
message Folder {
repeated File files = 1;
}
To use a message from a different package a fully qualified name must be used. If both messages share part of the package name, the matching part can be dropped.
package example.fs;
import "google/protobuf/timestamp.proto";
message File {
google.protobuf.Timestamp created_at = 1;
}
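The rule that a shared package prefix can be dropped can be approximated by searching from the innermost scope outward. The resolver below is a simplified sketch, not protoc's actual algorithm, and the example.net package is a hypothetical caller added for illustration.

```python
from typing import Optional, Set


def resolve(reference: str, scope: str, known: Set[str]) -> Optional[str]:
    """Try the reference inside the current scope, then inside each parent scope."""
    parts = scope.split(".") if scope else []
    for depth in range(len(parts), -1, -1):
        candidate = ".".join(parts[:depth] + [reference])
        if candidate in known:
            return candidate
    return None


known_types = {
    "example.fs.File",
    "example.fs.Folder",
    "google.protobuf.Timestamp",
}

# Same package: the short name is enough.
print(resolve("File", "example.fs", known_types))      # example.fs.File
# Shared prefix "example": only the differing part is needed.
print(resolve("fs.File", "example.net", known_types))  # example.fs.File
# Unrelated package: the fully qualified name is required.
print(resolve("google.protobuf.Timestamp", "example.fs", known_types))
print(resolve("Timestamp", "example.fs", known_types))  # None
```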
Nested message
Types can be nested inside a message. Nested types (messages or enums) are only relevant in the context of the parent message.
message Cat {
enum Breed {
UNSPECIFIED = 0;
BENGAL = 1;
BURMESE = 2;
}
Breed breed = 1;
}
message Dog {
enum Breed {
UNSPECIFIED = 0;
DALMATIAN = 1;
DOBERMANN = 2;
//…
}
Breed breed = 1;
}
Nested types can still be referenced from outside via their fully qualified name, e.g. Cat.Breed; thus, nested type names do not need to be additionally prefixed with the parent name.
Compiler
protoc is the protobuf compiler. Not all languages are supported out of the box; for unsupported languages a plugin may be installed.
Popular CLI options:
# --$(language)_out
# Selects the language to generate code in. Takes the directory in which to
# place the generated code (relative paths are resolved from the current directory).
$ protoc --python_out=. example.proto
# -I or --proto_path
# Specifies a directory to search for imports. Can be specified multiple times,
# in which case directories are searched in order.
$ protoc -I. -Iimports --python_out=. example.proto
Encode and decode data
--encode takes a message type and uses it to serialize a text-format message read from stdin. If the message is defined in a package, the fully qualified name must be specified. Produces binary output.
cat course.txt | protoc --encode=Course course.proto > course.bin
--decode does the same as --encode, but in reverse: it takes binary input and produces text output. The order of map entries is not guaranteed; elements of a repeated field keep their relative order.
cat course.bin | protoc --decode=Course course.proto > course-decoded.txt
--decode_raw is similar to --decode, but doesn't require a message type. Produces text output with tags instead of field names. Can be used to reverse engineer the message type that was used to produce the data.
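What a schema-less decode can recover is sketched below: a minimal pure-Python reader that walks the bytes and reports tag, wire type, and raw value. It is an illustration of the idea behind --decode_raw, not a reimplementation of protoc; decode_raw and decode_varint are hypothetical helpers, and the sample bytes are an assumed message with a string in tag 1 and an integer in tag 3.

```python
def decode_varint(data: bytes, pos: int = 0):
    """Return (value, next_position)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_raw(data: bytes):
    """Return (tag, wire_type, value) triples without knowing the schema."""
    fields = []
    pos = 0
    while pos < len(data):
        key, pos = decode_varint(data, pos)
        tag, wire_type = key >> 3, key & 0x07
        if wire_type == 0:    # varint
            value, pos = decode_varint(data, pos)
        elif wire_type == 1:  # fixed 64-bit, left as raw bytes
            value, pos = data[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited: string, bytes, or nested message
            length, pos = decode_varint(data, pos)
            value, pos = data[pos:pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit, left as raw bytes
            value, pos = data[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields.append((tag, wire_type, value))
    return fields


sample = bytes.fromhex("0a0444756e6518ad0f")
for tag, wire_type, value in decode_raw(sample):
    print(tag, wire_type, value)
# 1 2 b'Dune'
# 3 0 1965
```

Without the schema, a length-delimited value could be a string, raw bytes, or a nested message, and a varint could be any integer type, a bool, or an enum, so reverse engineering still involves some guessing.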