protobuf - RicoJia/notes GitHub Wiki
========================================================================
========================================================================
-
Motivation: we want to transmit class objects by serializing them into binary. Protobuf generates serializable data structures in different languages
- generated files: `.pb.h` (C++), `_pb2.py` (Python)
- challenges: the sending and receiving code must agree on exactly the same memory layout and endianness
-
Choices
- CSV: types are inferred, not guaranteed. Good for visualization, though
- JSON: JavaScript Object Notation
- pros: data can take any form, widely accepted in the web, can be read by most languages,
- cons:
- no schema, so you can put anything anywhere and JSON won't complain.
- No documentation, no metadata
- Can be quite big, due to repeated keys
- protobuf
- defined in .proto textfile
```proto
// example.proto
syntax = "proto3";

message Msg {
    int32 id = 1;
    string first_name = 2;
}
```
- Pros:
- compressed (less CPU usage)
- 3x smaller, 20x faster than XML
- schema: needed to generate code and read data
- fully typed
- can be read by all main languages
- schema can evolve over time
- code is generated automatically!
- Cons:
- support for some languages might be lacking
- can't open serialized data with a text editor, because it's compressed and serialized
- Applications: RPC frameworks, such as gRPC to exchange data.
- proto3 was released by Google in 2016
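As a rough illustration of why repeated keys make JSON big, here's a standard-library-only sketch with hypothetical records. Protobuf's actual wire format is different; this just contrasts repeated key names against a fixed binary layout:

```python
import json
import struct

# Hypothetical records: every JSON object repeats the key names.
records = [{"id": i, "score": 3.14} for i in range(100)]

json_bytes = json.dumps(records).encode()

# A schema-based binary layout stores no key names at all:
# one little-endian int32 + one float32 per record (8 bytes each).
binary_bytes = b"".join(struct.pack("<if", r["id"], r["score"]) for r in records)

print(len(json_bytes), len(binary_bytes))  # JSON is several times larger
```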
-
Data types
- float: 32 bits; double: 64 bits
- sint32, sint64 are more efficient with negative numbers than int32, int64.
- They use a technique called "zigzag": 0→0, -1→1, 1→2, -2→3, 2→4, ... so small negative numbers stay small on the wire. A negative value stored in a plain int32/int64 is sign-extended, so it always takes 10 varint bytes.
- JSON would encode `26` as two bytes representing the characters 2 and 6.
- map
- `map<string, Result> results = 2;`
- Map cannot be repeated
- there's no ordering for map
- Time
- Google.Protobuf.WellKnownTypes.Duration
- Google.Protobuf.WellKnownTypes.Timestamp
- bytes: small image
- enum: if you know all the fields in advance
- in proto2, no default is needed because required tells you the field is set
- For proto3, you need to be careful with defaults
syntax = "proto3" message Msg{ enum EyeColor{ UNKNOWN_EYE_COLOR = 0; // this is the default value of "EyeColor" GREEN = 1; BLACK = 2; // good to have all caps } EyeColor eye_color = 8; // this is the "instance of enum" }
- C++: you just call the generated accessor, e.g. `msg->eye_color()`, to read the value
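The zigzag technique described in this section can be sketched in a few lines of Python. This is illustrative only (not the official implementation): shift left, then XOR in the sign bits.

```python
def zigzag_encode(n: int, bits: int = 32) -> int:
    """Map signed to unsigned: 0->0, -1->1, 1->2, -2->3, 2->4, ..."""
    return (n << 1) ^ (n >> (bits - 1))

def zigzag_decode(z: int) -> int:
    """Inverse mapping: recover the signed value."""
    return (z >> 1) ^ -(z & 1)

# Round-trip a few values to confirm the mapping.
for n in (0, -1, 1, -2, 2, 150, -150):
    assert zigzag_decode(zigzag_encode(n)) == n
```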
-
Tags
- tags from 1 to 15 use 1 byte
- tags from 16 to 2047 use 2 bytes
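These byte counts follow from how a field key is encoded: the key is `(tag << 3) | wire_type`, itself stored as a varint, so tags 1-15 fit in one byte and 16-2047 in two. A small Python sketch (hypothetical `key_size` helper, standard varint logic):

```python
def key_size(field_number: int, wire_type: int = 0) -> int:
    """Number of bytes the varint-encoded field key takes."""
    key = (field_number << 3) | wire_type
    size = 1
    while key > 0x7F:   # each varint byte carries 7 payload bits
        key >>= 7
        size += 1
    return size

print(key_size(15), key_size(16), key_size(2047), key_size(2048))  # 1 2 2 3
```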
-
Fields:
- repeated: for an array or list; can hold any number of elements (0 included)
- (nanopb) a custom callback is needed if no max size is specified
- // starts a comment
-
Default values are defined in proto3, but not in proto2 (which has optional, required, etc.)
- bool: false
- string: empty
- enum: first value (the one with tag 0)
- repeated: empty list
-
Naming Conventions (these keep the generated code consistent):
- CamelCase for msgs
- underscore_separated_names for fields
- CamelCase for enums; CAPITAL_WITH_UNDERSCORES for value names
- Uber's proto guide
- You can have one message containing another:

```proto
message Person {
    ...
    Date birthday = 2; // Note the msg is defined underneath
}

message Date {
}
```
- Or
```proto
message Msg {
    message Address {
        string addr = 1;
    }
    repeated Address addr = 10;
}
```
- import a file:
import "dir/3-date.proto"
- "package: " is actually a namespace
//File 1 syntax = "proto3" package myData //File2 import "sub_dir/file.proto" message Person{ myData.Date date = 7; }
- "package: " is actually a namespace
========================================================================
========================================================================
- Installation
```sh
# Make sure you grab the latest version
curl -OL https://github.com/google/protobuf/releases/download/v3.5.1/protoc-3.5.1-linux-x86_64.zip

# Unzip
unzip protoc-3.5.1-linux-x86_64.zip -d protoc3

# Move protoc to /usr/local/bin/
sudo mv protoc3/bin/* /usr/local/bin/

# Move protoc3/include to /usr/local/include/
sudo mv protoc3/include/* /usr/local/include/

# Optional: change owner
sudo chown $USER /usr/local/bin/protoc
sudo chown -R $USER /usr/local/include/google
```
- protoc stands for protocol buffer compiler
- To compile the code, do:
```sh
protoc -I=PROTO_DIR --python_out=PYTHON_DIR PROTO_DIR/simple.proto
```
- -I is the directory to search for imports
- --python_out sets where the generated Python code goes
- use `PROTO_DIR/*.proto` to compile all proto files at once
- Then, you can see `file_pb2.py`
========================================================================
========================================================================
- if a field is set to its default value, then `print(msg)` won't show the field
- if you have two schemas, e.g. `example.proto` and `scalar_types.proto`:
- if they define the same fields in the same order (same tags and types), one can decode data serialized with the other
- if not (the fields differ), you will see an empty msg object, with no errors.
- oneof
- Callback
```c
bool oneof_dec_cb(pb_istream_t* stream, const pb_field_t* field, void** arg) {
    switch (field->tag) {
        case ShelfEvent_shelf_data_tag: {
            // EventData_camera_data_tag
            break;
        }
    }
    return true;
}
```
-
Nanopb would add 4 bytes to the beginning of each message to signify length.
- if you don't remove it, you will see `error parsing message`
-
we use nanopb because:
- otherwise you have to manually allocate memory to serialize and deserialize; nanopb handles that for you. Nanopb was created for protobuf:

```c
Example mymessage = {42};
uint8_t buffer[10];
pb_ostream_t stream = pb_ostream_from_buffer(buffer, sizeof(buffer));
pb_encode(&stream, Example_fields, &mymessage);
```

- it's plain C, compatible with 32-bit microcontrollers.
-
Varint (variable-length int): a small int takes up only 1 byte. `int32`, `sint32` are varints; fixed-size ints are `fixed32`, `sfixed64`.
- use `_DecodeVarint32(buf, starting_pos)` to decode
-
work with proto
```python
def __deserialize_msg(self, msg):
    n = 0
    data_len, data_pos = _DecodeVarint32(msg, n)
    n = data_pos
    data_buf = msg[n:n + data_len]
    n += data_len
    return shelf_event_pb2.ShelfEvent().FromString(data_buf)

def __reserialize_msg(self, proto):
    """Reserialize the proto and return it"""
    serialized = proto.SerializeToString()
    f = BytesIO()
    _EncodeVarint(f.write, len(serialized))
    f.write(serialized)
    return f.getbuffer().tobytes()
```
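As a sketch of what `_DecodeVarint32` does under the hood (illustrative pure-Python version, not the library's actual code): read 7 payload bits per byte, least-significant group first, until a byte with MSB 0.

```python
def decode_varint(buf: bytes, pos: int = 0):
    """Return (value, new_pos) - the same shape as _DecodeVarint32's result."""
    result = shift = 0
    while True:
        b = buf[pos]
        result |= (b & 0x7F) << shift   # low 7 bits carry payload
        pos += 1
        if not (b & 0x80):              # MSB 0 marks the last byte
            return result, pos
        shift += 7

# A length-prefixed frame: varint length, then the payload bytes.
frame = b"\xac\x02" + b"x" * 300        # 0xAC 0x02 is the varint for 300
length, start = decode_varint(frame)
payload = frame[start:start + length]
print(length, len(payload))  # 300 300
```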
========================================================================
========================================================================
- Simple_msg
- Simple_msg.proto
syntax = "proto3"; #must specify proto2 or 3 package example.simple; #This is more like a namespace message SimpleMessage{ #This is a very simple msg int32 id = 1; bool is_simple = 2; string name = 3; repeated int32 sample_list = 4; #repeated means list }
- Simple_demo.py
```python
import simple.simple_pb2 as simple_pb2  # generated python msg

simple_msg = simple_pb2.SimpleMessage()  # create a msg obj
simple_msg.id = 113
simple_msg.is_simple = True
simple_msg.name = "test"

# This is how to fill a repeated field.
# Method 1: append
sample_list = simple_msg.sample_list  # you must get a reference to the list object in simple_msg
sample_list.append(123)               # then you can append stuff
sample_list.append(456)

# Method 2: extend, works for whole sequences, not just appending single values
simple_msg.sample_list.extend([1, 2, 3])

with open("simple.bin", "wb") as f:  # wb means write, binary mode; output is called simple.bin
    byte_string = simple_msg.SerializeToString()
    f.write(byte_string)

with open("simple.bin", "rb") as f:
    simple_msg_read = simple_pb2.SimpleMessage().FromString(f.read())
print(simple_msg_read.sample_list)
```
- Complex Msg
- Complex.proto
syntax="proto3"; package example.complex; message ComplexMsg{ DummyMsg one_dummy = 1; repeated DummyMsg multiple_dummy = 2; } message DummyMsg{ int32 id = 1; string name = 2; }
- Complex.demo.py
```python
import complex_pb2

complex_msg = complex_pb2.ComplexMsg()

# Below is NOT correct: you cannot assign a message field to an object.
# one_dummy_msg = complex_pb2.DummyMsg()
# one_dummy_msg.id = 1
# complex_msg.one_dummy = one_dummy_msg

# Instead, set the nested fields directly:
complex_msg.one_dummy.id = 1
complex_msg.one_dummy.name = "hehe"

# Method 1: add() returns a reference to a new repeated element,
# but add() does not exist for repeated scalar fields
msg1 = complex_msg.multiple_dummy.add()
msg1.id = 1
msg1.name = "hehehe"

# Method 2: add() with keyword args
msg2 = complex_msg.multiple_dummy.add(id=2, name="rj")

# Method 3: not recommended: copying msg3
msg3 = complex_pb2.DummyMsg()
msg3.id = 3
msg3.name = "hehehehe"
complex_msg.multiple_dummy.extend([msg3])

print(complex_msg)
```
========================================================================
========================================================================
- Compilation
```sh
protoc --proto_path=src --cpp_out=build/gen src/foo.proto src/bar/baz.proto
```
========================================================================
========================================================================
-
oneof: set a member field directly: `msg.tag_inside_oneof = ...`
- Only the last field that was set gets to keep its value
- In generated code, the API is the same, except there's one more function for checking the last field.
- each time, a tag is generated in the received msg: `evt.which_event == Field_Tag`
-
Enum: each value is accessed on the generated module directly, no Enum scoping required: `metadata1.node_type = oct_command_pb2.WEIGHT`
-
Field checks
- see if any oneof field is set at all: `if proto.WhichOneof("event"):`
- see whether a specific field is set: `if proto.HasField("announcement"):`
- construct a proto with keyword args (the keywords are required): `new_time_info = shelf_event_pb2.TimeInfo(ts=new_ts, us=new_us)`
- No direct assignment of message fields is allowed in proto; use CopyFrom: `proto.event_data.camera_data.time_info_after.CopyFrom(new_time_info_after)`
- Optional compound object is empty by default.
```python
# TODO
test_proto = shelf_event_pb2.ShelfEvent()
print("===============")
print("test proto: ", test_proto)
test_proto.event_data.CopyFrom(shelf_event_pb2.EventData())  # will show empty object
test_proto.event_data.camera_data.CopyFrom(shelf_event_pb2.CameraData())
print("event_data: ", test_proto.event_data)
print("has camera_data: ", test_proto.event_data.HasField("camera_data"))  # if not initialized, False
print("===============")
```
========================================================================
========================================================================
-
why nanopb over protobuf?
- nanopb is written in ANSI C (the American National Standards Institute C standard), suitable for microcontrollers
- nanopb generator is built on top of Google's protoc
- Two ways to convert .proto to c:
```sh
# If you have downloaded the nanopb binary:
generator-bin/protoc --nanopb_out=. myprotocol.proto

# if not:
protoc --plugin=protoc-gen-nanopb=nanopb/generator/protoc-gen-nanopb ...
```
-
Key functions:
```c
SimpleMessage message = SimpleMessage_init_zero;

/* Create a stream that will write to our buffer. */
pb_ostream_t stream = pb_ostream_from_buffer(buffer, sizeof(buffer));
message.lucky_number = 13;

/* Write to the output stream */
status = pb_encode(&stream, SimpleMessage_fields, &message);
message_length = stream.bytes_written;
```
- nanopb's `pb_encode` encodes a message field by field: for each field it encodes the tag and then the actual value, so the output stream's write callback can fire multiple times per message. The problem is in the send function itself
- if the callback returns false, `pb_encode` stops.
- in one test the callback was called 5 times for a simple message, and 3 times for a shelf_event.
- What about repeated fields?
```cpp
// 1. Set up decoder params
struct decoder_params {
    Event* event;
    bool before;
    decoder_params() : event{}, before{} {}
};

// Decode the "before" packets
decoder_params bparam;
bparam.event = msg.get();
bparam.before = true;

// 2. Register the decode functions needed for camera_data
request->camera_data.packets_before.funcs.decode = &packets_decode;
request->camera_data.packets_before.arg = (void*)&bparam;

// 3. The callback function
bool ShelfConnection::packets_decode(pb_istream_t* stream, const pb_field_t* field, void** arg) {
    decoder_params* param = ((decoder_params*)*arg);
    Event* event = param->event;
    auto data = std::make_unique<PacketData>();
    data->packet_->extrinsics.funcs.decode = extrinsics_decode;
    data->packet_->extrinsics.arg = (void*)&data->extrinsics_;
    data->packet_->data.funcs.decode = packet_decode;
    data->packet_->data.arg = (void*)event;
    if (!pb_decode(stream, Packet_fields, data->packet_.get())) {
        ERROR("CameraPacket decode failure : %s\n", PB_GET_ERROR(stream));
        return false;
    }
    data->type_ = (param->before) ? packet_type::BEFORE : packet_type::AFTER;
    DEBUG("Camera Frame : ID:%d, type:%d, before:%d size:%d, cap ts:%u, enc ts:%u", data->packet_->camera_id,
          data->packet_->frame_type, data->type_, data->packet_->size, data->packet_->cap_ts, data->packet_->enc_ts);
    event->packets_.emplace_back(std::move(data));
    return true;
}
```
-
decoder
- Fun: how is the decode callback defined?

```c
bool (*callback)(pb_istream_t *stream, pb_byte_t *buf, size_t count);
```

- pb_callback_t has callback and arg (in pb_common.h):

```c
decoder_params bparam;
request->camera_data.packets_before.funcs.decode = &packets_decode;
msg.cb_event.arg = (void*)&bparam;

bool ShelfConnection::packets_decode(pb_istream_t* stream, const pb_field_t* field, void** arg) {
    decoder_params* param = ((decoder_params*)*arg); // why void**?
    Event* event = param->event;
}
```
-
Oneof decoding good example
-
we need special encoding & decoding callbacks only for: 1. sub-messages, 2. "array" types with unknown size, such as string and repeated fields, 3. a message-level callback for oneof msgs
- If any data remains in stream, normal decoding will continue.
- E.g
```c
// The proto file
message OneOfMessage {
    option (nanopb_msgopt).submsg_callback = true;
    int32 prefix = 1;
    oneof values {
        int32 intvalue = 5;
        string strvalue = 6 [(nanopb).max_size = 8];
    }
}

bool print_int32(pb_istream_t *stream, const pb_field_t *field, void **arg) {
    uint64_t value;
    if (!pb_decode_varint(stream, &value))
        return false;
    printf("%d", (int)value);
    return true;
}

// The main file
/* The callback below is a message-level callback which is called before each
 * submessage is encoded. It is used to set the pb_callback_t callbacks inside
 * the submessage. The reason we need this is that different submessages share
 * storage inside oneof union, and before we know the message type we can't set
 * the callbacks without overwriting each other. */
bool msg_callback(pb_istream_t *stream, const pb_field_t *field, void **arg) {
    /* Print the prefix field before the submessages.
     * This also demonstrates how to access the top level message fields
     * from callbacks. */
    OneOfMessage *topmsg = field->message;
    printf("prefix: %d\n", (int)topmsg->prefix);

    if (field->tag == OneOfMessage_submsg1_tag) {
        SubMsg1 *msg = field->pData;
        printf("submsg1 {\n");
        msg->array.funcs.decode = print_int32;
        msg->array.arg = " array: %d\n";
    }
    return true;
}

int main() {
    uint8_t buffer[256];
    OneOfMessage msg = OneOfMessage_init_zero;
    pb_istream_t stream;
    size_t count;

    // This is a msg-level callback, called when the submessage tag is known,
    // but before the actual msg is decoded. Set the decoding function there,
    // or simply decode right on the spot.
    msg.cb_values.funcs.decode = msg_callback;
    stream = pb_istream_from_buffer(buffer, count);
    if (!pb_decode(&stream, OneOfMessage_fields, &msg)) {
        return 1;
    }

    /* This is just printing for the test case logic */
    if (msg.which_values == OneOfMessage_intvalue_tag) {
        printf("prefix: %d\n", (int)msg.prefix);
        printf("intvalue: %d\n", (int)msg.values.intvalue);
    }
}
```
-
-
Python: `msg.WhichOneof('ONEOF_FIELD_NAME') == "ONEOF_FIELD_TAG"`
-
Forward compatibility vs backward compatibility. Forward compatibility: old code can read data written by new code; backward compatibility: new code can read data written by old code.
- Certain rules for both:
- Never change a field's tag after it's defined
- Fields can be removed, as long as the tag number is not used again (rename the field instead, e.g. with an "OBSOLETE_" prefix), so future users of your proto can't accidentally reuse the number
- Changing data types can be complicated, so add new fields instead
- When adding fields, think: does the default value make sense to the old code?
- Field name changes are trivial; the tag number is what matters to protobuf!
- When you delete fields, mark them as reserved, so there won't be code conflict.
- This is to prevent this scenario: you somehow still have an old proto, future user loads it.
- Certain rules for both:
- in nanopb 0.4.1, a required field right in front of an optional one will generate a "required field" error message
- not having decode callbacks set for every sub-message of a msg type will yield a segfault
- Bug 2: segfault
- to solve the missing field problem (bug 1), see https://jpa.kapsi.fi/nanopb/docs/whats_new.html
- byte decoding: we need extra decoding for bytes
```cpp
bool ShelfConnection::packet_decode(pb_istream_t* stream, const pb_field_t* field, void** arg) {
    // This is how we get the ptr to the struct for storing the camera data.
    // Event is not automatically generated.
    Event* msg = ((Event*)*arg);
    auto pdata = std::make_unique<std::vector<uint8_t>>(stream->bytes_left);
    if (!pb_read(stream, &(*pdata)[0], stream->bytes_left)) {
        ERROR("CameraPacket data read failure : %s\n", PB_GET_ERROR(stream));
        return false;
    }
    msg->packetsData_.emplace_back(std::move(pdata));
    return true;
}
```
- when printing optional fields, make sure `has_FIELD` is set to true. Otherwise, even though the field is filled, you still can't print it!
- When writing a fixed-size array, you have to set `FIELD_count` to a number.
========================================================================
========================================================================
- A service is a set of endpoints through which your application can be accessed
- you can define a service on top of msgs.
syntax = "proto3" message SearchRequest{ int person_id = 1; } message SearchResponse{ string person_name =1 ; } service SearchService{ rpc Search (SearchRequest) returns (SearchResponse) }
- Protocol Buffer Services need to be interpreted by a framework, to generate associated code.
- A main one is gRPC. Came about with proto3
- you can use any language to generate gRPC servers, clients, and the generated code will send proto request, proto response, etc.
- Micro Service:
- Each contains one function of your business, and they may be written in different languages.
- Micro-Services must agree on:
- data format, error patterns, load-balancing
- One popular choice is REST (HTTP + JSON); another is gRPC
- API: it's a contract, I send you a request, you send me a response
- But it's not easy to build API:
- Data model: JSON, XML, Binary?
- endpoint format: GET /api/v1... POST /api/v1/user..
- How much data in one call?
- Latency?
- Scalability to 1000 clients?
- gRPC:
-
Developed by Google, part of the Cloud Native Computing Foundation (CNCF), which also hosts Kubernetes.
-
Allows you to define a high-level request and response for RPCs (Remote Procedure Calls), and handles the rest for you
- Solves many RPC's problems
- responses and requests are in proto
- In client and server code, an RPC request, response will look JUST LIKE A FUNCTION CALL!
-
Fast, low latency; load balancing and logging are all handled for you
-
========================================================================
========================================================================
- Protobuf is universal because serialization works the same in every language
- Serialization and deserialization are built on top of Varint (variable-length integer).
- VarInt uses a variable number of bytes to represent arbitrarily large values.
- The magic is that only the last "octet" (byte) has MSB 0; all earlier bytes have MSB 1.
- Example: 300 = 0b1 0010 1100 → split into 7-bit groups, least-significant group first: (1)010 1100 (0)000 0010, i.e. bytes 0xAC 0x02
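The encoding direction, as an illustrative Python sketch (not the library's actual implementation): emit 7 bits at a time, setting the continuation bit on every byte except the last.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative int as a protobuf-style varint."""
    out = bytearray()
    while True:
        b = value & 0x7F
        value >>= 7
        if value:
            out.append(b | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(b)         # last octet has MSB 0
            return bytes(out)

print(encode_varint(300).hex())  # ac02
```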
========================================================================
========================================================================
- Intro
- It's an IDL & underlying messaging format:
- IDL (Interface Description Language): allows programs written in different languages to talk to each other.
- e.g. protobuf, REST, JSON
- User Work flow
client.say_hello() -> client stub sends request -> server receives the request, processes it, sends SayHelloResponse back -> client stub receives it, and returns the value to client.say_hello()
-
`.proto` is a text file that defines the service:

```proto
// The greeter service definition.
service Greeter {
    // Sends a greeting
    rpc SayHello (HelloRequest) returns (HelloReply) {}
}

// The request message containing the user's name.
message HelloRequest {
    string name = 1;
}

// The response message containing the greetings
message HelloReply {
    string message = 1;
}
```
- This needs to be compiled by the `protoc` compiler
- `pb2` in the generated file names means protobuf Python API version 2
- A bit about the code:
- server:
- Launches a gRPC server with a thread pool
- Adds the user-supplied servicer code to the server
- Client:
- sets up an insecure channel to the server's IP
- Through the channel, call the server through gRPC.
- Prep steps
```sh
python -m pip install --upgrade pip
python -m pip install grpcio

# Need to install the latest protobuf
pip install --upgrade protobuf
python -m pip install grpcio-tools
```
- Run steps:
- compile
```sh
python -m grpc_tools.protoc -I <PROTO_PATH> --python_out=<MAINFILE_DIR> --pyi_out=<MAIN_FILE_DIR> --grpc_python_out=<MAIN_FILE_DIR> <PROTO_FILE_NAME>
```
- compile
- Functionalities
- Two types of gRPC responses / requests: `stream` vs `unary`
- `insecure_channel` doesn't use TLS (no HTTPS encryption)
- Both server and client can run in `synchronous` and `asynchronous` manner
- status check