protobuf - RicoJia/notes GitHub Wiki

========================================================================

Proto Basics

========================================================================

  • Motivation: we want to transmit class objects by serializing them into binary. Protobuf generates serializable data structures in different languages

    • e.g. generated files: .pb.h for C++, _pb2.py for Python
    • challenge: the writing and reading code must agree exactly on memory layout and endianness
  • Choices

    • CSV: types are inferred, not guaranteed. Good for visualization, though
    • JSON: JavaScript Object Notation
      • pros: data can take any form, widely accepted on the web, can be read by most languages
      • cons:
        • no schema so you can put anything anywhere, JSON won't complain.
        • No documentation, no metadata
        • Can be quite big, due to repeated keys
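        • To make the repeated-key overhead concrete, here's a small stdlib-only sketch (the record shape is made up; exact sizes will vary):

```python
import json

# Each record repeats the literal keys "id" and "first_name",
# so key overhead grows linearly with the number of records.
records = [{"id": i, "first_name": "Rico"} for i in range(100)]
encoded = json.dumps(records).encode("utf-8")

# Rough size of the actual payload if keys were stored once in a schema:
payload_only = sum(len(str(r["id"])) + len(r["first_name"]) for r in records)

print(len(encoded), payload_only)  # the JSON blob is several times larger
```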
    • protobuf
      • defined in a .proto text file
          example.proto 
          syntax = "proto3";
        
          message Msg{
              int32 id = 1; 
              string first_name = 2; 
            }
        
        • Pros:
          • compact binary encoding (less bandwidth and CPU)
            • reportedly ~3x smaller, ~20x faster than XML
          • schema: needed to generate code and read data
          • fully typed
          • can be read by all main languages
          • schema can evolve over time
          • code is generated automatically!
        • Cons:
          • support for some languages might be lacking
          • can't open serialized data with a text editor, since it's binary-encoded
        • Applications: RPC frameworks, such as gRPC to exchange data.
          • proto3 was released by Google in 2016
  • Links:

    1. add person
    2. more protos
    3. Many examples
  • Data types

    • float: 32 bits, double: 64 bits
      • sint32, sint64 are more efficient with negative numbers than int32, int64.
        • They use a technique called "zigzag": 0->0, -1->1, 1->2, -2->3, ... so small-magnitude values stay small. A negative value in a plain int32/int64 varint is always 10 bytes long.
        • By contrast, JSON would encode 26 as two character bytes, '2' and '6'.
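        • A minimal sketch of zigzag encoding in Python (my own illustration, not the protobuf library's code):

```python
def zigzag_encode(n: int) -> int:
    """Map signed ints to small unsigned ints: 0->0, -1->1, 1->2, -2->3, ..."""
    return (n << 1) ^ (n >> 63)   # arithmetic right shift smears the sign bit

def zigzag_decode(z: int) -> int:
    return (z >> 1) ^ -(z & 1)

# Small magnitudes map to small codes, so the varint stays short:
print([zigzag_encode(n) for n in (0, -1, 1, -2, 2)])  # [0, 1, 2, 3, 4]
assert all(zigzag_decode(zigzag_encode(n)) == n for n in range(-100, 100))
```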
    • map
      • map<string, Result> results = 2;
      • a map field cannot be repeated
      • there is no ordering guarantee for a map
    • Time
      • Google.Protobuf.WellKnownTypes.Duration
      • Google.Protobuf.WellKnownTypes.Timestamp
    • bytes: small image
    • enum: use when you know all the possible values in advance
      • in proto2, there's no implicit default; a required field tells you a value must be present
      • in proto3, every field gets a default, so be careful with default values
      syntax = "proto3"
        message Msg{
          enum EyeColor{
              UNKNOWN_EYE_COLOR = 0;    // this is the default value of "EyeColor"
              GREEN = 1; 
              BLACK = 2;      // good to have all caps
            }
          EyeColor eye_color = 8;       // this is the "instance of enum"
          }
      
      • C++: call the generated accessor, e.g. msg->eye_color(), to read the value
  • Tags

    • tags 1 to 15 use 1 byte
    • tags 16 to 2047 use 2 bytes
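    • This follows from the field key being the varint `(tag << 3) | wire_type` (3 bits reserved for the wire type); a quick self-contained check:

```python
def key_size(tag: int, wire_type: int = 0) -> int:
    """Bytes needed to varint-encode the field key (tag << 3) | wire_type."""
    key = (tag << 3) | wire_type
    size = 1
    while key > 0x7F:    # each varint byte carries 7 payload bits
        key >>= 7
        size += 1
    return size

print(key_size(15), key_size(16), key_size(2047), key_size(2048))  # 1 2 2 3
```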
  • Fields:

    • repeated: for an array or list; can hold any number of elements (0 included)
      • in nanopb, a custom callback function is needed if no max size is specified
    • // means comments
    • default values are implicit in proto3, but not in proto2 (which has optional, required, etc.)
      • bool: false
      • string: empty string
      • enum: the first value (which must be 0)
      • repeated: empty list
  • Naming Conventions: follow these so the generated code stays consistent across languages.

    • CamelCase for msgs
    • underscore_separated_names for fields
    • CamelCase for enum names, CAPITALS_WITH_UNDERSCORES for enum value names
    • Uber's proto guide

Multiple msg definition

  • You can have one message contain another
      message Person {
          ...
          Date birthday = 2;    // Note: the Date msg is defined below; the tag number 2 here is just an example
        }
      message Date{
    
        }
    
  • Or
      message Msg{
          message Address{
              string addr = 1;
            }
          repeated Address addr = 10;
        }
    
  • import a file: import "dir/3-date.proto";
    • package is effectively a namespace
      //File 1
      syntax = "proto3";
      package myData;
      
      //File2
      import "sub_dir/file.proto";
      message Person{
          myData.Date date = 7; 
        }
      

========================================================================

Installation and compilation

========================================================================

  • Installation
      # Make sure you grab the latest version
      curl -OL https://github.com/google/protobuf/releases/download/v3.5.1/protoc-3.5.1-linux-x86_64.zip
      # Unzip
      unzip protoc-3.5.1-linux-x86_64.zip -d protoc3
      # Move protoc to /usr/local/bin/
      sudo mv protoc3/bin/* /usr/local/bin/
      # Move protoc3/include to /usr/local/include/
      sudo mv protoc3/include/* /usr/local/include/
      # Optional: change owner
      sudo chown $USER /usr/local/bin/protoc
      sudo chown -R $USER /usr/local/include/google
    
  • protoc stands for "protocol buffer compiler"
  • To compile the code, do: protoc -I=PROTO_DIR --python_out=PYTHON_DIR PROTO_DIR/simple.proto
    • -I is the directory to search for imports
    • --python_out is the output directory for the generated Python code
    • use PROTO_DIR/*.proto to compile every proto in a directory
    • Then, you can see simple_pb2.py

========================================================================

Encoding, Decoding

========================================================================

test script

  • if a field is set to its default value, print(msg) won't show that field

Decoding

  1. if you have two schemas, example.proto and scalar_types.proto:
    • if they define the same fields with the same tags, data from one can be decoded with the other
    • if not (two different sets of fields), you will see an empty msg object, with no errors.

Cpp

  1. oneof
  • Callback
      bool oneof_dec_cb(pb_istream_t* stream, const pb_field_t* field, void** arg){
        switch (field->tag) {
          case ShelfEvent_shelf_data_tag: {   //EventData_camera_data_tag 
            // set up callbacks for this submessage here
            break;
          }
        }
        return true;
      }
  1. Nanopb would add 4 bytes to the beginning of each message to signify its length. Link

    • if you don't remove it, you will see error parsing message
  2. we use nanopb because:

    1. you'd otherwise have to manually allocate buffers to serialize and deserialize; nanopb handles that for you. Nanopb is a small protobuf implementation made for embedded use.
      Example mymessage = {42};
      uint8_t buffer[10];
      pb_ostream_t stream = pb_ostream_from_buffer(buffer, sizeof(buffer));
      pb_encode(&stream, Example_fields, &mymessage);
    2. It's plain C, compatible with 32-bit microcontrollers.
  3. Varint: variable-length int: a small int takes only 1 byte. int32 and sint32 are varints; the fixed-width types are fixed32, sfixed64.

    • use _DecodeVarint32 (buf, starting_pos) to decode
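    • `_DecodeVarint32` comes from `google.protobuf.internal.decoder`; a self-contained equivalent (my own sketch) shows what it does:

```python
def decode_varint(buf: bytes, pos: int = 0):
    """Return (value, new_pos). Bytes carry 7 bits each, least-significant
    group first; a set MSB means more bytes follow."""
    value, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if not b & 0x80:       # MSB 0 -> this was the last byte
            return value, pos
        shift += 7

print(decode_varint(b"\xac\x02"))  # (300, 2): 300 takes two bytes
```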
  4. work with proto

    # needs: from io import BytesIO
    # needs: from google.protobuf.internal.decoder import _DecodeVarint32
    # needs: from google.protobuf.internal.encoder import _EncodeVarint
    def __deserialize_msg(self, msg): 
        # strip the varint length prefix, then parse the payload
        n = 0
        data_len, data_pos = _DecodeVarint32(msg, n)
        n = data_pos
        data_buf = msg[n:n + data_len]
        n += data_len
        return shelf_event_pb2.ShelfEvent().FromString(data_buf)
    
    def __reserialize_msg(self, proto): 
        """
        Prefix the serialized proto with its varint length and return the bytes
        """
        serialized = proto.SerializeToString()
        f = BytesIO()
        _EncodeVarint(f.write, len(serialized))
        f.write(serialized)
        return f.getbuffer().tobytes() 

========================================================================

Sample Msgs

========================================================================

  1. Simple_msg
  • Simple_msg.proto
    syntax = "proto3";        #must specify proto2 or 3
    package example.simple;   #This is more like a namespace
    message SimpleMessage{    #This is a very simple msg
        int32 id = 1; 
        bool is_simple = 2; 
        string name = 3; 
        repeated int32 sample_list = 4; #repeated means list
      }
  • Simple_demo.py
    import simple.simple_pb2 as simple_pb2              # generated python msg
    simple_msg = simple_pb2.SimpleMessage()             # Create a msg obj
    simple_msg.id = 113
    simple_msg.is_simple = True
    simple_msg.name = "test"
    
    
    # this is how to build a repeated field
    #Method 1: append copies the value into the message
    sample_list = simple_msg.sample_list                #You must get a reference to the list object in simple_msg
    sample_list.append(123)                             #Then you can append values
    sample_list.append(456)
    
    #Method 2: extend with an iterable
    simple_msg.sample_list.extend([1,2,3])
    
    with open("simple.bin", "wb") as f:                 #wb means write, binary mode, output is called simple.bin
        byteString = simple_msg.SerializeToString()     
            f.write(byteString)
    
    with open("simple.bin", "rb") as f: 
        simple_msg_read = simple_pb2.SimpleMessage().FromString(f.read())
        print(simple_msg_read.sample_list)
  2. Complex Msg
    • Complex.proto
      syntax="proto3";
      package example.complex;
      message ComplexMsg{
        DummyMsg one_dummy = 1;
        repeated DummyMsg multiple_dummy = 2;
        }
      message DummyMsg{
        int32 id = 1;
        string name = 2;
        }
    • Complex.demo.py
        import complex_pb2
        complex_msg = complex_pb2.ComplexMsg()
      
        #Below is not correct: you cannot assign a composite field directly;
        #set its sub-fields, or use CopyFrom(). 
        # one_dummy_msg = complex_pb2.DummyMsg()
        # one_dummy_msg.id = 1
        # complex_msg.one_dummy = one_dummy_msg
      
        complex_msg.one_dummy.id = 1
        complex_msg.one_dummy.name = "hehe"
      
        #Method 1: add() returns a ref to a new element (only repeated message fields have add(), not scalars)
        msg1 = complex_msg.multiple_dummy.add()
        msg1.id = 1
        msg1.name = "hehehe"
      
        #Method 2: add() with keyword args
        msg2 = complex_msg.multiple_dummy.add(id=2, name="rj")
      
        #Method 3: not recommended: copying msg3
        msg3 = complex_pb2.DummyMsg()
        msg3.id = 3
        msg3.name = "hehehehe"
        complex_msg.multiple_dummy.extend([msg3])
      
        print(complex_msg)

========================================================================

Protobuf CPP

========================================================================

  1. Compilation protoc --proto_path=src --cpp_out=build/gen src/foo.proto src/bar/baz.proto

========================================================================

Protobuf Python

========================================================================

  1. oneof

    msg.tag_inside_oneof = ...
    • Only the last field that was set keeps its value
    • In generated code, the API is the same, except there's one extra function for checking which field was last set.
      • the received msg carries a tag identifying the field that is set
        evt.which_event == Field_Tag
        
  2. Enum: enum values are accessed directly on the generated module; no separate Enum wrapper is required

    metadata1.node_type = oct_command_pb2.WEIGHT
  3. Field checks

    1. see if any oneof field is set at all:
      if proto.WhichOneof("event"):
    2. see whether a specific field is set:
      if proto.HasField("announcement"):
    3. construct a proto with keyword args (keywords are required):
      new_time_info = shelf_event_pb2.TimeInfo(ts=new_ts, us=new_us)
    4. direct assignment to a message field is not allowed; use CopyFrom:
      proto.event_data.camera_data.time_info_after.CopyFrom(new_time_info_after)
    5. Optional compound object is empty by default.
      test_proto = shelf_event_pb2.ShelfEvent()
      print("===============")
      print("test proto: ", test_proto)
      test_proto.event_data.CopyFrom(shelf_event_pb2.EventData()) #will show empty object
      test_proto.event_data.camera_data.CopyFrom(shelf_event_pb2.CameraData())
      print("event_data: ", test_proto.event_data)
      print("has camera_data: ", test_proto.event_data.HasField("camera_data")) #if not initialized, false
      print("===============")

========================================================================

Nanopb

========================================================================

  1. why nanopb over protobuf?

    • nanopb is written in ANSI C (the original standardized C), making it suitable for microcontrollers
    • nanopb generator is built on top of Google's protoc
    • Two ways to convert .proto to c:
      # If you have downloaded nanopb binary: 
      generator-bin/protoc --nanopb_out=. myprotocol.proto
      #if not, 
      protoc --plugin=protoc-gen-nanopb=nanopb/generator/protoc-gen-nanopb ...
  2. Key functions:

    SimpleMessage message = SimpleMessage_init_zero;
    /* Create a stream that will write to our buffer. */
    pb_ostream_t stream = pb_ostream_from_buffer(buffer, sizeof(buffer));
    message.lucky_number = 13;
    // write to the output stream
    status = pb_encode(&stream, SimpleMessage_fields, &message);
    message_length = stream.bytes_written;

pb_encode(&output_stream, SimpleMessage_fields, &msg)

  • nanopb's pb_encode encodes a message field by field: for each field it encodes the tag and then the value, so the write callback may fire multiple times per message. Any problem is usually in the send function itself
  • if the callback returns false, pb_encode stops.
    • e.g. in one test the callback was called 5 times for a simple message and 3 times for shelf_event.

pb_decode

  • What about repeated fields?
// 1. Define a params struct that will be passed to the decode callback
struct decoder_params {
  Event* event;
  bool before;
  decoder_params() : event{}, before{} {}
};

// Decode the "before" packets
decoder_params bparam;
bparam.event = msg.get();
bparam.before = true;

// 2. Register the decode callback for camera_data
request->camera_data.packets_before.funcs.decode = &packets_decode;
request->camera_data.packets_before.arg = (void*)&bparam;

// callback function
bool ShelfConnection::packets_decode(pb_istream_t* stream, const pb_field_t* field, void** arg) {
    decoder_params* param = ((decoder_params*)*arg);
    Event* event = param->event;
    auto data = std::make_unique<PacketData>();
    data->packet_->extrinsics.funcs.decode = extrinsics_decode;
    data->packet_->extrinsics.arg = (void*)&data->extrinsics_;

    data->packet_->data.funcs.decode = packet_decode;
    data->packet_->data.arg = (uint32_t*)event;

    if (!pb_decode(stream, Packet_fields, data->packet_.get())) {
        ERROR("CameraPacket decode failure : %s\n", PB_GET_ERROR(stream));
        return false;
    }

    data->type_ = (param->before) ? packet_type::BEFORE : packet_type::AFTER;

    DEBUG("Camera Frame : ID:%d, type:%d, before:%d size:%d, cap ts:%u, enc ts:%u", data->packet_->camera_id,
          data->packet_->frame_type, data->type_, data->packet_->size, data->packet_->cap_ts, data->packet_->enc_ts);

    event->packets_.emplace_back(std::move(data));
    return true;
}
  • decoder

    • Fun: how is the stream read callback defined?
      bool (*callback)(pb_istream_t *stream, pb_byte_t *buf, size_t count);
    • pb_callback_t bundles the callback and its arg
        decoder_params bparam;
        request->camera_data.packets_before.funcs.decode = &packets_decode;
        request->camera_data.packets_before.arg = (void*)&bparam;
        bool ShelfConnection::packets_decode(pb_istream_t* stream, const pb_field_t* field, void** arg){
          decoder_params* param = ((decoder_params*)*arg);    // arg is void** so the callback can reassign it
          Event* event = param->event;
        }
  • Oneof decoding good example

    • we need special encoding & decoding callbacks only for: 1. sub-messages, 2. "array" types with unknown size, such as string and repeated fields, 3. a msg-level callback for oneof msgs
      • If any data remains in stream, normal decoding will continue.
      • E.g
        // The proto file
        message OneOfMessage{
            option (nanopb_msgopt).submsg_callback = true;
            int32 prefix = 1;
            oneof values{
                int32 intvalue = 5;
                string strvalue = 6 [(nanopb).max_size = 8];
            }
        }
      
        bool print_int32(pb_istream_t *stream, const pb_field_t *field, void **arg) {
            uint64_t value;
            if (!pb_decode_varint(stream, &value))
                return false;
      
            printf((char*)*arg, (int)value);   // *arg carries the format string
            return true;
        }
        // The main file
        /* The callback below is a message-level callback which is called before each
         * submessage is encoded. It is used to set the pb_callback_t callbacks inside
         * the submessage. The reason we need this is that different submessages share
         * storage inside oneof union, and before we know the message type we can't set
         * the callbacks without overwriting each other.
         */
        bool msg_callback(pb_istream_t *stream, const pb_field_t *field, void **arg)
        {
            /* Print the prefix field before the submessages.
             * This also demonstrates how to access the top level message fields
             * from callbacks.
             */
            OneOfMessage *topmsg = field->message;
            printf("prefix: %d\n", (int)topmsg->prefix);
            if (field->tag == OneOfMessage_submsg1_tag)
            {
                SubMsg1 *msg = field->pData;
                printf("submsg1 {\n");
                msg->array.funcs.decode = print_int32;
                msg->array.arg = "  array: %d\n";
            }
        }
      
        int main(){
          uint8_t buffer[256];
          OneOfMessage msg = OneOfMessage_init_zero;
          pb_istream_t stream;
          size_t count;
      
          // this is a msg-level callback, called when the submessage tag is known but before the actual msg is decoded. Set the decoding functions here, or simply decode right on the spot. 
          msg.cb_values.funcs.decode = msg_callback;
          stream = pb_istream_from_buffer(buffer, count);
          if (!pb_decode(&stream, OneOfMessage_fields, &msg)){
              return 1;
          }
      
          /* This is just printing for the test case logic */
          if (msg.which_values == OneOfMessage_intvalue_tag) {
              printf("prefix: %d\n", (int)msg.prefix);
              printf("intvalue: %d\n", (int)msg.values.intvalue);
          }
      
        }
  • Python: msg.WhichOneof('ONEOF_FIELD_NAME') returns the name of the field that is set, so compare it to a field name string

  • Forward vs backward compatibility: forward compatibility means old code can read data produced by new code; backward compatibility means new code can read data produced by old code.

    • Certain rules for both:
      • Never change a field's tag after it's defined. Fields can be removed as long as the tag number is never reused (rename the field instead, e.g. by adding "OBSOLETE_"), so future users of your proto can't accidentally reuse the number
      • Changing data types can be complicated, so add new fields instead
      • When adding fields, ask: does the default value make sense to the old code?
      • Field name changes are trivial; the tag number is what matters to protobuf!
      • When you delete fields, mark them as reserved, so there won't be conflicts.
        • This prevents the scenario where someone still has an old proto and a future user loads it.
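A sketch of how reserved declarations look in a .proto file (the names and numbers here are only illustrative):

```proto
message Person {
  reserved 2, 15, 9 to 11;    // deleted tag numbers: protoc now rejects any reuse
  reserved "foo", "bar";      // deleted field names, so they can't come back either
  string first_name = 1;
}
```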

Cautions

  • in nanopb 0.4.1, a required field placed right before an optional one will generate a "required field missing" message
  • not providing decode callbacks for every needed field of a msg type will yield a segfault.
  • Bug 2: segfault
    • solve the missing field problem (bug 1), https://jpa.kapsi.fi/nanopb/docs/whats_new.html
      • byte decoding: we need extra decoding for bytes
          bool ShelfConnection::packet_decode(pb_istream_t* stream, const pb_field_t* field, void** arg) {
              // This is how we get the ptr to the struct for storing the camera data
              // Event is not automatically generated
              Event* msg = ((Event*)*arg);
              auto pdata = std::make_unique<std::vector<uint8_t>>(stream->bytes_left);
        
              if (!pb_read(stream, &(*pdata)[0], stream->bytes_left)) {
                  ERROR("CameraPacket data read failure : %s\n", PB_GET_ERROR(stream));
                  return false;
              }
        
              msg->packetsData_.emplace_back(std::move(pdata));
              return true;
          }
  • when reading optional fields, make sure has_FIELD is set to true. Otherwise, even though the field is filled, you still can't read it!
  • When writing a fixed-size array, you must set FIELD_count to the number of elements.

========================================================================

Protocol Buffer Services and gRPC

========================================================================

  • A service is a set of endpoints through which your application can be accessed
  • you can define a service on top of msgs.
      syntax = "proto3"
      message SearchRequest{
          int person_id = 1; 
        }
      message SearchResponse{
          string person_name =1 ; 
        }
      service SearchService{
          rpc Search (SearchRequest) returns (SearchResponse)
        }
    
  • Protocol Buffer Services need to be interpreted by a framework, to generate associated code.
    • The main one is gRPC, which came out alongside proto3
  • you can use any language to generate gRPC servers, clients, and the generated code will send proto request, proto response, etc.
  • Micro Service:
    • Each one implements a piece of your business function and may be written in a different language.
    • Micro-Services must agree on:
      • data format, error patterns, load-balancing
      • One popular is REST (HTTP-JSON), another is gRPC
        • API: it's a contract, I send you a request, you send me a response
          • But it's not easy to build API:
            • Data model: JSON, XML, Binary?
            • end point format: GET /api/v1... post /api/v1/user..
            • How much data in one call?
            • Latency?
            • Scalability to 1000 clients?
        • gRPC:
          • Developed by Google; part of the Cloud Native Computing Foundation (CNCF), which also hosts projects such as Kubernetes.

          • Allows you to define a high level Request, response for RPC (Remote Procedure Calls), and handles the rest for you

            • Solves many RPC's problems
            • response, requests are in proto
            • In client and server code, an RPC request, response will look JUST LIKE A FUNCTION CALL!


          • Fast, low latency, load-balancing, logging, all handled for u

========================================================================

Theory

========================================================================

  1. Protobuf is universal because serialization works the same in every language
  • Serialization and deserialization are built on top of varints (variable-length integers).
    • A varint uses a variable number of bytes to represent arbitrarily large values.
      • The trick: only the last "octet" (byte) has MSB 0; every earlier byte has MSB 1.
      • Example
          300 = 0b1_0010_1100
          => 7-bit groups, least-significant first: 0101100, 0000010
          => bytes: (1)010 1100 (0)000 0010 = 0xAC 0x02

========================================================================

gRPC basics

========================================================================

  1. Intro
    • It's an IDL & an underlying messaging format:
      • IDL (Interface Description Language): allows programs written in different languages to talk to each other.
      • e.g. protobuf, REST, JSON
    • User workflow
      client.say_hello() -> client stub sends the request -> server receives the request, processes it, sends SayHelloResponse back -> client stub receives it and returns the value to client.say_hello()
    • .proto is a text file that defines the service
      // The greeter service definition.
      service Greeter {
        // Sends a greeting
        rpc SayHello (HelloRequest) returns (HelloReply) {}
      }
      
      // The request message containing the user's name.
      message HelloRequest {
        string name = 1;
      } 
      
      // The response message containing the greetings
      message HelloReply {
        string message = 1;
      }
      
    • This needs to be compiled by the protoc compiler
    • the "2" in pb2 means version 2 of the protobuf Python API
  2. A bit about the code:
    • server:
      1. Launches a grpc server with a threadpool
      2. Add the user-supplied server code to the server
    • Client:
      1. set up insecure channel with IP
      2. Through the channel, call the server through gRPC.
  3. Prep steps
    python -m pip install --upgrade pip
    python -m pip install grpcio
    # Need to install the latest protobuf
    pip install --upgrade protobuf
    python -m pip install grpcio-tools
    
  4. Run steps:
    1. compile
      python -m grpc_tools.protoc -I <PROTO_PATH> --python_out=<MAINFILE_DIR> --pyi_out=<MAIN_FILE_DIR> --grpc_python_out=<MAIN_FILE_DIR> <PROTO_FILE_NAME>
      
  5. Functionalities
    • Two types of gRPC responses / requests: stream, vs unary
    • insecure_channel means no TLS encryption on the channel
    • Server and client can each run in synchronous or asynchronous mode
    • status check, Link