RPC

Introduction

A remote procedure call (RPC) is a network programming model, or interprocess communication technique, used for point-to-point communication between software applications, with a client application invoking procedures exposed by a server application. An RPC is analogous to a local function or subroutine call: when an RPC is made, the calling arguments are passed to the remote procedure and the caller waits for the response to be returned.
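
For concreteness, here is a minimal sketch of this call model using Go's standard net/rpc package. The server address, service name (Arith.Multiply) and argument types are placeholders for illustration only, not part of this challenge.

```go
package main

import (
	"fmt"
	"log"
	"net/rpc"
)

// Args is a placeholder argument type; a real service defines its own.
type Args struct {
	A, B int
}

func main() {
	// Connect to a hypothetical RPC server exposing an "Arith" service.
	client, err := rpc.DialHTTP("tcp", "localhost:1234")
	if err != nil {
		log.Fatal("dialing:", err)
	}
	defer client.Close()

	// The remote call reads like a local function call: arguments go out,
	// and the caller blocks until the reply (or an error) comes back.
	var product int
	if err := client.Call("Arith.Multiply", &Args{A: 6, B: 7}, &product); err != nil {
		log.Fatal("call failed:", err)
	}
	fmt.Println("result:", product)
}
```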

RPC has become a foundation of many distributed systems in modern data centers, such as block storage and machine learning, and the efficiency and performance of the RPC layer are critical to the overall performance of cloud services.

gRPC (https://github.com/grpc) is a modern, open-source, high-performance RPC framework that can run in any environment. It can efficiently connect services within and across data centers, with pluggable support for load balancing, tracing, health checking and authentication. It is also applicable in the last mile of distributed computing, connecting devices, mobile applications and browsers to backend services. In this challenge, we will build measurement and diagnosis modules for gRPC.
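
As a point of reference, a unary gRPC call in Go looks like the sketch below. It assumes hypothetical generated stubs (an Echo service with pb.EchoClient, pb.EchoRequest and a Payload field); only the grpc-go calls themselves are real API.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	pb "example.com/hackathon/echo" // hypothetical generated stubs
)

func main() {
	// Dial the gRPC server; plaintext transport is fine for a local test setup.
	conn, err := grpc.Dial("localhost:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	client := pb.NewEchoClient(conn)

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// A unary RPC: the stub call looks like an ordinary function call.
	resp, err := client.Echo(ctx, &pb.EchoRequest{Payload: []byte("ping")})
	if err != nil {
		log.Fatalf("rpc failed: %v", err)
	}
	log.Printf("got %d bytes back", len(resp.Payload))
}
```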

Problems

Problem-1 (Easy): Build a workload generator that can generate representative RPC workloads. The workloads should represent at least two realistic cloud applications, such as block storage, web search or distributed machine learning. One simple workload for testing is 4 KB requests with 256-byte responses.
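
A minimal sketch of such a generator is shown below, reusing the same hypothetical Echo stubs as above. It issues 4 KB requests in an open loop with Poisson arrivals; in a real generator the payload sizes, target rate and arrival process would be configurable per workload.

```go
package main

import (
	"context"
	"log"
	"math/rand"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	pb "example.com/hackathon/echo" // hypothetical generated stubs
)

const (
	reqSize   = 4096   // 4 KB request (the simple test workload)
	targetQPS = 1000.0 // illustrative offered load
)

func main() {
	conn, err := grpc.Dial("localhost:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
	client := pb.NewEchoClient(conn)

	payload := make([]byte, reqSize)
	for {
		// Poisson arrivals: exponential inter-arrival gaps with mean 1/targetQPS.
		gap := time.Duration(rand.ExpFloat64() / targetQPS * float64(time.Second))
		time.Sleep(gap)

		// Fire each request in its own goroutine so slow replies do not
		// throttle the offered load (open-loop behavior).
		go func() {
			start := time.Now()
			if _, err := client.Echo(context.Background(), &pb.EchoRequest{Payload: payload}); err != nil {
				log.Printf("rpc error: %v", err)
				return
			}
			log.Printf("latency: %v", time.Since(start))
		}()
	}
}
```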

Problem-2 (Advanced): Build a monitoring module into gRPC that can profile the latency of each RPC call in its different stages (e.g. memory copy, connection establishment, data transfer, disk I/O).
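
One possible starting point is a gRPC client interceptor, as sketched below. This only captures end-to-end latency; breaking a call into finer stages would additionally require hooking gRPC's stats handlers and instrumenting the server side.

```go
// Sketch of per-call latency monitoring via a unary client interceptor.
package monitor

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
)

// timingInterceptor records how long each unary RPC takes end to end.
func timingInterceptor(
	ctx context.Context,
	method string,
	req, reply interface{},
	cc *grpc.ClientConn,
	invoker grpc.UnaryInvoker,
	opts ...grpc.CallOption,
) error {
	start := time.Now()
	err := invoker(ctx, method, req, reply, cc, opts...)
	log.Printf("method=%s latency=%v err=%v", method, time.Since(start), err)
	return err
}

// Usage: register the interceptor when dialing, e.g.
//   conn, err := grpc.Dial(addr, grpc.WithUnaryInterceptor(timingInterceptor), ...)
```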

Problem-3 (Hard): Extend the monitoring module with a diagnosis function that reports the causes of long tail latency (if any). For instance, it should identify which components contribute to tail latencies and quantify each component's fraction of the overall latency. The components can be (1) congestion or packet loss inside the network; (2) packet buffering or loss on the host NIC; (3) packet buffering or loss in the software stack; (4) delay caused by OS scheduling; (5) delay caused by I/O; and so forth.
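
A rough sketch of what such a report could look like follows. The stage durations are assumed inputs from the monitoring module (and, for the network-, NIC- and scheduling-related components, from sources such as kernel or NIC counters); all names here are illustrative.

```go
package diagnose

import (
	"fmt"
	"time"
)

// StageBreakdown holds measured time attributed to each suspected component.
type StageBreakdown struct {
	Network time.Duration // in-network congestion or loss recovery
	NIC     time.Duration // buffering or loss on the host NIC
	Stack   time.Duration // packet buffering or loss in the software stack
	Sched   time.Duration // OS scheduling delay
	IO      time.Duration // disk or other I/O delay
}

// Report prints each component's fraction of the total latency so that
// the dominant contributor to a tail-latency call stands out.
func Report(total time.Duration, b StageBreakdown) {
	parts := map[string]time.Duration{
		"network": b.Network, "nic": b.NIC, "stack": b.Stack,
		"sched": b.Sched, "io": b.IO,
	}
	for name, d := range parts {
		fmt.Printf("%-8s %6.1f%%\n", name, 100*float64(d)/float64(total))
	}
}
```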

Problem-4 (Bonus): Make the CPU and memory overhead as low as possible, e.g. CPU usage below 5% of one core.
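
One common technique for keeping monitoring overhead low is sampling: fully instrument only one in every N calls and let the rest take the fast path, as in the sketch below. The sampling rate shown is an arbitrary example.

```go
package monitor

import "sync/atomic"

var callCount uint64

const sampleEvery = 100 // instrument roughly 1% of calls

// shouldSample reports whether this call should be fully instrumented.
func shouldSample() bool {
	return atomic.AddUint64(&callCount, 1)%sampleEvery == 0
}
```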

Mentor: Lingjun Zhu (AliCloud) [email protected]