Communication between Coordinator and Data nodes
We can use Kafka to communicate with the data nodes; each data node would have its own topic. This is similar to the Actor pattern.
We can try AutoMQ, which reduces costs compared to Kafka.
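As a rough sketch of the topic-per-node idea, assuming the `rdkafka` crate, a hypothetical `data-node-{id}` topic naming scheme, and an opaque command payload (AutoMQ is Kafka-compatible, so the same client code would apply):

```rust
use std::time::Duration;

use rdkafka::config::ClientConfig;
use rdkafka::producer::{FutureProducer, FutureRecord};

/// Send a command to a specific data node via its dedicated topic,
/// similar to dropping a message into an actor's mailbox.
async fn send_command_to_node(
    producer: &FutureProducer,
    node_id: u64,
    command: &str, // e.g. a JSON-encoded "store shard" command (hypothetical)
) -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical naming scheme: one topic per data node.
    let topic = format!("data-node-{node_id}");
    producer
        .send(
            FutureRecord::to(&topic).key("coordinator").payload(command),
            Duration::from_secs(5),
        )
        .await
        .map_err(|(e, _msg)| e)?;
    Ok(())
}

fn build_producer(brokers: &str) -> Result<FutureProducer, rdkafka::error::KafkaError> {
    ClientConfig::new()
        .set("bootstrap.servers", brokers)
        .set("message.timeout.ms", "5000")
        .create()
}
```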
Alternatively, we can communicate directly over gRPC using Apache Arrow Flight, but this could create congestion.
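A minimal sketch of the gRPC route, assuming the `arrow-flight` crate (which rides on `tonic`) and a hypothetical ticket format that names a shard; none of these identifiers are defined by rfs today:

```rust
use arrow_flight::flight_service_client::FlightServiceClient;
use arrow_flight::Ticket;

/// Fetch a shard's data from a data node over Arrow Flight (gRPC).
async fn fetch_shard(
    node_addr: &str, // e.g. "http://data-node-1:50051" (hypothetical)
    shard_id: &str,  // hypothetical shard identifier used as the ticket
) -> Result<(), Box<dyn std::error::Error>> {
    let mut client = FlightServiceClient::connect(node_addr.to_string()).await?;

    // The ticket tells the data node which shard to stream back.
    let ticket = Ticket { ticket: shard_id.as_bytes().to_vec().into() };
    let mut stream = client.do_get(ticket).await?.into_inner();

    // Each message is a FlightData frame carrying part of the shard.
    while let Some(flight_data) = stream.message().await? {
        println!("received {} body bytes", flight_data.data_body.len());
    }
    Ok(())
}
```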
When a user uploads a file, the coordinator splits it into 64 MB shards and distributes the shards and their replicas to data nodes. The data nodes then use a DHT and BitTorrent to sync the shards, saving the sync status to TiKV.
For the coordinator to catch the case where a data node goes down, we keep a queue of ongoing sync tasks and periodically check their status; if a data node is down, the coordinator allocates its replicas to another node.
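A sketch of that failure-detection loop, assuming an in-memory queue of sync tasks, a heartbeat timestamp per data node, and a hardcoded timeout; a real implementation would read the task queue and node liveness from TiKV and notify the new node:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct SyncTask {
    shard_id: String,
    node_id: u64, // data node currently responsible for this replica
}

/// Periodically scan ongoing sync tasks; if the responsible data node has
/// not heartbeated recently, hand its replica to another node.
async fn monitor_sync_tasks(
    mut tasks: Vec<SyncTask>,
    last_heartbeat: HashMap<u64, Instant>,
    alive_nodes: Vec<u64>,
) {
    let node_timeout = Duration::from_secs(30); // hypothetical threshold
    let mut ticker = tokio::time::interval(Duration::from_secs(10));

    loop {
        ticker.tick().await;
        for task in &mut tasks {
            let down = last_heartbeat
                .get(&task.node_id)
                .map_or(true, |t| t.elapsed() > node_timeout);
            if down {
                // Reallocate the replica to any node we still consider alive.
                if let Some(&new_node) =
                    alive_nodes.iter().find(|&&n| n != task.node_id)
                {
                    println!(
                        "node {} is down, moving shard {} to node {}",
                        task.node_id, task.shard_id, new_node
                    );
                    task.node_id = new_node;
                    // A real implementation would persist this to TiKV and
                    // tell the new node to start syncing the shard.
                }
            }
        }
    }
}
```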
To optimize transfer, we can implement the transport layer of the BitTorrent client using QUIC with Direct I/O or zero-copy (https://chatgpt.com/share/42962c3a-992f-49d6-a774-478be7faff4a). This moves data from disk directly to the network layer without copying it through userspace buffers, minimizing CPU usage.
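As an illustration of the zero-copy idea, a minimal sketch using the Linux `sendfile(2)` syscall via the `libc` crate to push shard bytes from a file straight to a TCP socket without a userspace copy. Note that kernel `sendfile` only works with kernel sockets; a userspace QUIC stack (e.g. quinn) would need its own read path, possibly combined with Direct I/O (`O_DIRECT`):

```rust
use std::fs::File;
use std::net::TcpStream;
use std::os::unix::io::AsRawFd;

/// Send an entire shard file over an established TCP connection using
/// sendfile(2), so the bytes never pass through a userspace buffer.
fn send_shard_zero_copy(shard: &File, socket: &TcpStream) -> std::io::Result<()> {
    let len = shard.metadata()?.len() as usize;
    let mut offset: libc::off_t = 0;

    while (offset as usize) < len {
        let remaining = len - offset as usize;
        // sendfile advances `offset` by the number of bytes it wrote.
        let sent = unsafe {
            libc::sendfile(
                socket.as_raw_fd(), // out_fd: the TCP socket
                shard.as_raw_fd(),  // in_fd: the shard file
                &mut offset,
                remaining,
            )
        };
        if sent < 0 {
            return Err(std::io::Error::last_os_error());
        }
        if sent == 0 {
            break; // nothing more to send
        }
    }
    Ok(())
}
```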