# The big picture
This is the high-level view of the implementation. It needs to be broken into smaller tasks, each documented in the wiki.
## Coordinator nodes
- We will have a cluster of coordinator nodes using Raft, or, if we don't want the penalty of a single master at a time, we might explore CRDTs (built with datacake or crdts), Redis Sets, or a distributed Bloom Filter to enforce constraints like the uniqueness of filenames within the same folder (see the sketch after the links below).
- These nodes will handle operations on metadata and will then forward the user to data nodes for accessing the actual file content.
- https://github.com/radumarias/rfs/wiki/Coordinator-nodes-cluster
- https://github.com/radumarias/rfs/wiki/Coordinator-and-Data-nodes
- https://github.com/radumarias/rfs/wiki/Communication-between-Coordinator-and-Data-nodes
- https://github.com/radumarias/rfs/wiki/Coordinator-API
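As a rough illustration of the uniqueness constraint above, here is a minimal Rust sketch using a plain in-memory set keyed by `(folder_id, file_name)`. In the cluster this check would be backed by a Redis Set or a distributed Bloom Filter instead; the type and method names here are hypothetical.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory stand-in for the distributed structure
/// (Redis Set or Bloom Filter) that would back this check in the cluster.
#[derive(Default)]
struct FolderIndex {
    names: HashSet<(u64, String)>, // (folder_id, file_name)
}

impl FolderIndex {
    /// Returns true if the name was free and is now reserved,
    /// false if a file with that name already exists in the folder.
    fn try_reserve(&mut self, folder_id: u64, name: &str) -> bool {
        self.names.insert((folder_id, name.to_string()))
    }
}

fn main() {
    let mut index = FolderIndex::default();
    assert!(index.try_reserve(1, "report.pdf"));
    assert!(!index.try_reserve(1, "report.pdf")); // duplicate in same folder rejected
    assert!(index.try_reserve(2, "report.pdf"));  // same name in another folder is fine
}
```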
## Data nodes
- These will handle the actual file content, sharding and replication. They will form a cluster similar to the one above, with Raft or some masterless solution (a minimal replica-placement sketch follows the links below).
- After the client defines the metadata, it will be redirected to these nodes to work with the actual file content.
- https://github.com/radumarias/rfs/wiki/Coordinator-and-Data-nodes
- https://github.com/radumarias/rfs/wiki/Communication-between-Coordinator-and-Data-nodes
- https://github.com/radumarias/rfs/wiki/Data-node-API
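To make the sharding and replication idea a bit more concrete, below is a small, hypothetical sketch of how a coordinator might pick the data nodes that hold the replicas of a shard, using rendezvous (HRW) hashing; the actual placement strategy is still to be decided.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical placement: rank nodes by hash(shard, node) and take the top `replicas`.
/// This is rendezvous (HRW) hashing; the real coordinator may use a different scheme.
fn pick_replicas(shard_id: u64, nodes: &[&str], replicas: usize) -> Vec<String> {
    let mut ranked: Vec<(u64, &str)> = nodes
        .iter()
        .map(|node| {
            let mut h = DefaultHasher::new();
            (shard_id, *node).hash(&mut h);
            (h.finish(), *node)
        })
        .collect();
    ranked.sort_by(|a, b| b.0.cmp(&a.0)); // highest score first
    ranked
        .into_iter()
        .take(replicas)
        .map(|(_, n)| n.to_string())
        .collect()
}

fn main() {
    let nodes = ["data-1", "data-2", "data-3", "data-4"];
    println!("{:?}", pick_replicas(42, &nodes, 3));
}
```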
## Communication between nodes
- gRPC is good for inter-service (node) communication. Protobuf is interesting; it's a very good protocol for serialization and communication.
- The Apache Arrow format and Arrow Flight, which communicates over gRPC, are also interesting. One advantage is that Flight eliminates serialization: it basically sends the internal in-memory representation over the wire and reads directly from that, which seems worth using.
- Extending from the above, it would be interesting to have Arrow Flight with RDMA; there is a feature request for this, but it's still in progress. In the meantime we can start using Apache Arrow Flight over gRPC.
- Services could also communicate over Kafka, Pulsar, RabbitMQ or any other Pub/Sub system. At least messages intended for all nodes could be transmitted like that; for direct messages between nodes (some file sync state messages) we can use gRPC (see the broadcast sketch after the links below).
- https://github.com/radumarias/rfs/wiki/Communication-between-Coordinator-and-Data-nodes
- https://github.com/radumarias/rfs/wiki/Coordinator-and-Data-nodes
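As a sketch of the pub/sub idea (messages intended for all nodes), here is what publishing a cluster-wide event over Kafka with the rdkafka crate could look like. The broker address, topic name and payload shape are assumptions, not decisions.

```rust
use rdkafka::config::ClientConfig;
use rdkafka::producer::{FutureProducer, FutureRecord};
use std::time::Duration;

// Broadcast a cluster-wide event (e.g. "node joined") over Kafka;
// direct node-to-node calls would go over gRPC instead.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let producer: FutureProducer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092") // assumed broker address
        .create()?;

    producer
        .send(
            FutureRecord::to("cluster-events") // hypothetical topic name
                .key("data-node-7")
                .payload(r#"{"event":"node_joined","node":"data-node-7"}"#),
            Duration::from_secs(5),
        )
        .await
        .map_err(|(e, _msg)| e)?;

    Ok(())
}
```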
## Storing metadata
This is the synchronized DB that keeps metadata which all nodes need to access.
- SurrealDB seems very interesting: it's multi-model, and based on your use case it uses a different solution underneath. They have a distributed version built on TiKV, a key-value store, which is good enough for us.
- Other solutions are CockroachDB, which has strong consistency (it uses the Raft consensus algorithm to ensure it), and even Apache Cassandra, though that may be too big for our needs. In the end the content size of the files would far exceed the metadata.
- Raft is widely used, very performant and an easy-to-understand algorithm.
- ZooKeeper is a good distributed solution for service discovery and configuration management. There are better ones now, like ClickHouse Keeper and KRaft, which is used in Kafka (I assume we could use that one too). And there is etcd, which is used in Kubernetes.
- For a multi-master configuration, when something changes, all masters will need to agree before committing the change. This would add latency, so I've read about CRDTs (Conflict-free Replicated Data Types) and Eventual Consistency, which sounds like just what we need (a toy CRDT sketch follows the link below). There are crates in Rust for these, so we're covered. We just need to see how the DB and configuration management support this (CockroachDB uses Raft, which seems to); if not, we will go with what they have. Even if there's a little more latency, at least the file metadata is safe.
- https://github.com/radumarias/rfs/wiki/Metadata-DB
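To illustrate why CRDTs avoid the coordination latency mentioned above, here is a toy grow-only counter (G-Counter) in Rust. It is only a didactic sketch, not how the chosen DB would implement replication: each replica only increments its own slot, and merging takes the per-node maximum, so replicas converge regardless of message order.

```rust
use std::collections::HashMap;

/// Toy G-Counter CRDT: merge is commutative, associative and idempotent,
/// so replicas can exchange state in any order and still converge.
#[derive(Clone, Default, Debug)]
struct GCounter {
    counts: HashMap<String, u64>, // node id -> local count
}

impl GCounter {
    fn increment(&mut self, node: &str) {
        *self.counts.entry(node.to_string()).or_insert(0) += 1;
    }

    fn merge(&mut self, other: &GCounter) {
        for (node, &count) in &other.counts {
            let entry = self.counts.entry(node.clone()).or_insert(0);
            *entry = (*entry).max(count);
        }
    }

    fn value(&self) -> u64 {
        self.counts.values().sum()
    }
}

fn main() {
    let (mut a, mut b) = (GCounter::default(), GCounter::default());
    a.increment("node-a");
    b.increment("node-b");
    b.increment("node-b");
    a.merge(&b);
    b.merge(&a);
    assert_eq!(a.value(), b.value()); // both replicas converge to 3
}
```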
## File sync
- We will use BitTorrent with the transport layer over QUIC and using zero-copy; the speeds would be incredible, I would imagine. This combination would be perfect, as we have replicated shards, and since one node could read from several others, the BitTorrent protocol makes sense. I didn't find something like this implemented in Rust, so it's a good starting project.
- I find deduplication very interesting and practical; I heard about it in the context of borgbackup (which I actively use to back up my files), which does inter-file deduplication. Given that our service would store many files, deduplication could have a major impact. Of course it would reduce speed, but it could be optional. An interesting idea might be to deduplicate only files which are rarely used, or only some of the replicas.
- Compression is also useful. LZ4 is very widely used nowadays and is quite fast, with a reasonable compression ratio. Alternatives are xz, LZMA, 7z and lzo (very fast, from what I read). A benchmark would help decide between them. (A dedup + compression sketch follows the links below.)
- Some could want the content to be encrypted. For that we're thinking of using rencfs, which could be a good fit and would benefit from the needed enhancements.
- https://github.com/radumarias/rfs/wiki/File-sync-between-data-nodes
- https://github.com/radumarias/rfs/wiki/File-sharding
- https://github.com/radumarias/rfs/wiki/File-upload-and-changes
- https://github.com/radumarias/rfs/wiki/Sync-file-changes
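As a rough sketch of how deduplication and compression could combine, below is a content-addressed block store that hashes blocks with SHA-256 and stores each unique block only once, LZ4-compressed. The `sha2` and `lz4_flex` crates and the `BlockStore` type are assumptions for illustration; the real system would chunk files (possibly with content-defined chunking, as borgbackup does) and persist blocks on data nodes rather than in memory.

```rust
use std::collections::HashMap;

use lz4_flex::{compress_prepend_size, decompress_size_prepended};
use sha2::{Digest, Sha256};

/// Hypothetical block store: blocks are addressed by their SHA-256 hash,
/// so identical blocks across files are stored (and compressed) only once.
#[derive(Default)]
struct BlockStore {
    blocks: HashMap<[u8; 32], Vec<u8>>, // hash -> LZ4-compressed block
}

impl BlockStore {
    /// Stores a block and returns its content hash; duplicates are skipped.
    fn put(&mut self, block: &[u8]) -> [u8; 32] {
        let hash: [u8; 32] = Sha256::digest(block).into();
        self.blocks
            .entry(hash)
            .or_insert_with(|| compress_prepend_size(block));
        hash
    }

    fn get(&self, hash: &[u8; 32]) -> Option<Vec<u8>> {
        self.blocks
            .get(hash)
            .map(|c| decompress_size_prepended(c).expect("valid LZ4 block"))
    }
}

fn main() {
    let mut store = BlockStore::default();
    let h1 = store.put(b"same content");
    let h2 = store.put(b"same content"); // deduplicated: stored once
    assert_eq!(h1, h2);
    assert_eq!(store.get(&h1).unwrap(), b"same content");
}
```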
## Observability
- Will use Grafana solutions and Prometheus for monitoring, logs, tracing and metrics.
## Containerization
AWS EKS would be a good fit for this. I've had only good experiences with k8s and I recommend it.
## Cluster management
- First we will need a CLI to manage the cluster.
- Then we will expose a REST API for more flexible management (a minimal endpoint sketch follows this list). We can use Keycloak for authentication with OpenID Connect and OAuth 2.0 for authorization. An S3-compatible API would be great; a gRPC service would be great too.
- It would be practical to offer a FUSE interface to mount given parts of the cluster directly into the OS and work with them as a filesystem. This would require a desktop app, or at least a daemon. We will expose it over NFS too.
- We could also have a webapp and mobile apps to manage the cluster and access the files.
- Files would be grouped by tenant and user, so we could model something like S3, Google Drive or Dropbox, but also Hadoop-like use cases.
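A minimal sketch of what one endpoint of the management REST API could look like with axum (assuming axum 0.7 with tokio; the route and response shape are hypothetical):

```rust
use axum::{routing::get, Json, Router};
use serde_json::json;

// Hypothetical health/status endpoint for the cluster-management REST API.
async fn status() -> Json<serde_json::Value> {
    Json(json!({ "status": "ok", "nodes": 3 }))
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let app = Router::new().route("/api/v1/status", get(status));
    // Authentication (Keycloak / OIDC) would be added as middleware in front of this.
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await?;
    axum::serve(listener, app).await?;
    Ok(())
}
```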
## Tech stack
We'll build it mostly in Rust, maybe with a bit of Java if really needed for Spark and Flink, and Python for Airflow.
| Scope | Solution |
|---|---|
| REST API, gRPC | axum, tonic |
| WebSocket | tokio-tungstenite |
| Metadata DB | SurrealDB, CockroachDB |
| Config | ZooKeeper, ClickHouse Keeper, KRaft, etcd |
| Browser app | egui, wasi, wasm-bindgen, rencfs |
| Desktop app | egui, mainline, transmission_rs, cratetorrent/rqbit, quinn, rencfs, pgp, fuse3 |
| Local app mobile | Kotlin Multiplatform |
| Sync daemon | tokio, rencfs, mainline, transmission_rs, cratetorrent/rqbit, quinn |
| Use Kafka | rdkafka |
| Keycloak | axum-keycloak-auth (in-app token verification) or Keycloak Gatekeeper (reverse proxy in front of the services) |
| Event bus | Kafka, Pulsar, RabbitMQ |
| Streaming processor | Flink |
| File storage | RAID, ext4 |
| Search and analytics | ELK, Apache Spark, Apache Flink, Apache Airflow |
| Identity provider | Keycloak |
| Cache | Redis |
| Deploy | Amazon EKS |
| Metrics | Prometheus and Grafana Mimir |
| Tracing | Prometheus and Grafana Tempo |
| Logs | Grafana Loki |