Goals - dhobsd/lacquer GitHub Wiki
Lacquer Project Goals
The project aims to create a modern, scalable server framework. Its high-level goals are:
- Collaboration
- Scalability
- Reliability
- Security
- Extensibility
- Visibility
Contributors to the project have years or decades of experience working with both proprietary and open-source codebases. We recognize that the state of the art in network server software is not reflected in the most popular open-source projects. Although multicore and manycore systems are nearly ubiquitous, most servers still use threading models and synchronization strategies that inherently limit throughput and induce worst-case latency under high load. Multitenancy, when supported by software, comes at the cost of security: cross-user data is often only logically separated. These systems are often difficult to extend or adapt to new use cases because of rigid architectures, unbounded execution time of tenant logic, conflation of configuration and logic, and a lack of visibility into what the system is doing.
The design goals specified here are intended to be high-level. As the primary goal is to provide a scalable server framework, initial work won't be targeting any protocol in particular. As such, the software should work as a starting point for implementing numerous protocols.
Collaboration
As an open-source project, we accept suggestions, feedback, and criticism with grace. We are supportive of each other and recognize that no single contributor has a monopoly on information about the totality of the system. Contributions are welcomed based on their adherence to the goals listed in this document; contributors strive to educate and share knowledge with others to better the system. Conduct and communication with other contributors, administrators, and tenants of the system is cordial. We do not tolerate discrimination based on age, nationality, sexual orientation, identity of self, hair color, or anything else you might find in a more completionist code of conduct. In some cases, language barriers may impede successful communication. This is an unfortunate form of discrimination that is difficult to solve; hopefully as the project gains momentum, we can attract folks who can help in this regard.
Scalability
Scalability, not performance, is the primary goal of Lacquer. The system should scale predictably with increasing load. We contend that a system designed to be scalable will also necessarily perform well in the average case.
Performance is often confused with scalability. The goal of a scalable system is to be predictable. Predictability is of paramount importance in modern systems software. When our systems are predictable, we can confidently model the impact of changes to the system under various workloads.
Premature optimization is not a goal. The design criteria listed below come from experience with and documentation of existing systems. While rigid, these design principles result in predictability and scalability, not necessarily in higher performance. When making performance-related changes, profiling data are required.
Avoid Shared Mutable State
Whenever possible, avoid sharing memory. Shared-nothing systems are friendlier for modern architectures; atomic operations place a relatively high burden on CPUs and memory buses. A shared-nothing approach allows processors to make better use of caches and requires no heavyweight synchronization. Be careful here: because extensibility is a goal of the system, it's possible that users will devise creative ways to mutate data assumed to be read-only. Similarly, tasks running on the system should have relatively strict processor affinity. In the path of a request, we should avoid ping-ponging between different processors (and certainly between different sockets).
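As a rough illustration of the shared-nothing idea, the sketch below gives each worker a private table and routes every connection to one worker by a stable hash, so a connection's state is only ever touched by its owner. The names `Worker`, `pick_worker`, and `conn-a` are hypothetical and not part of Lacquer; real processor affinity (pinning a worker to a core) is not shown.

```python
import zlib


class Worker:
    """Owns its state outright; nothing here is shared with other workers."""

    def __init__(self, worker_id):
        self.worker_id = worker_id
        self.sessions = {}  # private to this worker

    def handle(self, conn_id, payload):
        # Only the owning worker ever mutates state for conn_id.
        count = self.sessions.get(conn_id, 0) + 1
        self.sessions[conn_id] = count
        return count


def pick_worker(workers, conn_id):
    # Use a stable hash so a connection always lands on the same worker.
    # (Python's built-in hash() is randomized per process for strings.)
    return workers[zlib.crc32(conn_id.encode()) % len(workers)]


workers = [Worker(i) for i in range(4)]
for _ in range(3):
    pick_worker(workers, "conn-a").handle("conn-a", b"ping")
```

Because routing is deterministic, all three requests land on the same worker and no other worker ever sees `conn-a`.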
Avoid Lock-Based Synchronization
Mutual exclusion locking strategies lead to queueing behaviors that necessarily increase latency (thereby reducing throughput) under high contention. Highly contended resources (counters, objects, connections, log production, etc.) should be managed using non-blocking synchronization (e.g. lock-free queues). When mutable state must be shared, prefer mutating that state from a single concurrent actor. Multiple-producer systems are notoriously difficult to manage.
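The single-actor preference can be sketched as follows: producers only enqueue messages, and one thread owns and mutates the state. `CounterActor` is a hypothetical name. Note the loud caveat that Python's `queue.Queue` is itself lock-based internally, so this shows only the ownership pattern; a real implementation would put a lock-free queue in its place.

```python
import queue
import threading


class CounterActor:
    """All mutations of `totals` happen on one thread: the actor's own."""

    def __init__(self):
        self._mailbox = queue.Queue()  # stand-in for a lock-free queue
        self.totals = {}
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            msg = self._mailbox.get()
            if msg is None:  # shutdown sentinel
                break
            name, delta = msg
            self.totals[name] = self.totals.get(name, 0) + delta

    def add(self, name, delta=1):
        # Producers only enqueue; they never touch `totals` directly.
        self._mailbox.put((name, delta))

    def stop(self):
        self._mailbox.put(None)
        self._thread.join()


def produce(actor, n):
    for _ in range(n):
        actor.add("requests")


actor = CounterActor()
producers = [threading.Thread(target=produce, args=(actor, 1000)) for _ in range(4)]
for p in producers:
    p.start()
for p in producers:
    p.join()
actor.stop()
```

Since only the actor thread writes to `totals`, the increments need no lock around the read-modify-write.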
Data Locality and Layout
Modern architectures incur heavy penalties when they need to access data in remote memory. When mutating data, attempt to make sure that mutation occurs on a CPU or socket local to those data. Avoid indirection between data structures that are frequently accessed together.
Similarly, the layout of data in memory can have an impact on performance and scalability. Keep related data close in compound data structures. In doing so, also avoid false sharing.
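To make the false-sharing point concrete, the sketch below pads a per-worker counter out to a full cache line so that adjacent counters in an array never share a line. It uses `ctypes` only to show the layout; the 64-byte line size and the name `PaddedCounter` are assumptions, and `ctypes` does not guarantee that the array itself starts on a cache-line boundary.

```python
import ctypes

CACHE_LINE = 64  # assumption: common on current x86-64 parts; verify per target


class PaddedCounter(ctypes.Structure):
    # The hot field comes first; the padding fills the rest of the line.
    _fields_ = [
        ("value", ctypes.c_uint64),
        ("_pad", ctypes.c_char * (CACHE_LINE - ctypes.sizeof(ctypes.c_uint64))),
    ]


# One counter per worker, laid out contiguously.
counters = (PaddedCounter * 4)()
stride = ctypes.addressof(counters[1]) - ctypes.addressof(counters[0])
```

With this layout, consecutive counters sit exactly one cache line apart, so two workers updating neighboring counters do not invalidate each other's lines.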
Asynchronous I/O
When I/O operations would block, it makes sense to deschedule the blocking task such that another can make progress. We should not rely on OS preemptive scheduling to do this; context switches are unpredictable both in when they occur and, in some cases, in how long they take.
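A minimal sketch of cooperative descheduling, using Python's `asyncio` purely as an illustration (Lacquer's own mechanism is not specified here): the task that would block yields at a known point, and another task runs in the meantime. `asyncio.sleep` stands in for a read that would block; the task names are hypothetical.

```python
import asyncio

events = []


async def slow_io(name, delay):
    events.append(f"{name}: waiting on I/O")
    await asyncio.sleep(delay)  # stands in for a read that would block
    events.append(f"{name}: I/O complete")


async def other_task(name):
    events.append(f"{name}: ran while others waited")


async def main():
    # Task "a" yields at its await; "b" makes progress before "a" resumes.
    await asyncio.gather(slow_io("a", 0.05), other_task("b"))


asyncio.run(main())
```

The switch happens only at the explicit `await`, which is what makes the scheduling points predictable.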
Real-time Guarantees
When executing logic on behalf of a tenant of the system, we must place upper bounds on the amount of time that code can execute to maintain predictability. We leave it up to the administrator(s) of the system to determine what an acceptable upper bound may be.
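One way to picture an administrator-set bound is sketched below, assuming tenant logic is expressed as a generator that yields between steps. The runner checks a deadline after each step and stops the logic once the budget is spent. `run_bounded` and `BudgetExceeded` are hypothetical names. This only bounds cooperative code: a single step that never yields would still run unbounded, so a real system would need preemption or an instrumented interpreter.

```python
import time


class BudgetExceeded(Exception):
    pass


def run_bounded(steps, budget_seconds):
    """Drive a generator of tenant steps, checking the deadline between steps."""
    deadline = time.monotonic() + budget_seconds
    result = None
    for result in steps:
        if time.monotonic() > deadline:
            steps.close()  # stop the tenant logic where it stands
            raise BudgetExceeded(f"tenant logic exceeded {budget_seconds}s")
    return result


def well_behaved():
    total = 0
    for i in range(10):
        total += i
        yield total


def runaway():
    while True:
        yield
```

`run_bounded(well_behaved(), 1.0)` returns the last yielded value, while `run_bounded(runaway(), 0.01)` raises `BudgetExceeded` once the deadline passes.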
Horizontal Scalability
In high-traffic environments, it's unreasonable to think any single machine can handle the total load. Lacquer should be able to function in local and global clusters such that availability, data redundancy, and scalability can be provided.
Reliability
Redundancy
Because Lacquer should support clustered operation, redundancy of data and service should be supported. It should be possible to replicate configuration changes between disparate instances of the software.
Zero-Downtime
It should be possible to upgrade the software without service interruption.
Testing
Because Lacquer is intended to be modular, APIs should be testable and tested. Comprehensive unit tests should be written for all APIs. End-to-end integration tests should validate correct behavior of the system in edge cases. Fuzz tests and property-based testing should be regularly performed both for security and correctness purposes.
Security
Multitenant software systems often go no further than a logical separation of data between tenants of the system. One of the more unfortunate consequences of this property is that bugs and exploits often yield access to potentially sensitive data belonging to another tenant of the system. Ideally, our design will also provide some security between outstanding requests in the system. While it may not be possible to protect against all data-sharing bugs (for example, leaking data between disparate requests to a single tenant as a result of a bug), a security- and privacy-first approach to systems design will go a long way towards mitigating these risks.
Network Security
All network connections should use TLS by default.
Data Security
Logical separation of data (provided by e.g. a large cryptographic hash space) is not enough to protect data between requests and between tenants of the system. Each tenant of the system should be confident that their data are protected by avoiding sharing of address space whenever possible.
Extensibility
Configuration
It's sometimes tempting to conflate logic with configuration, especially since the line between the two is frequently blurred. Configuration through command line flags is rigid and leads to verbose and sometimes ambiguous invocations of software. (For example, in some cases, a flag being specified twice means that the latter version takes effect; in other cases, the semantics are accumulative.) Configuration includes information about what system resources may be used (CPU / RAM limits, storage devices, network addresses and ports, protocols, etc.). This information is also sometimes useful to tenant logic. Where appropriate, this configuration information should be made available to tenant logic.
Most (if not all) configuration parameters should support run-time modification. It is acceptable for configuration options to have some delay to take effect. Such changes should be persisted back to the configuration file.
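Persisting a run-time change back to the configuration file might look like the sketch below. It assumes a JSON file purely for illustration (the actual format is not specified) and a hypothetical `set_option` helper. The new contents are written to a temporary file in the same directory and then renamed over the original, so a reader never sees a half-written file.

```python
import json
import os
import tempfile


def set_option(path, key, value):
    """Apply a run-time change and persist it by atomically replacing the file."""
    with open(path) as f:
        config = json.load(f)
    config[key] = value

    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(config, f, indent=2)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes reach the disk first
        os.replace(tmp, path)  # atomic rename on POSIX filesystems
    except BaseException:
        os.unlink(tmp)  # clean up the temporary file on any failure
        raise
    return config
```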
Modular Design
The system should be modular in terms of source organization and behavior. It should be possible to extend the system at runtime through loading new modules, and modules should support upgrading at runtime as well.
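Loading and upgrading modules at runtime can be sketched with Python's `importlib`, as below. The `Registry` class and the convention that every module exposes a `handle` function are assumptions made for this example. The registry swaps in the new version only after it has loaded successfully, so a broken upgrade leaves the old module in place.

```python
import importlib.util


def load_module(name, path):
    """Load a module from a file path and return the module object."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # raises if the module fails to load
    return module


class Registry:
    def __init__(self):
        self._modules = {}

    def install(self, name, path):
        # Swap the reference only after the new version loads successfully.
        self._modules[name] = load_module(name, path)

    def call(self, name, *args):
        # Assumed convention: each module exposes a `handle` function.
        return self._modules[name].handle(*args)
```

Installing a second file under the same name acts as an upgrade: subsequent calls to `call` reach the new `handle`.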
Tenant-Specified Logic
Protocols should expose hooks into their state machines such that tenants may specify logic to mutate, build upon, and change the flow of the state at opportune times.
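A toy version of such hooks is sketched below. The states, the `Connection` class, and the hook signature are all invented for illustration and do not describe any protocol Lacquer implements. Hooks registered on a state run as the machine enters it; a hook may mutate the shared context, and may return a state name to redirect the flow.

```python
from collections import defaultdict


class Connection:
    """Toy state machine: accept -> request -> response -> closed."""

    TRANSITIONS = {"accept": "request", "request": "response", "response": "closed"}

    def __init__(self):
        self.state = "accept"
        self.context = {}
        self._hooks = defaultdict(list)

    def on(self, state, hook):
        self._hooks[state].append(hook)

    def step(self):
        nxt = self.TRANSITIONS[self.state]
        for hook in self._hooks[nxt]:
            override = hook(self.context)
            if override is not None:
                nxt = override  # tenant logic redirected the flow
                break
        self.state = nxt
        return nxt


def tag_request(ctx):
    ctx["x-tenant"] = "example"  # mutate state, keep the normal flow


def deny_blocked(ctx):
    if ctx.get("blocked"):
        return "closed"  # skip the response state entirely


conn = Connection()
conn.on("request", tag_request)
conn.on("response", deny_blocked)
conn.context["blocked"] = True
states = [conn.step(), conn.step()]
```

Here the first hook only annotates the request, while the second short-circuits the machine from `request` straight to `closed`.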
Visibility
Documentation
Too often, documentation is an afterthought. Source-level documentation is encouraged, but this is often not useful to administrators of the software, or tenants of an installation. All aspects of the system must be documented for the benefit of contributors, administrators, and tenants.
Debugging
Introspection into the system is necessary to make it easy to debug. Source-level documentation, tracing features, and debugging tools are all within scope of design.
Metrics
If a thing can be counted, it should be.
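Counting everything need not conflict with avoiding shared mutable state. A minimal sketch, assuming each worker keeps its own private counters that are summed only when someone asks for a snapshot (`WorkerMetrics` and `snapshot` are hypothetical names):

```python
from collections import Counter


class WorkerMetrics:
    """Counters private to one worker; incrementing needs no synchronization."""

    def __init__(self):
        self.counts = Counter()

    def incr(self, name, n=1):
        self.counts[name] += n


def snapshot(all_metrics):
    """Aggregate per-worker counters at read time."""
    total = Counter()
    for m in all_metrics:
        total.update(m.counts)
    return total


per_worker = [WorkerMetrics() for _ in range(3)]
per_worker[0].incr("requests", 5)
per_worker[1].incr("requests", 7)
per_worker[2].incr("errors")
totals = snapshot(per_worker)
```

The cost of aggregation is paid by the reader of the metrics rather than by the request path.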
Logs
Logs are important for understanding how software operates, but are frequently a bottleneck. Short-term and long-term logs should be provided, but logging should never directly interfere with the progress through a protocol's state machine.
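One possible way to keep logging off the critical path is sketched below: the request path hands records to a bounded buffer and never waits, and a separate writer drains it later. Dropping records when the buffer is full, and counting the drops, is an assumption of this sketch rather than a stated Lacquer policy. `NonBlockingLog` is a hypothetical name, and the drop counter assumes one instance per worker.

```python
import queue


class NonBlockingLog:
    def __init__(self, capacity):
        self._q = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def emit(self, record):
        """Never blocks the caller: a full buffer drops the record and counts it."""
        try:
            self._q.put_nowait(record)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self):
        """Called by a separate writer, off the request path."""
        out = []
        while True:
            try:
                out.append(self._q.get_nowait())
            except queue.Empty:
                return out


log = NonBlockingLog(capacity=2)
results = [log.emit("a"), log.emit("b"), log.emit("c")]
```

The third `emit` finds the buffer full, so it returns `False` immediately instead of stalling the state machine.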