PDP 25 (Read Only SegmentStore) - derekm/pravega GitHub Wiki
Status: Under discussion
Related issues: Issue 1924, Issue 1917
Summary
This proposal discusses a solution that enables users to access data in Tier 2 directly, in large batches, without interfering with the Segment Store instances that are used for ingesting data. This may be used in a number of scenarios:
- Large amounts of reads are needed for various batch operations for processes that do not care about the latest data, but for which throughput is very important.
- Reading via the Segment Store's APIs would consume shared resources, such as network bandwidth or threads, which would negatively affect the performance of other operations supported by the Segment Store.
- Active-passive replication: data are ingested in a data center, but replicated (via Tier 2) into a different data center, where it needs to be consumed by various processes.
We therefore propose creating a slimmed down version of the Segment Store that processes Tier 2 reads only. We'll call this the Read-Only Segment Store (RSS) for the remainder of this document. The RSS will be a separate instance of the Segment Store (running in a different process) which can be used for read operations alone. No ingests or segment manipulations will be allowed through the RSS. Since the RSS is running separately from the Segment Store used for ingests, it will not affect ingestion performance (latency, throughput, etc.), regardless of how heavy the read operations will be.
API Changes
- Client
- New API (TBD) that exposes direct reads from Tier 2, without providing ordering between Segments.
- Refer to PR 1998 for a proposed implementation.
- A way to specify (config object or new API) that we want to connect to a RSS or normal Segment Store.
- New API (TBD) that exposes direct reads from Tier 2, without providing ordering between Segments.
- Wire Protocol
- We are planning to use the same transport as before
- We may need to add another Reply, which indicates that a particular operation is not supported (since the RSS will not support any modify operations (Create/Delete/append/seal/merge/etc).
Internal Changes
- The RSS is the same as the Segment Store (starts the same way, exposes the same StreamSegmentStore interface, but it will only implement
readandgetStreamSegmentInfo. All other operations will throwUnsupportedOperationException. - There will be no Segment Containers in this implementation. The RSS is a stateless, leaderless service which is able to process read requests for any Segment so there is no longer a need to map Segments to Containers for RSSs (this will still be necessary for the Segment Store instances used for data ingestion though, so nothing changes there).
- Service resolution
- The RSSs can register themselves in a ZK instance (or in some local Controller), and the Client can access that information in order to send the requests in a round-robin fashion to them. They can create an ephemeral node which is deleted when the RSS instance loses connectivity to ZK.
- This can scale horizontally by simply adding more instances of the RSS.
- For V1, though, we can skip this registration and have the Client take in the address and port of the RSS directly.
- The
readandgetStreamSegmentInforequests will be sent directly to Tier2 using the configured Tier2 adapter. These requests will need to fetch the contents of the Segment State file in order to determine attributes (forgetStreamSegmentInfo) or truncation offsets (for both operations). - There will be no "tail reads" supported here. Requests are either fulfilled using Tier2 data or rejected.
Compatibility and Migration
N/A
Discarded Approaches
- Implement Tier 2 access directly into Client.
- Pros:
- No need for yet another service that needs to be deployed and managed.
- No extra hop when reading may produce better performance.
- Cons:
- Upgrade difficulties. If a change to the Tier 2 adapter/layout is done Server-side, there may be clients out there that haven't been updated yet (and they may be hard to update due to other factors). This poses difficulties when the Tier2 file layout changes or when there are format changes being introduced.
- Thickening the client. This will have the client pull in a lot of extra dependencies (such as all supported Tier2 adapters) making it pretty large.
- Pros:
References
N/A