PDP 53 (Robust Garbage Collection for SLTS)

Status: Draft.

Summary

This PDP discusses a change to our SLTS garbage collection implementation that would make it more robust and comprehensively cover additional failure modes without adding complexity.

Background

SLTS manages chunks to create the abstraction of a segment that can be appended to at the tail end and truncated at the front end.

When does SLTS delete chunks?

  • Delete all chunks in a segment after the segment is deleted
  • Delete chunks after a segment is truncated
  • Delete chunks after a segment is de-fragmented
  • Delete newly added chunks after a failure to persist a transaction
  • Delete journal chunks after a snapshot

Why does SLTS need delayed deletion of chunks?

SLTS needs delayed deletion of chunks for the following reasons:

  • To prevent a zombie SS instance from making destructive deletes that can potentially affect normal operations.
  • SLTS allows multiple parallel reads while a segment is being truncated or merged. Therefore it is possible that read requests are in flight while chunks are deleted by truncate. Delayed deletion avoids the need to serialize such requests.
  • Background GC allows SLTS to prioritize user requests over maintenance tasks like deleting chunks.

Shortcomings of the current implementation

As explained above, instead of deleting chunks immediately we must add them to a GC queue.

How current implementation works

  • The current beta implementation uses an in-memory priority queue to maintain the list of chunks to be deleted.
  • During an SLTS operation, only the metadata for the chunk is marked for deletion and a work item is added to the GC queue.
  • A background thread pulls work items from this queue and deletes chunks if the corresponding chunk metadata is marked for deletion.
  • The actual deletion task happens on a storage thread. (A sketch of this flow follows below.)
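For illustration only, here is a minimal sketch of the kind of in-memory, delay-based queue and background loop described above; all class, field, and method names are hypothetical and are not the actual SLTS types.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the current in-memory approach (not actual Pravega code).
class InMemoryGcSketch {
    // A work item: chunk name plus the earliest time at which it may be deleted.
    static class GcItem implements Delayed {
        final String chunkName;
        final long eligibleAtMillis;

        GcItem(String chunkName, long delayMillis) {
            this.chunkName = chunkName;
            this.eligibleAtMillis = System.currentTimeMillis() + delayMillis;
        }

        @Override
        public long getDelay(TimeUnit unit) {
            return unit.convert(eligibleAtMillis - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.MILLISECONDS), other.getDelay(TimeUnit.MILLISECONDS));
        }
    }

    // The queue lives only in process memory, so its contents are lost on a Segment Store restart.
    private final DelayQueue<GcItem> queue = new DelayQueue<>();

    void enqueue(String chunkName, long delayMillis) {
        queue.put(new GcItem(chunkName, delayMillis));
    }

    void backgroundLoop() throws InterruptedException {
        while (true) {
            GcItem item = queue.take(); // blocks until the head item's delay expires
            // Delete the chunk only if its metadata is still marked for deletion (omitted here).
        }
    }
}
```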

Limitations

The current garbage collection was implemented incrementally and only partially, and does not cover all failure modes. More specifically, the implementation does not handle the following failure modes and therefore permanently creates/leaves garbage behind.

  • The list of garbage chunks is maintained only in memory, so after an SS restart this list is lost. (A persistent queue is planned but not implemented yet.)
  • The in-memory queue is throttled and will not accept new GC tasks once the max queue size is reached. This is a potential source of garbage leaks.
  • Chunks that have been created on LTS but not yet added to the metadata are leaked when a failure occurs.

In addition to these design problems, the current implementation consumes memory by keeping the entire queue of work items in memory.

Implementing solutions for all the unhandled failure modes above is already planned; however, during design discussions it became obvious that extending the existing design would result in multiple ad-hoc solutions, one for each case.

Root Cause

The main cause of all these problems is twofold:

  1. The GC work item queue is not persisted.
  2. We don't add chunks to the GC queue until the transaction is committed. This means there is a gap between when chunks become garbage and when they are added to the GC queue. Any failure during that gap results in a garbage leak.

Proposal

Below is the summary of this proposal.

Add a persistent GC task queue using internal Event Streams

  • A task stream is created for each container using Segment Container Event Streams.
    • For each container, the stream name is fixed to be GC_queue_<container_id>
    • As explained in PDP 51, the tasks queued in the stream are persisted via the durable log.
  • Types of GC tasks: four types of work items are added to the queue
    • DeleteSegmentTask: deletes all chunks in a given segment.
    • AddNewChunk: conditionally deletes a single chunk in case of failure.
    • DeleteChunkListTask: deletes a linked sub-list of chunks. The task includes the names of the first and last garbage chunks in the list.
    • DeleteJournalChunkListTask: deletes a sub-list of journal chunks.
  • Each task has a start time associated with it, after which it is safe to delete the chunk. A task is not executed until it becomes eligible.
    • All chunks are delayed by the same fixed amount of time.
  • Each task includes the transaction id of the SLTS transaction that created it. This id is used to check whether the transaction has successfully committed or not. (See the sketch after this list.)
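To make the task envelope concrete, here is a minimal sketch of a GC task carrying its start time and originating transaction id, together with the per-container stream names; the class, field, and method names are assumptions made for illustration, not the actual implementation.

```java
import java.time.Duration;
import java.util.UUID;

// Hypothetical sketch: the common envelope every GC task would carry.
abstract class GCTask {
    final UUID transactionId;     // SLTS transaction that created this task
    final long startTimeMillis;   // earliest time at which the task may be executed
    int attempts;                 // how many times the GC has re-enqueued this task

    GCTask(UUID transactionId, Duration gcDelay) {
        this.transactionId = transactionId;
        // All chunks are delayed by the same fixed amount of time.
        this.startTimeMillis = System.currentTimeMillis() + gcDelay.toMillis();
    }

    boolean isEligible(long nowMillis) {
        return nowMillis >= startTimeMillis;
    }
}

// One subclass per task type listed above; payload fields shown for one type only.
class DeleteChunkListTask extends GCTask {
    final String firstChunkName;  // first garbage chunk in the linked sub-list
    final String lastChunkName;   // last garbage chunk in the linked sub-list

    DeleteChunkListTask(UUID txnId, Duration gcDelay, String firstChunkName, String lastChunkName) {
        super(txnId, gcDelay);
        this.firstChunkName = firstChunkName;
        this.lastChunkName = lastChunkName;
    }
}

class TaskStreamNames {
    // Per-container task stream, as described above.
    static String gcQueueName(int containerId) {
        return "GC_queue_" + containerId;
    }

    // Dead-letter queue used when a task exhausts its retries (see the next section).
    static String gcFailedQueueName(int containerId) {
        return "GC_failed_queue_" + containerId;
    }
}
```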

Change current GC design to read from the persistent GC queue

  • A GC background thread pulls GC tasks from the queue.
    • The GC waits until the task at the head of the queue becomes eligible.
    • If the transaction that created the GC task is still in progress, the task is re-enqueued and the next task is processed.
  • In case of any failure
    • The GC correctly handles the case where the chunk or its metadata is already deleted.
    • The delete operation is performed in a retry loop a fixed number of times. After that, the task is re-enqueued, again a fixed number of times.
    • The GC task is attempted a fixed number of times, after which the task is added to a separate "dead-letter queue" named GC_failed_queue_<container_id>. It is expected that in the future an admin tool will read this dead-letter queue. (A sketch of this processing loop follows below.)
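The processing loop described above might look roughly like the following sketch, reusing the GCTask envelope sketched earlier; the queue and transaction-status interfaces and the retry limits are placeholders invented for illustration, and the real GarbageCollector internals will differ.

```java
import java.util.UUID;

// Placeholder interfaces invented for this sketch.
interface TaskQueue {
    GCTask peek();
    GCTask dequeue();
    void enqueue(GCTask task);           // main GC queue
    void enqueueDeadLetter(GCTask task); // GC_failed_queue_<container_id>
}

interface TxnStatus {
    boolean isInProgress(UUID transactionId);
}

class GcLoopSketch {
    private static final int MAX_ATTEMPTS_PER_PASS = 3; // illustrative values only
    private static final int MAX_RE_ENQUEUES = 5;

    private final TaskQueue queue;
    private final TxnStatus txnStatus;

    GcLoopSketch(TaskQueue queue, TxnStatus txnStatus) {
        this.queue = queue;
        this.txnStatus = txnStatus;
    }

    void processOne() throws InterruptedException {
        GCTask task = queue.peek();
        if (task == null) {
            return; // nothing to do
        }
        // Wait until the task at the head of the queue becomes eligible.
        long wait = task.startTimeMillis - System.currentTimeMillis();
        if (wait > 0) {
            Thread.sleep(wait);
        }
        task = queue.dequeue();

        // If the creating transaction is still in progress, re-enqueue and move on.
        if (txnStatus.isInProgress(task.transactionId)) {
            queue.enqueue(task);
            return;
        }

        // Retry the deletion a fixed number of times; "already deleted" counts as success.
        for (int attempt = 1; attempt <= MAX_ATTEMPTS_PER_PASS; attempt++) {
            if (tryExecute(task)) {
                return;
            }
        }

        // Re-enqueue a bounded number of times, then move the task to the dead-letter queue.
        if (task.attempts < MAX_RE_ENQUEUES) {
            task.attempts++;
            queue.enqueue(task);
        } else {
            queue.enqueueDeadLetter(task);
        }
    }

    private boolean tryExecute(GCTask task) {
        // Delete chunks only if their metadata is still marked for deletion (omitted here).
        return true;
    }
}
```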

Change SLTS operations to optimistically enqueue tasks for GC

  • The main idea here is that most SLTS operations succeed, therefore we optimistically add all to-be-deleted chunks to the GC queue before the transaction is committed.

    • The SLTS operations continue to mark metadata for deletion exactly the same way as before.
    • If the transaction fails to commit, the metadata is not changed and therefore the chunks are not marked for deletion.
  • The table below shows how this works for various operations; a sketch of the optimistic enqueue pattern follows the table.

| Operation | New Behavior | On success | On failure |
| --- | --- | --- | --- |
| Write | This operation adds new chunks; the newly added chunks are also added to the GC queue immediately after creation. | After commit, the metadata for the newly added chunks is marked as active. Therefore, when the GC tasks for the newly added chunks are processed, the chunks will not be deleted. | There will not be any metadata for the given chunk when its GC task is processed, therefore the chunk is treated as garbage and deleted. |
| Truncate | This operation deletes unneeded chunks. The chunks to be deleted are added to the GC queue before the metadata transaction is committed. | After commit, the change marking the chunks as deleted is committed. Therefore, when the GC task is processed, it will delete all the chunks marked for deletion in that commit. | The metadata for the chunks in the GC task is still marked active, therefore the chunks are not deleted. |
| Concat | This operation potentially replaces many small chunks with a few big chunks. It also deletes unneeded chunks when the underlying ChunkStorage does not natively support cleanup after merge. The list of chunks to be deleted and the list of new chunks are added to the GC queue before the metadata transaction is committed. | After commit, all deleted chunks are marked inactive and all new chunks are marked active. Therefore, when the GC tasks are processed, they will delete the inactive chunks and will not delete the active chunks. | There will not be any metadata for the new chunks while the old chunks are still active, therefore the old chunks are not deleted and the newly added chunks are deleted. |
| Delete | Instead of enumerating all chunks and adding them individually, a DeleteSegmentTask is added to the GC queue. The segment metadata is marked as inactive. | During GC processing, all chunks are enumerated and deleted one by one. | The metadata for the segment is unchanged, hence the segment is still active in the metadata, therefore no chunk will be deleted. |
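As an illustration of the optimistic enqueue pattern on the write path, the sketch below enqueues the GC task for a newly created chunk before the metadata transaction commits; the interfaces and method names here are hypothetical, not the actual SLTS APIs.

```java
// Hypothetical sketch of the optimistic enqueue pattern on the write path.
class OptimisticWriteSketch {
    interface MetadataTxn {
        void markChunkActive(String chunkName);
        void commit() throws Exception;
        java.util.UUID getId();
    }

    interface GcQueue {
        void addNewChunkTask(java.util.UUID txnId, String chunkName);
    }

    void writeNewChunk(MetadataTxn txn, GcQueue gcQueue, String newChunkName) throws Exception {
        // 1. Create the chunk on LTS (omitted).
        // 2. Optimistically enqueue a GC task for the new chunk BEFORE committing metadata.
        gcQueue.addNewChunkTask(txn.getId(), newChunkName);
        // 3. Record the chunk in the metadata transaction and commit.
        txn.markChunkActive(newChunkName);
        txn.commit();
        // If the commit succeeds, the GC task later finds active metadata and does nothing.
        // If the commit fails (or the process crashes), there is no metadata for the chunk,
        // so the GC task treats it as garbage and deletes it.
    }
}
```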

Minimize or limit the number of metadata records changed in a single operation

  • SLTS operations are implemented so that the number of metadata records updated in a single transaction is minimized.
    • For the delete operation, instead of deleting all chunks and index entries at once, only the segment metadata is marked as deleted in the foreground task. The actual deletion of chunk data, metadata and affected index entries is performed by the background garbage collection task.
    • For the truncate operation, instead of deleting all affected chunks and index entries at once, only the segment metadata is modified and a background task for the segment is added in the foreground task. The actual deletion of chunk metadata and index entries is performed by the background garbage collection task.
    • For the defragment operation, instead of deleting all affected chunks and index entries at once, only the segment metadata is modified and a background task for the segment is added in the foreground task. The actual deletion of chunk metadata and index entries is performed by the background garbage collection task.
  • During GC processing we generally operate on a small batch of chunks or metadata records at a time, hence we modify only a small number of metadata records in any transaction. (A sketch of this batching follows below.)
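The batching idea can be sketched as follows: during GC, a DeleteSegmentTask enumerates chunks in small batches and issues one small metadata transaction per batch. The iterator and store interfaces are placeholders for illustration only.

```java
import java.util.List;

// Hypothetical sketch of batched deletion for a DeleteSegmentTask.
class BatchedDeleteSketch {
    interface ChunkIterator {
        // Returns up to batchSize chunk names, or an empty list when enumeration is done.
        List<String> nextBatch(int batchSize);
    }

    interface ChunkStore {
        void deleteChunk(String chunkName);                 // IO-oriented call
        void removeChunkMetadata(List<String> chunkNames);  // one small metadata transaction
    }

    void deleteSegmentChunks(ChunkIterator iterator, ChunkStore store, int batchSize) {
        List<String> batch;
        while (!(batch = iterator.nextBatch(batchSize)).isEmpty()) {
            for (String chunkName : batch) {
                store.deleteChunk(chunkName);
            }
            // Only a small number of metadata records are modified in each transaction.
            store.removeChunkMetadata(batch);
        }
    }
}
```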

More fine-grained throttling

  • SLTS will now throttle two types of operations separately (see the sketch below):
    • IO-oriented calls that delete chunks
    • Metadata-oriented calls that update or enumerate metadata
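One simple way to picture the split throttling is two independent permit pools, one for IO calls and one for metadata calls. The semaphore-based sketch below is only illustrative; the limits correspond to the new config options listed later.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical sketch: separate throttles for IO and metadata work inside the GC.
class GcThrottleSketch {
    private final Semaphore ioPermits;        // e.g. garbage.collection.concurrency.io.max
    private final Semaphore metadataPermits;  // e.g. garbage.collection.concurrency.metadata.max

    GcThrottleSketch(int maxConcurrentIo, int maxConcurrentMetadata) {
        this.ioPermits = new Semaphore(maxConcurrentIo);
        this.metadataPermits = new Semaphore(maxConcurrentMetadata);
    }

    <T> T runIo(Supplier<T> deleteChunkCall) throws InterruptedException {
        ioPermits.acquire();
        try {
            return deleteChunkCall.get();
        } finally {
            ioPermits.release();
        }
    }

    <T> T runMetadata(Supplier<T> metadataCall) throws InterruptedException {
        metadataPermits.acquire();
        try {
            return metadataCall.get();
        } finally {
            metadataPermits.release();
        }
    }
}
```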

Out of scope

This PDP does not change the following:

  • The garbage collector uses no more than a fixed percentage of storage threads at any time.

Key Design Changes

Public API Changes

None

Public Config Changes

  • remove garbage.collection.concurrency.max
  • add garbage.collection.concurrency.batch.size - Number of tasks processed in a single batch.
  • add garbage.collection.concurrency.io.max - Max number of concurrent IO operations executed by GC at any time.
  • add garbage.collection.concurrency.metadata.max - Max number of concurrent metadata operations executed by GC at any time.
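For example, the new options could be supplied roughly as follows; the property names are taken from the list above, while the values are purely illustrative and not proposed defaults.

```java
import java.util.Properties;

// Illustrative only: example values for the new GC options (not actual defaults).
class GcConfigExample {
    static Properties example() {
        Properties p = new Properties();
        p.setProperty("garbage.collection.concurrency.batch.size", "100");
        p.setProperty("garbage.collection.concurrency.io.max", "64");
        p.setProperty("garbage.collection.concurrency.metadata.max", "16");
        return p;
    }
}
```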

Internal API changes

Replace the current GarbageCollector::addToGarbage method with a more strongly typed method (sketched below):

  • void addTask(GCTask task)
  • The task is one of
    • DeleteSegmentTask
    • TrackNewChunk
    • DeleteChunkListTask
    • DeleteJournalChunkListTask
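The API change might look roughly like this; the old addToGarbage shape shown in the comment is an assumption made for illustration, and the new method simply reuses the GCTask envelope sketched earlier.

```java
// Sketch of the internal API change (illustrative only).
class GarbageCollectorApiSketch {
    // Old style (roughly): callers passed bare chunk names.
    // void addToGarbage(Collection<String> chunkNames);

    // New style: callers construct one of the typed tasks listed above and hand it to
    // the collector, which serializes it and appends it to the per-container event stream.
    void addTask(GCTask task) {
        // serialize and append to GC_queue_<container_id> (omitted)
    }
}
```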

Internal functionality change

The changes are limited to the internals of the GarbageCollector class:

  • Use the event stream as described above.
  • All SLTS operations are changed to add chunks/segments to the GC queue in an optimistic manner.

Observability

Metrics

  • Size of GC queue.
  • Number of tasks enqueued (perhaps broken down by task type).
  • Number of chunks deleted.
  • Number of tasks skipped. (no-op)
  • Number of tasks re-enqueued.
  • Number of tasks failed.
  • Number of times task was attempted.
  • Time to complete task.

Health Checks

(Exact details to be decided later as this depends on Health check PDP)

Key Concerns

Bootstrap

  • The SLTS GarbageCollector is initialized post SLTS bootstrap and therefore does not need any special handling during bootstrap itself.
  • The first SystemJournal snapshot is also generated after bootstrap is complete. Therefore the SystemJournal does not need to delete chunks during SystemJournal::bootstrap.
  • As far as the SLTS is concerned the segments underlying the GC queue are treated like any other segment.

Scalability and Concurrency

  • The implementation uses a non-blocking, asynchronous programming model.
  • Each container uses its own dedicated GC queue. However within a container this queue is a shared resource.

Performance Impact

  • Many SLTS operations now have an indirect dependency on the durable log.

Compatibility and Migration

  • Currently there are no persisted data structures related to GC that need to be migrated.
  • All serialized data structures will be versioned.

Fault Scenarios

| Fault Scenario | How it is handled |
| --- | --- |
| Failure during GC | The delete operation is performed in a retry loop a fixed number of times. After that, the task is re-enqueued, again a fixed number of times. The GC task is attempted a fixed number of times, after which it is added to a separate "dead-letter queue" named GC_failed_queue_<container_id>. It is expected that in the future an admin tool will read this dead-letter queue. |
| Inability to write to the event stream | This will cause the SLTS operation to fail without committing. |
| Durable log is fenced out | DataLogWriterNotPrimaryException will be bubbled up to the caller. |
| Segment store crash | Please see the section on optimistic enqueue above. |

Assumptions

  1. The event stream segment is treated as system critical and is not throttled.
  2. The typical clock skew between nodes is an order of magnitude smaller than the GC delay.
  3. The same GC delay setting works for all cases in which we delete chunks.
  4. The amount of metadata written is several orders of magnitude smaller than the data itself (less than 0.01% overhead).

Limitations

  • Introduces a new dependency on the durable log.
  • Supports only one global value for the delay (all chunks are deleted after the same fixed amount of time).
  • Note that this design does not handle garbage chunks already present in the system. It only handles GC for chunks created after this change.

Discarded Approaches

The following options were considered for implementing the persistent GC queue:

  • Maintaining the GC queue in the SLTS metadata itself
    • This is a good solution but it does not handle newly created chunks.
    • Additions to the GC queue have to be serialized or at least thread safe, and have the potential to become a bottleneck.
  • Periodically persisting the GC queue to special chunks
    • This approach adds complexity to handle zombie and failover scenarios.
    • It does not handle the new-chunk garbage leak during write failures.
  • Periodically scanning the SLTS metadata and filtering out garbage
    • The disadvantage is that this involves whole-table scans.
    • An efficient implementation is complex.
  • Periodically scanning LTS to discover orphaned chunks
    • Not efficient at all.
    • Needs admin tools and adds to deployment complexity.

None of the approaches above solves all the failure modes at the same time, and we may end up implementing all of the above to cover all scenarios.

References