Memory Leak Detection Research Findings Summary

Executive Summary

After analyzing 20+ approaches for detecting memory leaks in production environments using eBPF and other low-overhead techniques, I've identified several viable strategies that balance detection accuracy with acceptable performance impact.

Key Finding: The Overhead vs Accuracy Trade-off

Critical Insight: Direct malloc/free tracing (the most accurate method) causes 30-400% overhead, making it completely unsuitable for production. This forces us to use indirect methods that trade some accuracy for dramatically lower overhead.

Top Production-Ready Approaches

1. 🥇 Page Fault Tracing (Best Overall Value)

  • Overhead: <1% at typical page-fault rates
  • How it works: Monitors page-fault activity to detect memory growth patterns (a minimal sketch follows this list)
  • Detects:
    • Sustained RSS growth
    • VSZ vs RSS divergence (allocation without use)
    • Working set expansion
    • Anonymous vs file-backed fault ratios
  • Tools:
    • BCC stackcount
    • bpftrace scripts
    • Custom eBPF programs
  • Limitations: Less precise than direct allocation tracking
  • Why overlooked: Requires understanding of the memory subsystem
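
A minimal sketch of this approach, assuming BCC is installed and the kernel exposes the x86 exceptions:page_fault_user tracepoint (the same tracepoint commonly used with BCC stackcount): it counts user-space page faults per PID so a sustained rise in one process's fault rate can be flagged.

from time import sleep
from bcc import BPF

prog = r"""
BPF_HASH(faults, u32, u64);

TRACEPOINT_PROBE(exceptions, page_fault_user) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 zero = 0, *count = faults.lookup_or_try_init(&pid, &zero);
    if (count) {
        (*count)++;
    }
    return 0;
}
"""

b = BPF(text=prog)
print("Counting user-space page faults per PID (Ctrl-C to stop)")
while True:
    sleep(10)
    # Print the ten processes with the most faults in the last interval
    top = sorted(b["faults"].items(), key=lambda kv: kv[1].value, reverse=True)[:10]
    for pid, count in top:
        print(f"pid={pid.value:<8} faults/10s={count.value}")
    b["faults"].clear()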

2. 🥈 Allocator Built-in Profiling (Best Accuracy at Low Overhead)

  • Overhead: ~4% throughput loss, ~10% increase in P99 latency
  • Options:
    • jemalloc: Most mature, 512KB sampling interval
    • tcmalloc: Google ecosystem, 1GB allocation triggers
    • mimalloc: Lowest overhead but limited profiling
  • How it works: Statistical sampling of allocations
  • Deployment: Simple LD_PRELOAD injection (see the sketch after this list)
  • Limitation: Some leaks can evade sampling intervals
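
A hedged sketch of the LD_PRELOAD deployment above. The library path and the target command are placeholders, and jemalloc must be built with profiling support (--enable-prof) for the MALLOC_CONF profiling options to take effect.

import os
import subprocess

env = dict(os.environ)
# Path varies by distribution; shown here as an assumption
env["LD_PRELOAD"] = "/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"
# prof:true enables statistical heap profiling; lg_prof_sample:19 keeps the
# default ~512KB sampling interval; prof_prefix controls where dumps are written
env["MALLOC_CONF"] = "prof:true,lg_prof_sample:19,prof_prefix:/tmp/jeprof"

# Hypothetical service binary under investigation
subprocess.run(["./my-service"], env=env)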

3. 🥉 PSI + Memory Metrics (Lowest Overhead)

  • Overhead: Essentially zero
  • What it monitors:
    • Pressure Stall Information (memory pressure %)
    • RSS/VSZ/PSS/USS metrics
    • Memory growth rates
  • Best for: Early warning system
  • Tools:
    • Facebook OOMD
    • systemd-oomd
    • /proc monitoring (sketched after this list)
  • Limitation: Coarse-grained, requires other tools for root cause
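
As a concrete example of this near-zero-overhead monitoring, the following sketch polls /proc/pressure/memory (Linux 4.20+) and flags when the 10-second "some" average crosses the 10% threshold used in the detection heuristics later in this document.

import time

def memory_pressure_some_avg10(path="/proc/pressure/memory"):
    # Lines look like: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
    with open(path) as f:
        for line in f:
            if line.startswith("some"):
                fields = dict(kv.split("=") for kv in line.split()[1:])
                return float(fields["avg10"])
    return 0.0

while True:
    avg10 = memory_pressure_some_avg10()
    if avg10 > 10.0:
        print(f"memory pressure elevated: some avg10={avg10:.2f}%")
    time.sleep(5)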

4. Statistical/ML Detection (Most Innovative)

  • Overhead: 0-5% depending on approach
  • Approaches:
    • SWAT (Microsoft): <5% overhead, <10% false positives
    • Precog: ML using only memory utilization
    • Time series analysis: ARIMA models for trend detection (a simpler trend-fit sketch follows this list)
  • Key insight: "Stale" objects (not accessed for a long time) are likely leaks
  • Limitation: Requires baseline data and tuning
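
To make the time-series idea concrete, here is a dependency-free sketch that samples a process's RSS from /proc/<pid>/status and fits a least-squares slope; a slope that stays positive across successive windows is a leak signal. ARIMA is replaced by a plain linear fit purely to keep the example small.

import time

def rss_kib(pid):
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # value is reported in kB
    return 0

def rss_slope(pid, samples=30, interval=10.0):
    """RSS growth rate in KiB/second over samples * interval seconds."""
    xs, ys = [], []
    for i in range(samples):
        xs.append(i * interval)
        ys.append(rss_kib(pid))
        time.sleep(interval)
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var = sum((x - mean_x) ** 2 for x in xs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cov / var if var else 0.0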

5. Continuous Profiling Platforms (Most Comprehensive)

  • Overhead: 1-2%
  • Options:
    • Parca: eBPF-based, always-on profiling
    • Pixie: Kubernetes-native, automatic instrumentation
    • Pyroscope: Multi-language support
  • Benefits: Historical data, flame graphs, diff analysis
  • Limitation: Additional infrastructure required

Approaches to Avoid in Production

Direct malloc/free Tracing

  • Tools: BCC memleak (without sampling), eBPF uprobes
  • Overhead: 30-400% (MySQL: 33% throughput loss)
  • Use case: Development only or very brief sampling

Valgrind/Massif

  • Overhead: 20-30x slowdown
  • Problem: Thread serialization
  • Use case: Development/testing only

Heaptrack

  • Overhead: 50-100% (1.5-2x slowdown)
  • Use case: Short debugging sessions only

Surprising Discoveries

1. Page Fault Tracing is Underutilized

Despite having negligible overhead and providing excellent signals for memory growth, it's rarely mentioned in memory leak detection discussions.

2. Modern Allocators Are Production-Ready for Profiling

jemalloc and tcmalloc's ~4% overhead is acceptable for many production workloads, especially compared to the alternative of having no visibility.

3. ByteHound (Rust) is Production-Viable

Despite being less well known, ByteHound has been optimized to ~20% overhead and is actively used in production by its author. Key characteristics:

  • Complete allocation tracking (no sampling) via LD_PRELOAD
  • Runtime optimizations: Backtrace deduplication, temporary allocation culling
  • Best use case: Layer 3 deep analysis when other methods fail to identify leak source
  • Trade-off: Requires process restart but provides unmatched detail with call stacks
  • Sweet spot: Critical leak investigations where 20% overhead is temporarily acceptable

4. Machine Learning Works with Minimal Data

Precog can detect leaks using only system memory utilization metrics - no application instrumentation needed.

5. Frame Pointers Are Still Critical

Most eBPF tools require frame pointers for stack traces, but modern compilers omit them by default. This is a common deployment gotcha.

Recommended Implementation Strategy

Three-Layer Approach

Layer 1: Continuous Monitoring (Always On)
├── PSI monitoring (0% overhead)
├── Memory metrics (RSS/VSZ/PSS)
├── Page fault rates (<1% overhead)
└── Triggers anomaly detection

Layer 2: Anomaly Investigation (Triggered)
├── jemalloc profiling (4% overhead)
├── Page fault stack traces
├── Statistical analysis
└── Captures detailed data for ~5-10 minutes

Layer 3: Deep Debugging (On-Demand)
├── BCC memleak with heavy sampling (10-30% overhead)
├── ByteHound full tracing (20% overhead, complete tracking)
├── Heap dumps (process freeze)
└── Run for specific time window
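
A sketch of how a Layer 3 deep dive could be triggered programmatically, assuming BCC's memleak tool is installed (the path is /usr/share/bcc/tools/memleak on some distributions and memleak-bpfcc on others): it runs against a suspect PID for a bounded window, sampling every 10th allocation to cap the overhead.

import subprocess

def deep_dive(pid, duration_s=300):
    cmd = [
        "/usr/share/bcc/tools/memleak",  # assumed install path
        "-p", str(pid),                  # attach to the suspect process
        "-s", "10",                      # sample every 10th allocation
        "30",                            # report outstanding allocations every 30s
    ]
    try:
        subprocess.run(cmd, timeout=duration_s)
    except subprocess.TimeoutExpired:
        pass  # the bounded collection window expired, as intended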

Detection Heuristics

Primary Signals (a detection sketch follows these lists):

  1. RSS growth rate > baseline + 2σ for 5+ minutes
  2. VSZ growing 2x faster than RSS
  3. Page fault rate > 1000/sec sustained
  4. Memory pressure > 10% for 60 seconds
  5. Anonymous page ratio > 80% of faults

Secondary Signals:

  • Working set continuously expanding
  • Allocation sites with no corresponding frees
  • Objects not accessed for > 10 minutes
  • Heap fragmentation increasing
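
A sketch of primary signals 1 and 2, assuming a rolling history of (rss_kib, vsz_kib) samples taken once per minute for a single process; the thresholds mirror the lists above and would need tuning per workload.

from statistics import mean, stdev

def leak_signals(history, window=5):
    """history: list of (rss_kib, vsz_kib) tuples, oldest first, one per minute."""
    if len(history) < window + 10:
        return []  # not enough baseline data yet
    rss = [r for r, _ in history]
    vsz = [v for _, v in history]
    signals = []

    # Signal 1: per-minute RSS growth above baseline + 2 sigma for `window` minutes
    deltas = [b - a for a, b in zip(rss, rss[1:])]
    baseline = deltas[:-window]
    threshold = mean(baseline) + 2 * stdev(baseline)
    if all(d > threshold for d in deltas[-window:]):
        signals.append("rss_growth_above_baseline_2sigma")

    # Signal 2: VSZ growing at least 2x faster than RSS over the same window
    rss_growth = rss[-1] - rss[-1 - window]
    vsz_growth = vsz[-1] - vsz[-1 - window]
    if vsz_growth > 2 * max(rss_growth, 1):
        signals.append("vsz_outpacing_rss")

    return signals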

Platform-Specific Recommendations

For Kubernetes Environments

  • Use Pixie: Automatic eBPF instrumentation, no code changes
  • Alternative: Parca for continuous profiling

For Traditional Linux Servers

  • Primary: PSI + page fault tracing
  • Secondary: jemalloc profiling
  • Monitoring: Prometheus + node_exporter

For High-Performance Systems

  • Use: Page fault tracing only
  • Avoid: Any allocator instrumentation
  • Consider: Statistical sampling approaches

For Java Applications

  • Built-in: JVM heap profiling
  • External: AppDynamics or New Relic
  • Note: Different ecosystem, different tools

Critical Implementation Notes

1. Frame Pointer Requirement

# C/C++ compilation flag needed for eBPF stack traces
gcc -fno-omit-frame-pointer
# JVM flag needed for eBPF stack traces
java -XX:+PreserveFramePointer

2. Kernel Version Dependencies

  • Page fault tracepoints: Linux 4.14+
  • PSI: Linux 4.20+
  • eBPF stack traces: Linux 4.6+

3. Sampling Trade-offs

  • Smaller sampling intervals = better accuracy but higher overhead
  • jemalloc's default (512KB) is a good balance
  • Can be tuned based on allocation patterns

4. Container Considerations

  • Shared memory complicates analysis
  • Need container-aware tools
  • cgroup v2 provides better metrics (see the sketch after this list)
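
For container workloads, a sketch of reading the cgroup v2 memory controller directly (the cgroup path below is an assumption): memory.current is what the container is actually charged, and a rising "anon" count in memory.stat while "file" stays flat points at heap growth rather than page cache.

CGROUP = "/sys/fs/cgroup/mycontainer"  # hypothetical cgroup v2 path

def cgroup_memory(cgroup=CGROUP):
    with open(f"{cgroup}/memory.current") as f:
        current = int(f.read())
    stats = {}
    with open(f"{cgroup}/memory.stat") as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    return current, stats.get("anon", 0), stats.get("file", 0)

usage, anon, file_backed = cgroup_memory()
print(f"charged={usage} anon={anon} file={file_backed} (bytes)")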

Tools Quick Reference

Need                   | Best Tool             | Overhead | Setup Complexity
-----------------------|-----------------------|----------|-----------------
Continuous monitoring  | PSI + metrics         | 0%       | Low
Quick check            | Page fault tracing    | <1%      | Medium
Accurate profiling     | jemalloc              | 4%       | Low
Root cause analysis    | BCC memleak (sampled) | 10-30%   | Medium
Complete tracking      | ByteHound             | 20%      | Medium
Historical analysis    | Parca/Pixie           | 1-2%     | High
Development debugging  | Valgrind              | 2000%    | Low

Unexpected Findings

  1. Facebook's OOMD approach (PSI-based) is more sophisticated than most commercial solutions

  2. Statistical profiling (SWAT) achieved production success at Microsoft with minimal overhead

  3. Page fault patterns can distinguish between memory leaks and legitimate growth

  4. Most memory leaks can be detected without tracking every allocation

  5. The "stale object" heuristic (objects not accessed for extended periods) is surprisingly effective

What's Still Hard

  1. Distinguishing leaks from caches - Long-lived objects may be legitimate
  2. Custom allocators - jemalloc/tcmalloc profiling can't see allocations that bypass them
  3. Kernel memory leaks - Different tools needed (kmemleak)
  4. Small, slow leaks - May evade sampling intervals
  5. Root cause identification - Detection ≠ fixing

Future Directions

Emerging Technologies

  • eBPF CO-RE: Write once, run everywhere
  • BTF: Better stack traces without frame pointers
  • io_uring: Potential for lower overhead tracing

Research Areas

  • Better ML models for leak prediction
  • Automatic root cause analysis
  • Production-safe automatic remediation
  • Cross-language leak detection

Conclusion

The key insight is that perfect is the enemy of good in production memory leak detection. While malloc/free tracing provides perfect accuracy, its overhead makes it unusable. Instead, combining multiple low-overhead techniques (page faults + PSI + metrics) provides sufficient signal to detect most leaks while maintaining production viability.

Recommended starting point: Implement page fault tracing with basic threshold detection. This gives excellent visibility with minimal overhead and can catch most memory leaks before they become critical.