Memory Leak Detection Research Findings Summary
Executive Summary
After analyzing 20+ approaches for detecting memory leaks in production environments using eBPF and other low-overhead techniques, I've identified several viable strategies that balance detection accuracy with acceptable performance impact.
Key Finding: The Overhead vs Accuracy Trade-off
Critical Insight: Direct malloc/free tracing (the most accurate method) causes 30-400% overhead, making it completely unsuitable for production. This forces us to use indirect methods that trade some accuracy for dramatically lower overhead.
Top Production-Ready Approaches
1. 🥇 Page Fault Tracing (Best Overall Value)
- Overhead: <1% at typical page-fault rates
- How it works: Monitors page faults to detect memory growth patterns
- Detects:
- Sustained RSS growth
- VSZ vs RSS divergence (allocation without use)
- Working set expansion
- Anonymous vs file-backed fault ratios
- Tools:
- BCC stackcount
- bpftrace scripts
- Custom eBPF programs
- Limitations: Less precise than direct allocation tracking
- Why overlooked: Requires understanding of the Linux memory subsystem (a minimal fault-rate polling sketch follows below)
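A minimal sketch of the underlying signal, assuming a simple polling approach rather than eBPF tracepoints: it reads the minflt/majflt counters from /proc/<pid>/stat and reports fault rates. The PID argument and poll interval are illustrative; a production agent would attach eBPF programs (e.g. via BCC or bpftrace) to the fault paths instead.

```python
#!/usr/bin/env python3
"""Estimate a process's page-fault rate by polling /proc/<pid>/stat.
Field positions follow proc(5)."""
import sys
import time

def fault_counters(pid: int):
    """Return (minor_faults, major_faults) for the process."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # comm (field 2) may contain spaces, so split after the closing ')'.
    fields = data.rsplit(")", 1)[1].split()
    # minflt is overall field 10 and majflt field 12 -> indexes 7 and 9 here.
    return int(fields[7]), int(fields[9])

def monitor(pid: int, interval: float = 5.0) -> None:
    prev_min, prev_maj = fault_counters(pid)
    while True:
        time.sleep(interval)
        cur_min, cur_maj = fault_counters(pid)
        print(f"minor/s: {(cur_min - prev_min) / interval:8.1f}  "
              f"major/s: {(cur_maj - prev_maj) / interval:6.1f}")
        # A sustained minor-fault rate alongside RSS growth corresponds to
        # the "sustained growth" signal described above.
        prev_min, prev_maj = cur_min, cur_maj

if __name__ == "__main__":
    monitor(int(sys.argv[1]))
```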
2. 🥈 Allocator Built-in Profiling (Best Accuracy at Low Overhead)
- Overhead: ~4% throughput loss, ~10% P99 latency increase
- Options:
- jemalloc: Most mature, 512KB sampling interval
- tcmalloc: Google ecosystem, 1GB allocation triggers
- mimalloc: Lowest overhead but limited profiling
- How it works: Statistical sampling of allocations
- Deployment: Simple LD_PRELOAD injection
- Limitation: Small or infrequent leaks can slip between sampling intervals (a minimal LD_PRELOAD sketch follows below)
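A minimal deployment sketch, assuming a jemalloc build with profiling enabled (--enable-prof) and a distribution-specific library path: the target is launched with LD_PRELOAD plus MALLOC_CONF, keeping the ~512 KB sampling interval mentioned above (lg_prof_sample:19 = 2^19 bytes). The library path, output prefix, and wrapper itself are illustrative assumptions.

```python
#!/usr/bin/env python3
"""Launch a target command with jemalloc's sampled heap profiling enabled."""
import os
import subprocess
import sys

JEMALLOC = "/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"  # assumed path

def run_with_profiling(cmd):
    env = os.environ.copy()
    env["LD_PRELOAD"] = JEMALLOC
    # prof:true enables sampled profiling, prof_final:true dumps a profile
    # at exit, prof_prefix controls where the .heap files are written.
    env["MALLOC_CONF"] = ("prof:true,lg_prof_sample:19,"
                          "prof_final:true,prof_prefix:/tmp/jeprof")
    return subprocess.call(cmd, env=env)

if __name__ == "__main__":
    sys.exit(run_with_profiling(sys.argv[1:]))
```

The resulting .heap dumps can then be inspected with jeprof to attribute growth to allocation call sites.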
3. 🥉 PSI + Memory Metrics (Lowest Overhead)
- Overhead: Essentially zero
- What it monitors:
- Pressure Stall Information (memory pressure %)
- RSS/VSZ/PSS/USS metrics
- Memory growth rates
- Best for: Early warning system
- Tools:
- Facebook OOMD
- systemd-oomd
- /proc monitoring
- Limitation: Coarse-grained; requires other tools for root cause analysis (a minimal PSI polling sketch follows below)
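A minimal sketch of a PSI-based early-warning check, assuming Linux 4.20+ so that /proc/pressure/memory exists; the 10% threshold mirrors the heuristic listed under Detection Heuristics below, and the poll interval is an illustrative choice.

```python
#!/usr/bin/env python3
"""Poll /proc/pressure/memory and warn on sustained memory pressure."""
import time

def memory_pressure() -> dict:
    """Parse PSI into {'some': {'avg10': ...}, 'full': {...}}."""
    result = {}
    with open("/proc/pressure/memory") as f:
        for line in f:
            kind, rest = line.split(maxsplit=1)
            result[kind] = {k: float(v)
                            for k, v in (kv.split("=") for kv in rest.split())}
    return result

if __name__ == "__main__":
    while True:
        some10 = memory_pressure()["some"]["avg10"]
        if some10 > 10.0:  # tasks stalled on memory >10% of the time
            print(f"WARNING: memory pressure avg10={some10:.2f}%")
        time.sleep(10)
```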
4. Statistical/ML Detection (Most Innovative)
- Overhead: 0-5% depending on approach
- Approaches:
- SWAT (Microsoft): <5% overhead, <10% false positives
- Precog: ML using only memory utilization
- Time series analysis: ARIMA models for trend detection
- Key insight: "Stale" objects (not accessed for a long time) are likely leaks
- Limitation: Requires baseline data and tuning (a simple trend-fit sketch follows below)
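A sketch of the simplest statistical variant, assuming plain least-squares trend fitting rather than a full ARIMA model: it keeps a sliding window of RSS samples and flags sustained growth. The window size and slope threshold are illustrative assumptions.

```python
#!/usr/bin/env python3
"""Flag sustained RSS growth using a least-squares slope over a sliding window."""
from collections import deque

WINDOW = 60            # samples retained (e.g. 10 minutes at one per 10 s)
MIN_SLOPE_KIB_S = 64   # flag growth faster than ~64 KiB/s on average

def slope(samples) -> float:
    """Least-squares slope of (timestamp, rss_kib) pairs, in KiB per second."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    num = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den if den else 0.0

class RssTrend:
    def __init__(self):
        self.samples = deque(maxlen=WINDOW)

    def add(self, timestamp: float, rss_kib: int) -> bool:
        """Record one sample; return True once sustained growth is detected."""
        self.samples.append((timestamp, rss_kib))
        return len(self.samples) == WINDOW and slope(self.samples) > MIN_SLOPE_KIB_S
```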
5. Continuous Profiling Platforms (Most Comprehensive)
- Overhead: 1-2%
- Options:
- Parca: eBPF-based, always-on profiling
- Pixie: Kubernetes-native, automatic instrumentation
- Pyroscope: Multi-language support
- Benefits: Historical data, flame graphs, diff analysis
- Limitation: Additional infrastructure required
Approaches to Avoid in Production
❌ Direct malloc/free Tracing
- Tools: BCC memleak (without sampling), eBPF uprobes
- Overhead: 30-400% (MySQL: 33% throughput loss)
- Use case: Development only or very brief sampling
❌ Valgrind/Massif
- Overhead: 20-30x slowdown
- Problem: Thread serialization
- Use case: Development/testing only
❌ Heaptrack
- Overhead: 50-100% (1.5-2x slowdown)
- Use case: Short debugging sessions only
Surprising Discoveries
1. Page Fault Tracing is Underutilized
Despite having negligible overhead and providing excellent signals for memory growth, it's rarely mentioned in memory leak detection discussions.
2. Modern Allocators Are Production-Ready for Profiling
The ~4% overhead of jemalloc and tcmalloc profiling is acceptable for many production workloads, especially compared with having no visibility at all.
3. ByteHound (Rust) is Production-Viable
Despite being less well known, ByteHound has been optimized to roughly 20% overhead and is actively used in production by its author. Key characteristics:
- Complete allocation tracking (no sampling) via LD_PRELOAD
- Runtime optimizations: Backtrace deduplication, temporary allocation culling
- Best use case: Layer 3 deep analysis when other methods fail to identify leak source
- Trade-off: Requires process restart but provides unmatched detail with call stacks
- Sweet spot: Critical leak investigations where 20% overhead is temporarily acceptable
4. Machine Learning Works with Minimal Data
Precog can detect leaks using only system memory utilization metrics - no application instrumentation needed.
5. Frame Pointers Are Still Critical
Most eBPF tools require frame pointers for stack traces, but modern compilers omit them by default. This is a common deployment gotcha.
Recommended Implementation Strategy
Three-Layer Approach (a control-loop sketch follows the diagram)
Layer 1: Continuous Monitoring (Always On)
├── PSI monitoring (0% overhead)
├── Memory metrics (RSS/VSZ/PSS)
├── Page fault rates (<1% overhead)
└── Triggers anomaly detection
Layer 2: Anomaly Investigation (Triggered)
├── jemalloc profiling (4% overhead)
├── Page fault stack traces
├── Statistical analysis
└── Captures detailed data for ~5-10 minutes
Layer 3: Deep Debugging (On-Demand)
├── BCC memleak with heavy sampling (10-30% overhead)
├── ByteHound full tracing (20% overhead, complete tracking)
├── Heap dumps (process freeze)
└── Run for specific time window
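A minimal sketch of the escalation flow between the layers; the layer-1 check and layer-2 capture are placeholders standing in for the PSI/page-fault monitors and jemalloc profiling described above, and the polling cadence and capture window are illustrative assumptions.

```python
#!/usr/bin/env python3
"""Escalate from always-on monitoring to triggered investigation."""
import time

CAPTURE_WINDOW_S = 600   # layer-2 investigation window (~10 minutes)
POLL_INTERVAL_S = 30     # layer-1 cadence

def layer1_anomalous() -> bool:
    """Placeholder: combine PSI, RSS-growth and page-fault signals here."""
    return False

def layer2_capture(duration_s: int) -> bool:
    """Placeholder: enable jemalloc profiling / fault stack capture for a
    window and return True if a leak candidate is confirmed."""
    time.sleep(duration_s)
    return False

def control_loop() -> None:
    while True:
        if layer1_anomalous():
            if layer2_capture(CAPTURE_WINDOW_S):
                # Layer 3 stays manual: an operator runs ByteHound or
                # BCC memleak against the suspect process.
                print("leak candidate confirmed; escalate to layer 3")
        time.sleep(POLL_INTERVAL_S)

if __name__ == "__main__":
    control_loop()
```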
Detection Heuristics
Primary Signals (a threshold sketch follows after these lists):
- RSS growth rate > baseline + 2σ for 5+ minutes
- VSZ growing 2x faster than RSS
- Page fault rate > 1000/sec sustained
- Memory pressure > 10% for 60 seconds
- Anonymous page ratio > 80% of faults
Secondary Signals:
- Working set continuously expanding
- Allocation sites with no corresponding frees
- Objects not accessed for > 10 minutes
- Heap fragmentation increasing
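A minimal sketch of the first primary signal (RSS growth rate above baseline + 2σ for 5+ minutes), assuming fixed sample and baseline window sizes; a real detector would also fold in the VSZ, page-fault, and PSI signals above.

```python
#!/usr/bin/env python3
"""Detect RSS growth exceeding baseline + 2 sigma for a sustained period."""
import statistics
from collections import deque

class RssGrowthDetector:
    def __init__(self, baseline_window=720, breach_seconds=300, sample_interval=5):
        self.rates = deque(maxlen=baseline_window)  # historical growth rates (KiB/s)
        self.needed = breach_seconds // sample_interval
        self.interval = sample_interval
        self.breaches = 0
        self.prev_rss = None

    def observe(self, rss_kib: int) -> bool:
        """Feed one RSS sample; return True once growth stays above
        baseline + 2 sigma for the configured duration."""
        if self.prev_rss is None:
            self.prev_rss = rss_kib
            return False
        rate = (rss_kib - self.prev_rss) / self.interval
        self.prev_rss = rss_kib
        leak = False
        if len(self.rates) >= 30:  # require some history before judging
            baseline = statistics.fmean(self.rates)
            sigma = statistics.pstdev(self.rates)
            if rate > baseline + 2 * sigma:
                self.breaches += 1
                leak = self.breaches >= self.needed
            else:
                self.breaches = 0
        # Anomalous samples are still added here for simplicity; a production
        # detector might exclude them to avoid inflating the baseline.
        self.rates.append(rate)
        return leak
```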
Platform-Specific Recommendations
For Kubernetes Environments
- Use Pixie: Automatic eBPF instrumentation, no code changes
- Alternative: Parca for continuous profiling
For Traditional Linux Servers
- Primary: PSI + page fault tracing
- Secondary: jemalloc profiling
- Monitoring: Prometheus + node_exporter
For High-Performance Systems
- Use: Page fault tracing only
- Avoid: Any allocator instrumentation
- Consider: Statistical sampling approaches
For Java Applications
- Built-in: JVM heap profiling
- External: AppDynamics or New Relic
- Note: Different ecosystem, different tools
Critical Implementation Notes
1. Frame Pointer Requirement
# Compilation flags needed
gcc -fno-omit-frame-pointer
# Java needs
-XX:+PreserveFramePointer
2. Kernel Version Dependencies
- Page fault tracepoints: Linux 4.14+
- PSI: Linux 4.20+
- eBPF stack traces: Linux 4.6+
3. Sampling Trade-offs
- A smaller sampling interval = better accuracy but higher overhead
- jemalloc's default (512 KB) is a good balance
- Can be tuned based on allocation patterns
4. Container Considerations
- Shared memory complicates analysis
- Need container-aware tools
- cgroup v2 provides better metrics (a reading sketch follows below)
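A minimal sketch of cgroup v2 metric collection, assuming direct access to the cgroup filesystem; the cgroup path is an illustrative placeholder (under Kubernetes it would be derived from the pod and container IDs).

```python
#!/usr/bin/env python3
"""Read container memory usage and its anonymous/file breakdown from cgroup v2."""
from pathlib import Path

CGROUP = Path("/sys/fs/cgroup/system.slice/myservice.service")  # assumed path

def cgroup_memory(cg: Path) -> dict:
    stats = {"current": int((cg / "memory.current").read_text())}
    # memory.stat breaks usage into anon, file, slab, etc., which helps
    # separate anonymous growth (leak-like) from page-cache growth.
    for line in (cg / "memory.stat").read_text().splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats

if __name__ == "__main__":
    m = cgroup_memory(CGROUP)
    anon_fraction = m.get("anon", 0) / max(m["current"], 1)
    print(f"usage={m['current']} bytes, anonymous fraction={anon_fraction:.2%}")
```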
Tools Quick Reference
| Need | Best Tool | Overhead | Setup Complexity |
|---|---|---|---|
| Continuous monitoring | PSI + metrics | ~0% | Low |
| Quick check | Page fault tracing | <1% | Medium |
| Accurate profiling | jemalloc | ~4% | Low |
| Root cause analysis | BCC memleak (sampled) | 10-30% | Medium |
| Complete tracking | ByteHound | ~20% | Medium |
| Historical analysis | Parca/Pixie | 1-2% | High |
| Development debugging | Valgrind | ~2000% (20-30x) | Low |
Unexpected Findings
- Facebook's OOMD approach (PSI-based) is more sophisticated than most commercial solutions
- Statistical profiling (SWAT) achieved production success at Microsoft with minimal overhead
- Page fault patterns can distinguish between memory leaks and legitimate growth
- Most memory leaks can be detected without tracking every allocation
- The "stale object" heuristic (objects not accessed for extended periods) is surprisingly effective
What's Still Hard
- Distinguishing leaks from caches - Long-lived objects may be legitimate
- Custom allocators - jemalloc/tcmalloc profiling doesn't help when applications ship their own allocator
- Kernel memory leaks - Different tools needed (kmemleak)
- Small, slow leaks - May evade sampling intervals
- Root cause identification - Detection ≠ fixing
Future Directions
Emerging Technologies
- eBPF CO-RE: Write once, run everywhere
- BTF: Better stack traces without frame pointers
- io_uring: Potential for lower overhead tracing
Research Areas
- Better ML models for leak prediction
- Automatic root cause analysis
- Production-safe automatic remediation
- Cross-language leak detection
Conclusion
The key insight is that perfect is the enemy of good in production memory leak detection. While malloc/free tracing provides perfect accuracy, its overhead makes it unusable. Instead, combining multiple low-overhead techniques (page faults + PSI + metrics) provides sufficient signal to detect most leaks while maintaining production viability.
Recommended starting point: Implement page fault tracing with basic threshold detection. This gives excellent visibility with minimal overhead and can catch most memory leaks before they become critical.