Memory Leak Detection Research Findings Summary

Executive Summary

After analyzing 20+ approaches for detecting memory leaks in production environments using eBPF and other low-overhead techniques, I've identified several viable strategies that balance detection accuracy with acceptable performance impact.

Key Finding: The Overhead vs Accuracy Trade-off

Critical Insight: Direct malloc/free tracing (the most accurate method) causes 30-400% overhead, making it completely unsuitable for production. This forces us to use indirect methods that trade some accuracy for dramatically lower overhead.

Top Production-Ready Approaches

1. 🥇 Page Fault Tracing (Best Overall Value)

  • Overhead: <1% at typical page-fault rates
  • How it works: Monitors page-fault activity to detect memory growth patterns (a minimal sketch follows this list)
  • Detects:
    • Sustained RSS growth
    • VSZ vs RSS divergence (allocation without use)
    • Working set expansion
    • Anonymous vs file-backed fault ratios
  • Tools:
    • BCC stackcount
    • bpftrace scripts
    • Custom eBPF programs
  • Limitations: Less precise than direct allocation tracking
  • Why overlooked: Requires understanding of the memory subsystem
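
A minimal sketch of this approach, assuming BCC is installed and the kernel exposes the x86 exceptions:page_fault_user tracepoint (the same tracepoint commonly used with BCC stackcount): it counts user-space page faults per PID so a sustained rise in one process's fault rate can be flagged.

from time import sleep
from bcc import BPF

prog = r"""
BPF_HASH(faults, u32, u64);

TRACEPOINT_PROBE(exceptions, page_fault_user) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 zero = 0, *count = faults.lookup_or_try_init(&pid, &zero);
    if (count) {
        (*count)++;
    }
    return 0;
}
"""

b = BPF(text=prog)
print("Counting user-space page faults per PID (Ctrl-C to stop)")
while True:
    sleep(10)
    # Print the ten processes with the most faults in the last interval
    top = sorted(b["faults"].items(), key=lambda kv: kv[1].value, reverse=True)[:10]
    for pid, count in top:
        print(f"pid={pid.value:<8} faults/10s={count.value}")
    b["faults"].clear()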

2. 🥈 Allocator Built-in Profiling (Best Accuracy at Low Overhead)

  • Overhead: ~4% throughput loss, ~10% increase in P99 latency
  • Options:
    • jemalloc: Most mature, 512KB sampling interval
    • tcmalloc: Google ecosystem, 1GB allocation triggers
    • mimalloc: Lowest overhead but limited profiling
  • How it works: Statistical sampling of allocations
  • Deployment: Simple LD_PRELOAD injection (see the sketch after this list)
  • Limitation: Some leaks can evade sampling intervals
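
A hedged sketch of the LD_PRELOAD deployment above. The library path and the target command are placeholders, and jemalloc must be built with profiling support (--enable-prof) for the MALLOC_CONF profiling options to take effect.

import os
import subprocess

env = dict(os.environ)
# Path varies by distribution; shown here as an assumption
env["LD_PRELOAD"] = "/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"
# prof:true enables statistical heap profiling; lg_prof_sample:19 keeps the
# default ~512KB sampling interval; prof_prefix controls where dumps are written
env["MALLOC_CONF"] = "prof:true,lg_prof_sample:19,prof_prefix:/tmp/jeprof"

# Hypothetical service binary under investigation
subprocess.run(["./my-service"], env=env)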

3. 🥉 PSI + Memory Metrics (Lowest Overhead)

  • Overhead: Essentially zero
  • What it monitors:
    • Pressure Stall Information (memory pressure %)
    • RSS/VSZ/PSS/USS metrics
    • Memory growth rates
  • Best for: Early warning system
  • Tools:
    • Facebook OOMD
    • systemd-oomd
    • /proc monitoring (sketched after this list)
  • Limitation: Coarse-grained, requires other tools for root cause
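
As a concrete example of this near-zero-overhead monitoring, the following sketch polls /proc/pressure/memory (Linux 4.20+) and flags when the 10-second "some" average crosses the 10% threshold used in the detection heuristics later in this document.

import time

def memory_pressure_some_avg10(path="/proc/pressure/memory"):
    # Lines look like: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
    with open(path) as f:
        for line in f:
            if line.startswith("some"):
                fields = dict(kv.split("=") for kv in line.split()[1:])
                return float(fields["avg10"])
    return 0.0

while True:
    avg10 = memory_pressure_some_avg10()
    if avg10 > 10.0:
        print(f"memory pressure elevated: some avg10={avg10:.2f}%")
    time.sleep(5)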

4. Statistical/ML Detection (Most Innovative)

  • Overhead: 0-5% depending on approach
  • Approaches:
    • SWAT (Microsoft): <5% overhead, <10% false positives
    • Precog: ML using only memory utilization
    • Time series analysis: ARIMA models for trend detection (a simpler trend-fit sketch follows this list)
  • Key insight: "Stale" objects (not accessed for a long time) are likely leaks
  • Limitation: Requires baseline data and tuning
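
To make the time-series idea concrete, here is a dependency-free sketch that samples a process's RSS from /proc/<pid>/status and fits a least-squares slope; a slope that stays positive across successive windows is a leak signal. ARIMA is replaced by a plain linear fit purely to keep the example small.

import time

def rss_kib(pid):
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # value is reported in kB
    return 0

def rss_slope(pid, samples=30, interval=10.0):
    """RSS growth rate in KiB/second over samples * interval seconds."""
    xs, ys = [], []
    for i in range(samples):
        xs.append(i * interval)
        ys.append(rss_kib(pid))
        time.sleep(interval)
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var = sum((x - mean_x) ** 2 for x in xs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cov / var if var else 0.0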

5. Continuous Profiling Platforms (Most Comprehensive)

  • Overhead: 1-2%
  • Options:
    • Parca: eBPF-based, always-on profiling
    • Pixie: Kubernetes-native, automatic instrumentation
    • Pyroscope: Multi-language support
  • Benefits: Historical data, flame graphs, diff analysis
  • Limitation: Additional infrastructure required

Approaches to Avoid in Production

Direct malloc/free Tracing

  • Tools: BCC memleak (without sampling), eBPF uprobes
  • Overhead: 30-400% (MySQL: 33% throughput loss)
  • Use case: Development only or very brief sampling

Valgrind/Massif

  • Overhead: 20-30x slowdown
  • Problem: Thread serialization
  • Use case: Development/testing only

Heaptrack

  • Overhead: 50-100% (1.5-2x slowdown)
  • Use case: Short debugging sessions only

Surprising Discoveries

1. Page Fault Tracing is Underutilized

Despite having negligible overhead and providing excellent signals for memory growth, it's rarely mentioned in memory leak detection discussions.

2. Modern Allocators Are Production-Ready for Profiling

jemalloc and tcmalloc's ~4% overhead is acceptable for many production workloads, especially compared to the alternative of having no visibility.

3. ByteHound (Rust) is Production-Viable

Despite being less well known, ByteHound has been optimized to ~20% overhead and is actively used in production by its author. Key characteristics:

  • Complete allocation tracking (no sampling) via LD_PRELOAD
  • Runtime optimizations: Backtrace deduplication, temporary allocation culling
  • Best use case: Layer 3 deep analysis when other methods fail to identify leak source
  • Trade-off: Requires process restart but provides unmatched detail with call stacks
  • Sweet spot: Critical leak investigations where 20% overhead is temporarily acceptable

4. Machine Learning Works with Minimal Data

Precog can detect leaks using only system memory utilization metrics - no application instrumentation needed.

5. Frame Pointers Are Still Critical

Most eBPF tools require frame pointers for stack traces, but modern compilers omit them by default. This is a common deployment gotcha.

Recommended Implementation Strategy

Three-Layer Approach

Layer 1: Continuous Monitoring (Always On)
├── PSI monitoring (0% overhead)
├── Memory metrics (RSS/VSZ/PSS)
├── Page fault rates (<1% overhead)
└── Triggers anomaly detection

Layer 2: Anomaly Investigation (Triggered)
├── jemalloc profiling (4% overhead)
├── Page fault stack traces
├── Statistical analysis
└── Captures detailed data for ~5-10 minutes

Layer 3: Deep Debugging (On-Demand)
├── BCC memleak with heavy sampling (10-30% overhead)
├── ByteHound full tracing (20% overhead, complete tracking)
├── Heap dumps (process freeze)
└── Run for specific time window
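
A sketch of how a Layer 3 deep dive could be triggered programmatically, assuming BCC's memleak tool is installed (the path is /usr/share/bcc/tools/memleak on some distributions and memleak-bpfcc on others): it runs against a suspect PID for a bounded window, sampling every 10th allocation to cap the overhead.

import subprocess

def deep_dive(pid, duration_s=300):
    cmd = [
        "/usr/share/bcc/tools/memleak",  # assumed install path
        "-p", str(pid),                  # attach to the suspect process
        "-s", "10",                      # sample every 10th allocation
        "30",                            # report outstanding allocations every 30s
    ]
    try:
        subprocess.run(cmd, timeout=duration_s)
    except subprocess.TimeoutExpired:
        pass  # the bounded collection window expired, as intended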

Detection Heuristics

Primary Signals (a detection sketch follows these lists):

  1. RSS growth rate > baseline + 2σ for 5+ minutes
  2. VSZ growing 2x faster than RSS
  3. Page fault rate > 1000/sec sustained
  4. Memory pressure > 10% for 60 seconds
  5. Anonymous page ratio > 80% of faults

Secondary Signals:

  • Working set continuously expanding
  • Allocation sites with no corresponding frees
  • Objects not accessed for > 10 minutes
  • Heap fragmentation increasing
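
A sketch of primary signals 1 and 2, assuming a rolling history of (rss_kib, vsz_kib) samples taken once per minute for a single process; the thresholds mirror the lists above and would need tuning per workload.

from statistics import mean, stdev

def leak_signals(history, window=5):
    """history: list of (rss_kib, vsz_kib) tuples, oldest first, one per minute."""
    if len(history) < window + 10:
        return []  # not enough baseline data yet
    rss = [r for r, _ in history]
    vsz = [v for _, v in history]
    signals = []

    # Signal 1: per-minute RSS growth above baseline + 2 sigma for `window` minutes
    deltas = [b - a for a, b in zip(rss, rss[1:])]
    baseline = deltas[:-window]
    threshold = mean(baseline) + 2 * stdev(baseline)
    if all(d > threshold for d in deltas[-window:]):
        signals.append("rss_growth_above_baseline_2sigma")

    # Signal 2: VSZ growing at least 2x faster than RSS over the same window
    rss_growth = rss[-1] - rss[-1 - window]
    vsz_growth = vsz[-1] - vsz[-1 - window]
    if vsz_growth > 2 * max(rss_growth, 1):
        signals.append("vsz_outpacing_rss")

    return signals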

Platform-Specific Recommendations

For Kubernetes Environments

  • Use Pixie: Automatic eBPF instrumentation, no code changes
  • Alternative: Parca for continuous profiling

For Traditional Linux Servers

  • Primary: PSI + page fault tracing
  • Secondary: jemalloc profiling
  • Monitoring: Prometheus + node_exporter

For High-Performance Systems

  • Use: Page fault tracing only
  • Avoid: Any allocator instrumentation
  • Consider: Statistical sampling approaches

For Java Applications

  • Built-in: JVM heap profiling
  • External: AppDynamics or New Relic
  • Note: Different ecosystem, different tools

Critical Implementation Notes

1. Frame Pointer Requirement

# C/C++ compilation flag needed for eBPF stack traces
gcc -fno-omit-frame-pointer
# JVM flag needed for eBPF stack traces
java -XX:+PreserveFramePointer

2. Kernel Version Dependencies

  • Page fault tracepoints: Linux 4.14+
  • PSI: Linux 4.20+
  • eBPF stack traces: Linux 4.6+

3. Sampling Trade-offs

  • Smaller sampling intervals = better accuracy but higher overhead
  • jemalloc's default (512KB) is a good balance
  • Can be tuned based on allocation patterns

4. Container Considerations

  • Shared memory complicates analysis
  • Need container-aware tools
  • cgroup v2 provides better metrics (see the sketch after this list)
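
For container workloads, a sketch of reading the cgroup v2 memory controller directly (the cgroup path below is an assumption): memory.current is what the container is actually charged, and a rising "anon" count in memory.stat while "file" stays flat points at heap growth rather than page cache.

CGROUP = "/sys/fs/cgroup/mycontainer"  # hypothetical cgroup v2 path

def cgroup_memory(cgroup=CGROUP):
    with open(f"{cgroup}/memory.current") as f:
        current = int(f.read())
    stats = {}
    with open(f"{cgroup}/memory.stat") as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    return current, stats.get("anon", 0), stats.get("file", 0)

usage, anon, file_backed = cgroup_memory()
print(f"charged={usage} anon={anon} file={file_backed} (bytes)")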

Tools Quick Reference

Need                   | Best Tool             | Overhead | Setup Complexity
-----------------------|-----------------------|----------|-----------------
Continuous monitoring  | PSI + metrics         | 0%       | Low
Quick check            | Page fault tracing    | <1%      | Medium
Accurate profiling     | jemalloc              | 4%       | Low
Root cause analysis    | BCC memleak (sampled) | 10-30%   | Medium
Complete tracking      | ByteHound             | 20%      | Medium
Historical analysis    | Parca/Pixie           | 1-2%     | High
Development debugging  | Valgrind              | 2000%    | Low

Unexpected Findings

  1. Facebook's OOMD approach (PSI-based) is more sophisticated than most commercial solutions

  2. Statistical profiling (SWAT) achieved production success at Microsoft with minimal overhead

  3. Page fault patterns can distinguish between memory leaks and legitimate growth

  4. Most memory leaks can be detected without tracking every allocation

  5. The "stale object" heuristic (objects not accessed for extended periods) is surprisingly effective

What's Still Hard

  1. Distinguishing leaks from caches - Long-lived objects may be legitimate
  2. Custom allocators - jemalloc/tcmalloc profiling can't see allocations that bypass them
  3. Kernel memory leaks - Different tools needed (kmemleak)
  4. Small, slow leaks - May evade sampling intervals
  5. Root cause identification - Detection ≠ fixing

Future Directions

Emerging Technologies

  • eBPF CO-RE: Write once, run everywhere
  • BTF: Better stack traces without frame pointers
  • io_uring: Potential for lower overhead tracing

Research Areas

  • Better ML models for leak prediction
  • Automatic root cause analysis
  • Production-safe automatic remediation
  • Cross-language leak detection

Conclusion

The key insight is that perfect is the enemy of good in production memory leak detection. While malloc/free tracing provides perfect accuracy, its overhead makes it unusable. Instead, combining multiple low-overhead techniques (page faults + PSI + metrics) provides sufficient signal to detect most leaks while maintaining production viability.

Recommended starting point: Implement page fault tracing with basic threshold detection. This gives excellent visibility with minimal overhead and can catch most memory leaks before they become critical.