Exploratory Matrix - antimetal/system-agent GitHub Wiki

Memory Leak Detection Technologies Comparison Matrix

Overview

This matrix compares all researched memory leak detection approaches across multiple dimensions, including performance overhead, accuracy, deployment complexity, and production readiness.

Scoring Legend

  • Overhead: Percentage performance impact (lower is better)
  • Accuracy: 🟢 High (>90%) | 🟡 Medium (60-90%) | 🔴 Low (<60%)
  • False Positives: 🟢 Low (<10%) | 🟡 Medium (10-30%) | 🔴 High (>30%)
  • Setup Complexity: 🟢 Easy | 🟡 Moderate | 🔴 Complex
  • Production Ready: ✅ Yes | ⚠️ Limited | ❌ No
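
The thresholds in the legend can be written down as a small classifier. This is only a sketch of the legend's bands; the function names are hypothetical, and the handling of the exact boundary values (90%, 10%, 30%) follows the inequalities as written in the legend.

```python
def accuracy_rating(percent: float) -> str:
    """Map an accuracy percentage to the legend's bands."""
    if percent > 90:
        return "High"      # legend: High (>90%)
    if percent >= 60:
        return "Medium"    # legend: Medium (60-90%)
    return "Low"           # legend: Low (<60%)


def false_positive_rating(percent: float) -> str:
    """Map a false-positive percentage to the legend's bands."""
    if percent < 10:
        return "Low"       # legend: Low (<10%)
    if percent <= 30:
        return "Medium"    # legend: Medium (10-30%)
    return "High"          # legend: High (>30%)
```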

Comprehensive Comparison Matrix

| Technology | Overhead | Accuracy | False Positives | Granularity | Setup | Prod Ready | Restart Required | Stack Traces | Platform Requirements | Key Limitations |
|---|---|---|---|---|---|---|---|---|---|---|
| Page Fault Tracing | <1% | 🟡 Medium | 🟢 Low | Coarse | 🟢 Easy | ✅ Yes | No | Yes* | Linux 4.14+, Frame pointers | Indirect detection only |
| jemalloc Profiling | ~4% | 🟢 High | 🟢 Low | Fine | 🟢 Easy | ✅ Yes | No** | Yes | LD_PRELOAD support | Sampling may miss small leaks |
| tcmalloc Profiling | ~5% | 🟢 High | 🟢 Low | Fine | 🟢 Easy | ✅ Yes | No** | Yes | LD_PRELOAD support | Google ecosystem focused |
| mimalloc Profiling | ~2% | 🟡 Medium | 🟢 Low | Medium | 🟢 Easy | ✅ Yes | Yes | Limited | Windows/Linux/macOS | Limited profiling features |
| PSI + Metrics | 0% | 🔴 Low | 🔴 High | Very Coarse | 🟢 Easy | ✅ Yes | No | No | Linux 4.20+ | Detection only, no root cause |
| Hardware PMCs | 0% | 🟡 Medium | 🟡 Medium | Coarse | 🔴 Complex | ⚠️ Limited | No | No | Intel/AMD CPUs, root access | Requires expertise to interpret |
| SWAT (Statistical) | <5% | 🟢 High | 🟢 Low | Fine | 🟡 Moderate | ✅ Yes | No | Yes | Windows/Linux | Requires baseline period |
| Precog (ML) | ~1% | 🟡 Medium | 🟡 Medium | Coarse | 🟡 Moderate | ⚠️ Limited | No | No | Training data required | Needs historical data |
| BCC memleak (sampled) | 10-30% | 🟢 High | 🟢 Low | Fine | 🟡 Moderate | ⚠️ Limited | No | Yes | Linux 4.6+, BCC tools | Still significant overhead |
| BCC memleak (full) | 30-400% | 🟢 High | 🟢 Low | Very Fine | 🟡 Moderate | ❌ No | No | Yes | Linux 4.6+, BCC tools | Unsuitable for production |
| ByteHound | ~20% | 🟢 High | 🟢 Low | Very Fine | 🟡 Moderate | ⚠️ Limited | Yes | Yes | Linux, Rust runtime | Requires process restart |
| Parca | 1-2% | 🟡 Medium | 🟢 Low | Fine | 🔴 Complex | ✅ Yes | No | Yes | Kubernetes, eBPF | Additional infrastructure |
| Pixie | 1-2% | 🟡 Medium | 🟢 Low | Fine | 🔴 Complex | ✅ Yes | No | Yes | Kubernetes only | K8s specific |
| Pyroscope | 1-2% | 🟡 Medium | 🟢 Low | Fine | 🟡 Moderate | ✅ Yes | No | Yes | Multi-platform | Server infrastructure needed |
| Valgrind/Massif | 2000% | 🟢 High | 🟢 Low | Very Fine | 🟢 Easy | ❌ No | Yes | Yes | Linux/macOS | Dev only, serializes threads |
| Heaptrack | 50-100% | 🟢 High | 🟢 Low | Very Fine | 🟢 Easy | ❌ No | Yes | Yes | Linux | Dev/debug only |
| ASAN | 200-300% | 🟢 High | 🟢 Low | Very Fine | 🟡 Moderate | ❌ No | Rebuild | Yes | Compiler support | Requires recompilation |
| LSAN | 150-200% | 🟢 High | 🟢 Low | Very Fine | 🟡 Moderate | ❌ No | Rebuild | Yes | LLVM/GCC | Requires recompilation |
| brk/mmap Tracing | <1% | 🔴 Low | 🔴 High | Very Coarse | 🟢 Easy | ✅ Yes | No | Yes | Linux, eBPF | Only heap expansion |
| LeakGuard | 5-10% | 🟢 High | 🟢 Low | Fine | 🟡 Moderate | ⚠️ Limited | No | Yes | Research prototype | Not widely available |
| GenCount | 5-15% | 🟡 Medium | 🟡 Medium | Fine | 🟡 Moderate | ⚠️ Limited | No | Yes | Research prototype | Academic tool |
| Sleigh | 10-20% | 🟡 Medium | 🟡 Medium | Fine | 🟡 Moderate | ⚠️ Limited | No | Yes | Research prototype | Limited deployment |

*Requires frame pointers to be enabled (-fno-omit-frame-pointer)

**Can be enabled at runtime with mallctl() or environment variables
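
As an illustration of the environment-variable route for jemalloc, the sketch below builds the environment for a child process. It assumes a jemalloc build with profiling support compiled in (`--enable-prof`); the library path, the dump prefix, and the helper name `jemalloc_profiling_env` are hypothetical. `prof`, `lg_prof_sample`, and `prof_prefix` are standard `MALLOC_CONF` options.

```python
import os


def jemalloc_profiling_env(lib_path, lg_sample=19, prefix="/tmp/jeprof", base_env=None):
    """Return an environment dict that preloads jemalloc with heap profiling on.

    lib_path  -- path to libjemalloc (an assumption; varies by distribution)
    lg_sample -- log2 of the average bytes between samples (19 = 512 KiB, jemalloc's default)
    prefix    -- filename prefix for profile dumps (hypothetical location)
    """
    env = dict(os.environ if base_env is None else base_env)
    env["LD_PRELOAD"] = lib_path
    env["MALLOC_CONF"] = f"prof:true,lg_prof_sample:{lg_sample},prof_prefix:{prefix}"
    return env


# Hypothetical usage: launch a service with profiling enabled.
# import subprocess
# subprocess.run(["./my_service"], env=jemalloc_profiling_env("/usr/lib/libjemalloc.so.2"))
```

A larger `lg_sample` lowers overhead but makes it more likely that small leaks go unsampled, which is the limitation noted in the matrix.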

Detailed Capability Matrix

Technology Detects Slow Leaks Detects Fast Leaks Kernel Memory User Memory Language Agnostic Real-time Detection Historical Analysis Root Cause Analysis
Page Fault Tracing ⚠️
jemalloc Profiling ⚠️
tcmalloc Profiling ⚠️
mimalloc Profiling ⚠️ ⚠️ ⚠️
PSI + Metrics
Hardware PMCs
SWAT (Statistical) ⚠️
Precog (ML) ⚠️ ⚠️
BCC memleak ✅*
ByteHound
Parca ⚠️
Pixie ⚠️
Pyroscope
Valgrind/Massif
Heaptrack
ASAN/LSAN

*BCC memleak can trace kernel allocations (kmalloc/kfree)

Use Case Recommendations

By Deployment Scenario

| Scenario | Primary Choice | Secondary Choice | Avoid |
|---|---|---|---|
| Always-on Production | PSI + Page Faults | jemalloc (4% acceptable) | Valgrind, ASAN, Full memleak |
| Kubernetes | Pixie | Parca | Non-container aware tools |
| High-Performance Systems | Hardware PMCs | Page Fault Tracing | Any allocator instrumentation |
| Development/Testing | Valgrind/ASAN | Heaptrack | - |
| Quick Investigation | jemalloc profiling | Sampled BCC memleak | Full tracing |
| Deep Root Cause Analysis | ByteHound | Full BCC memleak | Surface-level metrics |
| Java Applications | JVM Native Tools | - | C/C++ specific tools |
| Embedded Systems | Custom lightweight | mimalloc | Heavy profilers |

By Leak Characteristics

| Leak Type | Best Tools | Why |
|---|---|---|
| Slow, gradual leaks | PSI + Metrics, Page Faults | Low overhead for long-term monitoring |
| Fast, obvious leaks | jemalloc, tcmalloc | Quick detection with stack traces |
| Small, intermittent | ByteHound, Full memleak | Need complete tracking |
| Unknown source | SWAT, Statistical approaches | Pattern recognition helps |
| Container escapes | Pixie, Parca | Container-aware |
| Kernel memory | BCC memleak (kernel mode) | Specialized for kernel |
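
To make the "slow, gradual leaks" row concrete, one simple metric-based check is to fit a straight line to periodic memory samples and look at the slope. The sketch below is an illustrative assumption, not the algorithm used by any tool in this matrix; the function name and the sample format are hypothetical.

```python
def growth_rate(samples):
    """Least-squares slope of memory usage over time.

    samples -- list of (timestamp_seconds, resident_bytes) pairs
    Returns the fitted growth in bytes per second.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    num = sum((t - mean_t) * (y - mean_y) for t, y in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("timestamps must not all be identical")
    return num / den
```

A persistently positive slope over a long window is only a hint that something is growing; it does not identify the allocation site, which is why the matrix pairs such signals with a deeper tool.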

Quantitative Performance Comparison

| Metric | Best Performers | Acceptable | Poor |
|---|---|---|---|
| CPU Overhead | PMCs (0%), PSI (0%) | Page Faults (<1%), Parca (1-2%) | Valgrind (2000%) |
| Memory Overhead | PSI, PMCs, Page Faults | jemalloc (~10%) | ASAN (2-3x) |
| Latency Impact | PMCs, PSI | jemalloc (+10% P99) | Valgrind (serialization) |
| Detection Speed | Direct tracing | Statistical (minutes) | ML approaches (hours) |
| Accuracy | Valgrind, ASAN, ByteHound | jemalloc, BCC | PSI, Metrics only |

Implementation Effort Comparison

| Approach | Lines of Code | Dependencies | Maintenance | Expertise Required |
|---|---|---|---|---|
| PSI Monitoring | ~50 | /proc/pressure | Low | Low |
| Page Fault eBPF | ~200 | BCC/bpftrace | Medium | Medium |
| jemalloc Integration | ~100 | jemalloc lib | Low | Low |
| PMC Analysis | ~500 | perf, PAPI | High | High |
| ML Detection | ~1000+ | sklearn, data pipeline | High | High |
| ByteHound | ~50 (usage) | ByteHound binary | Medium | Medium |
| Parca/Pixie | ~200 | K8s, operators | High | Medium |
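
The PSI Monitoring row is small because the kernel exposes pressure data as plain text. A minimal sketch of a reader for `/proc/pressure/memory` follows; the function names and the 10% threshold are hypothetical choices, while the `some`/`full` lines with `avg10`, `avg60`, `avg300`, and `total` fields are the kernel's documented format.

```python
def parse_psi(text):
    """Parse PSI text into {"some": {...}, "full": {...}} with float values."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        # Each field looks like "avg10=0.00" or "total=12345".
        result[kind] = {k: float(v) for k, v in (f.split("=", 1) for f in fields)}
    return result


def read_memory_pressure(path="/proc/pressure/memory"):
    """Read and parse the memory pressure file (Linux 4.20+ with PSI enabled)."""
    with open(path) as fh:
        return parse_psi(fh.read())


def is_under_pressure(psi, threshold=10.0):
    """True if tasks stalled on memory for >= threshold percent of the last 60 s.

    The threshold is an illustrative assumption and needs tuning per workload.
    """
    return psi["some"]["avg60"] >= threshold
```

As the matrix notes, this only signals that memory pressure exists; it cannot attribute the pressure to a leaking allocation.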

Cost-Benefit Analysis

High Value (Low Cost, High Benefit)

  1. Page Fault Tracing: <1% overhead, good detection
  2. PSI Monitoring: 0% overhead, early warning
  3. jemalloc (existing users): If jemalloc is already in use, enable its profiling

Medium Value

  1. Allocator Switch: ~4% overhead, but requires migrating to a new allocator
  2. Sampled BCC: 10-30% overhead, periodic use only
  3. Statistical Approaches: Need tuning and baseline

Low Value (High Cost, Limited Benefit)

  1. Full malloc/free tracing: 30-400% overhead
  2. Valgrind in production: 2000% overhead
  3. Custom ML solutions: High development cost

Decision Tree

Is this production?
├─ Yes
│  ├─ Can tolerate 4% overhead?
│  │  ├─ Yes → jemalloc/tcmalloc profiling
│  │  └─ No → Page fault tracing + PSI
│  └─ Kubernetes environment?
│     ├─ Yes → Pixie or Parca
│     └─ No → Standard Linux tools
└─ No (Development)
   ├─ Need complete accuracy?
   │  ├─ Yes → Valgrind or ASAN
   │  └─ No → Heaptrack or ByteHound
   └─ Quick check only?
      └─ Yes → jemalloc one-time profile
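
The tree above can also be expressed as a small function. This is a sketch under one interpretation: sibling questions on the same branch are treated as independent, so the function returns one recommendation per question rather than picking a single winner. The function and parameter names are hypothetical.

```python
def recommend(production, tolerates_4pct_overhead=False, kubernetes=False,
              needs_complete_accuracy=False, quick_check_only=False):
    """Return the list of recommendations reached by walking the decision tree."""
    picks = []
    if production:
        # First production question: is ~4% overhead tolerable?
        if tolerates_4pct_overhead:
            picks.append("jemalloc/tcmalloc profiling")
        else:
            picks.append("Page fault tracing + PSI")
        # Second production question: platform.
        picks.append("Pixie or Parca" if kubernetes else "Standard Linux tools")
    else:
        # Development branch.
        if needs_complete_accuracy:
            picks.append("Valgrind or ASAN")
        else:
            picks.append("Heaptrack or ByteHound")
        if quick_check_only:
            picks.append("jemalloc one-time profile")
    return picks
```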

Key Insights from Matrix

  1. No Silver Bullet: No single tool excels at all dimensions
  2. Overhead vs Accuracy Trade-off: Universal across all approaches
  3. Production Viability Threshold: ~5% overhead is the practical limit
  4. Layered Approach Optimal: Combine low-overhead detection with targeted deep analysis
  5. Platform Matters: Kubernetes environments have specialized, superior tools
  6. Frame Pointers Critical: Most eBPF tools require them, but modern compilers omit them by default
  7. Statistical Sampling Works: 4% overhead for 90%+ accuracy is achievable
  8. Hardware Counters Underutilized: Zero overhead but require expertise
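
The ~5% viability threshold from insight 3 can be applied mechanically to the overhead column. The sketch below uses a handful of rows from the comparison matrix, taking the upper bound of each stated range (so "<1%" becomes 1.0 and "10-30%" becomes 30.0); the data layout and function name are hypothetical.

```python
# Upper-bound overhead (percent) for a few rows of the comparison matrix.
OVERHEAD_UPPER_BOUND = {
    "Page Fault Tracing": 1.0,       # listed as <1%
    "jemalloc Profiling": 4.0,       # listed as ~4%
    "tcmalloc Profiling": 5.0,       # listed as ~5%
    "PSI + Metrics": 0.0,            # listed as 0%
    "BCC memleak (sampled)": 30.0,   # listed as 10-30%
    "Valgrind/Massif": 2000.0,       # listed as 2000%
}


def production_viable(overheads, limit=5.0):
    """Return the tool names whose worst-case overhead is within the limit."""
    return [name for name, pct in overheads.items() if pct <= limit]
```

Filtering on overhead alone ignores accuracy and granularity, so a shortlist produced this way is best read as the low-overhead detection layer, with the heavier tools reserved for targeted analysis.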