brk/mmap System Call Tracing
Overview
System call tracing for memory management operations provides a coarse-grained but production-safe approach to detecting memory leaks and growth patterns. This method focuses on tracing heap expansion (brk/sbrk) and memory mapping (mmap/munmap) system calls to monitor virtual memory allocation patterns at the kernel level.
- Traces heap expansion (brk) and memory mapping (mmap) syscalls
- Very low overhead (<1%)
- Coarse-grained but production-safe
- Detects heap growth patterns
- Shows virtual memory expansion patterns
Unlike fine-grained malloc/free tracing, system call tracing operates at the kernel boundary where memory segments are actually allocated or expanded, providing insights into underlying memory management behavior without the overhead of user-space function hooking.
Performance Characteristics
Metric | Value |
---|---|
Overhead | <1% |
Accuracy | Low (very coarse) |
False Positives | High |
Production Ready | Yes |
Platform | Linux with eBPF |
Frequency | Low (syscalls are infrequent) |
Granularity | System-level, not allocation-specific |
The extremely low overhead makes this approach suitable for continuous monitoring in production environments. However, the coarse granularity means it serves best as an early warning system rather than a precise diagnostic tool.
System Calls Traced
Core Memory Management System Calls
brk() - Heap Segment Expansion
- Purpose: Changes the end of the data segment (program break)
- Usage: Traditional heap growth mechanism
- Modern Context: Less common due to allocator changes
- Kernel Function: sys_brk() or SyS_brk()
- Tracepoint: syscalls:sys_enter_brk (Linux 4.14+)
sbrk() - Heap Size Changes
- Purpose: Incremental heap size adjustment
- Usage: Wrapper around brk() for relative changes
- Return Value: Previous program break address
- Implementation: A C library function rather than a separate Linux syscall; it is implemented via brk(), so it appears in traces as brk() calls
mmap() - Memory Mapping
- Purpose: Maps files or anonymous memory into address space
- Usage: Large allocations (>MMAP_THRESHOLD, typically 128KB)
- Kernel Function: sys_mmap() or SyS_mmap()
- Tracepoint: syscalls:sys_enter_mmap (Linux 4.14+)
- Flags: MAP_ANONYMOUS for heap-like allocations
munmap() - Memory Unmapping
- Purpose: Removes memory mappings
- Usage: Frees large allocated blocks
- Kernel Function: sys_munmap()
- Tracepoint: syscalls:sys_enter_munmap (Linux 4.14+)
mremap() - Memory Remapping
- Purpose: Expands or moves existing memory mappings
- Usage: Realloc operations on large blocks
- Kernel Function: sys_mremap()
- Tracepoint: syscalls:sys_enter_mremap (Linux 4.14+)
Allocation Strategy Context
Modern malloc implementations use different strategies:
- Small allocations: Usually from pre-allocated pools
- Medium allocations: Traditional brk/sbrk heap expansion
- Large allocations: Direct mmap calls (glibc default >128KB)
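To make this split concrete, here is a minimal sketch of which kernel path a single malloc request is likely to take. It assumes glibc's default 128KB mmap threshold and a deliberately simplified view of the heap; `expected_kernel_path` and `free_heap_bytes` are illustrative names, not part of any allocator API.

```python
# Simplified model of allocator behavior, for illustration only.
# Assumption: glibc defaults, where requests at or above M_MMAP_THRESHOLD
# (128KB unless tuned or dynamically adjusted) are served by mmap().
MMAP_THRESHOLD = 128 * 1024


def expected_kernel_path(request_bytes, free_heap_bytes,
                         mmap_threshold=MMAP_THRESHOLD):
    """Return the syscall a request is likely to trigger: 'mmap', 'brk' or 'none'."""
    if request_bytes >= mmap_threshold:
        return "mmap"   # large request: dedicated anonymous mapping
    if request_bytes <= free_heap_bytes:
        return "none"   # served from memory the allocator already holds
    return "brk"        # heap must grow to satisfy the request


if __name__ == "__main__":
    for size, free in [(64, 4096), (64 * 1024, 1024), (200 * 1024, 0)]:
        print(size, free, expected_kernel_path(size, free))
```

The "none" case is why this technique is coarse: most allocations never reach the kernel, so they are invisible to syscall tracing.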
System-Agent Implementation Plan
eBPF Programs for Syscall Tracing
// Pseudo-code structure for eBPF program
struct syscall_event {
u32 pid;
u32 tid;
u64 timestamp;
u64 size;
u64 addr;
u32 syscall_id;
char comm[16];
};
// Attach to syscall tracepoints
SEC("tracepoint/syscalls/sys_enter_brk")
int trace_brk_enter(struct trace_event_raw_sys_enter *ctx);
SEC("tracepoint/syscalls/sys_enter_mmap")
int trace_mmap_enter(struct trace_event_raw_sys_enter *ctx);
SEC("tracepoint/syscalls/sys_exit_mmap")
int trace_mmap_exit(struct trace_event_raw_sys_exit *ctx);
Growth Pattern Detection Algorithm
- Baseline Establishment: Track normal allocation patterns per process
- Trend Analysis: Detect sustained growth over time windows
- Threshold Monitoring: Alert on size or frequency anomalies
- Correlation: Match with process memory metrics and PSI data
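The trend-analysis step can be sketched as a check over recent per-process samples. This is one possible heuristic, not the agent's actual algorithm: the function name, the `(timestamp, net_bytes)` sample shape and the `min_rising_fraction` parameter are assumptions made for illustration, while the 5-minute window and 10MB growth figure echo the thresholds used elsewhere on this page.

```python
def detect_sustained_growth(samples, window_s=300,
                            min_growth_bytes=10 * 1024 * 1024,
                            min_rising_fraction=0.8):
    """samples: list of (timestamp_s, net_bytes) tuples sorted by time.

    Returns True when the most recent window shows both enough total
    growth and a mostly monotonic rise (to ignore sawtooth patterns).
    """
    if len(samples) < 2:
        return False
    end_ts = samples[-1][0]
    recent = [(t, v) for t, v in samples if t >= end_ts - window_s]
    if len(recent) < 2:
        return False
    total_growth = recent[-1][1] - recent[0][1]
    steps = list(zip(recent, recent[1:]))
    rising = sum(1 for (_, a), (_, b) in steps if b > a)
    return (total_growth >= min_growth_bytes
            and rising / len(steps) >= min_rising_fraction)


if __name__ == "__main__":
    mb = 1024 * 1024
    leak = [(i * 30, i * 2 * mb) for i in range(11)]            # steady climb
    sawtooth = [(i * 30, (i % 2) * 50 * mb) for i in range(11)]  # alloc/free cycle
    print(detect_sustained_growth(leak), detect_sustained_growth(sawtooth))
```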
Integration Points
- Layer 1 Monitoring: Feed into existing metrics pipeline
- Process Correlation: Link with process memory stats from /proc
- Alert Generation: Trigger detailed profiling when patterns detected
- Historical Analysis: Store trends for capacity planning
How It Works
Syscall Entry/Exit Tracing
The eBPF programs attach to kernel tracepoints that fire when system calls are invoked:
- Entry Hook: Capture parameters (size, flags, addresses)
- Exit Hook: Capture return values and success/failure status
- Event Generation: Package data for user-space analysis
- Filtering: Apply PID/process filters to reduce noise
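On the user-space side, entry and exit events have to be paired before they are useful, because the size is only known at entry and the result only at exit. A minimal sketch of that pairing, keyed by thread ID, follows. The class and field names are hypothetical; the only kernel fact relied on is that Linux syscalls report failure as a return value between -4095 and -1.

```python
class EntryExitPairer:
    """Pairs sys_enter and sys_exit events for the same thread."""

    MAX_ERRNO = 4095  # Linux encodes errors as -1..-4095

    def __init__(self):
        self._pending = {}  # tid -> entry data awaiting its exit event

    def on_enter(self, tid, syscall, size):
        # A thread can only be inside one syscall at a time, so tid is a safe key.
        self._pending[tid] = {"syscall": syscall, "size": size}

    def on_exit(self, tid, ret):
        """Return the completed event, or None if the entry was never seen."""
        entry = self._pending.pop(tid, None)
        if entry is None:
            return None  # tracing started mid-syscall, or the entry was filtered out
        entry["ret"] = ret
        entry["ok"] = not (-self.MAX_ERRNO <= ret < 0)
        return entry


if __name__ == "__main__":
    pairer = EntryExitPairer()
    pairer.on_enter(tid=101, syscall="mmap", size=1 << 20)
    print(pairer.on_exit(tid=101, ret=0x7F0000000000))
    pairer.on_enter(tid=102, syscall="mmap", size=1 << 40)
    print(pairer.on_exit(tid=102, ret=-12))  # -ENOMEM
```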
Size Tracking Methodology
# Track cumulative allocation sizes per process
brk_size[pid] += new_brk - old_brk
mmap_size[pid] += allocation_size
munmap_size[pid] += deallocation_size  # kept as a positive running total
net_growth[pid] = mmap_size[pid] - munmap_size[pid] + brk_size[pid]
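A runnable version of this bookkeeping might look like the sketch below. `SizeTracker` and its method names are illustrative; the one design choice worth noting is that the first brk value seen for a process only seeds the baseline, since there is no earlier break to subtract from.

```python
from collections import defaultdict


class SizeTracker:
    """Accumulates per-process byte totals from brk/mmap/munmap events."""

    def __init__(self):
        self.last_brk = {}                    # pid -> most recent program break
        self.brk_bytes = defaultdict(int)     # net heap growth via brk (can shrink)
        self.mmap_bytes = defaultdict(int)    # total bytes mapped
        self.munmap_bytes = defaultdict(int)  # total bytes unmapped

    def on_brk(self, pid, new_brk):
        old_brk = self.last_brk.get(pid)
        if old_brk is not None:
            self.brk_bytes[pid] += new_brk - old_brk
        self.last_brk[pid] = new_brk  # first observation only sets the baseline

    def on_mmap(self, pid, length):
        self.mmap_bytes[pid] += length

    def on_munmap(self, pid, length):
        self.munmap_bytes[pid] += length

    def net_growth(self, pid):
        return (self.mmap_bytes[pid] - self.munmap_bytes[pid]
                + self.brk_bytes[pid])


if __name__ == "__main__":
    tracker = SizeTracker()
    tracker.on_brk(42, 0x1000)
    tracker.on_brk(42, 0x3000)
    tracker.on_mmap(42, 16384)
    tracker.on_munmap(42, 4096)
    print(tracker.net_growth(42))
```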
Frequency Analysis
- Call Rate Monitoring: Syscalls per second per process
- Burst Detection: Unusual allocation patterns
- Periodicity Analysis: Regular allocation cycles
- Growth Rate Calculation: Size increase over time
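Call rate monitoring can be approximated with a sliding window of event timestamps per process. The sketch below is a minimal illustration under that assumption; `CallRateMonitor` is a hypothetical name, and a production version would likely bucket events rather than store every timestamp.

```python
from collections import deque


class CallRateMonitor:
    """Tracks syscalls per second over a sliding time window."""

    def __init__(self, window_s=1.0):
        self.window_s = window_s
        self._times = deque()

    def record(self, timestamp_s):
        """Record one syscall and return the current rate in calls/second."""
        self._times.append(timestamp_s)
        # Drop events that have fallen out of the window.
        while self._times and self._times[0] <= timestamp_s - self.window_s:
            self._times.popleft()
        return len(self._times) / self.window_s


if __name__ == "__main__":
    monitor = CallRateMonitor(window_s=1.0)
    for ts in (0.0, 0.1, 0.2, 5.0):
        print(ts, monitor.record(ts))
```

Comparing the returned rate against a per-process limit gives a simple burst detector.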
Code Examples
bpftrace Script for Basic Monitoring
#!/usr/bin/env bpftrace
// brk-mmap-monitor.bt - Monitor memory allocation syscalls
BEGIN {
printf("Monitoring brk/mmap syscalls. Ctrl-C to end.\n");
printf("%-8s %-16s %-8s %-12s %-8s\n", "TIME", "COMM", "PID", "SYSCALL", "SIZE");
}
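// Trace brk() system calls
tracepoint:syscalls:sys_enter_brk {
    $brk = args->brk;
    $old_brk = @brk_size[pid];
    // Ignore brk(0) queries, and report growth only once a previous break is
    // known; otherwise the first call would report the whole address as a delta.
    if ($brk != 0 && $old_brk != 0 && $brk > $old_brk) {
        $delta = $brk - $old_brk;
        printf("%-8u %-16s %-8d %-12s %8d\n",
            elapsed / 1000000, comm, pid, "brk", $delta);
        @total_brk[pid] += $delta;
    }
    // Always remember the latest requested break, including shrinks
    if ($brk != 0) {
        @brk_size[pid] = $brk;
    }
}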
// Trace mmap() system calls for anonymous mappings
tracepoint:syscalls:sys_enter_mmap {
$flags = args->flags;
$size = args->len;
// Focus on anonymous mappings (heap-like allocations)
if ($flags & 0x20) { // MAP_ANONYMOUS
printf("%-8u %-16s %-8d %-12s %8d\n",
elapsed / 1000000, comm, pid, "mmap", $size);
@mmap_size[pid] += $size;
@mmap_count[pid]++;
}
}
// Trace munmap() system calls
tracepoint:syscalls:sys_enter_munmap {
$size = args->len;
printf("%-8u %-16s %-8d %-12s %8d\n",
elapsed / 1000000, comm, pid, "munmap", $size);
@munmap_size[pid] += $size;
@munmap_count[pid]++;
}
// Summary on exit
END {
    printf("\nPer-process totals (maps are keyed by PID):\n");
    // bpftrace prints every remaining map automatically on exit, so the
    // per-PID totals follow this line. Drop the internal bookkeeping map first.
    clear(@brk_size);
}
BCC Python Program for Advanced Analysis
#!/usr/bin/env python3
# brk-mmap-tracer.py - Advanced memory syscall tracer
from bcc import BPF
from time import sleep
import argparse
# eBPF program
bpf_source = """
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>
struct event_t {
u32 pid;
u32 tid;
u64 timestamp;
u64 size;
u64 addr;
u32 syscall;
char comm[TASK_COMM_LEN];
};
BPF_PERF_OUTPUT(events);
BPF_HASH(brk_size, u32, u64);
BPF_HASH(process_stats, u32, u64);
// Trace brk syscall
TRACEPOINT_PROBE(syscalls, sys_enter_brk) {
    struct event_t event = {};
    u64 pid_tgid = bpf_get_current_pid_tgid();
    u32 pid = pid_tgid >> 32;
    u64 curr_brk = args->brk;
    if (curr_brk == 0)
        return 0; // brk(0) only queries the current break
    u64 *prev_brk = brk_size.lookup(&pid);
    if (prev_brk && curr_brk > *prev_brk) {
        event.pid = pid;
        event.tid = (u32)pid_tgid;
        event.timestamp = bpf_ktime_get_ns();
        event.size = curr_brk - *prev_brk;
        event.addr = curr_brk;
        event.syscall = 1; // brk
        bpf_get_current_comm(&event.comm, sizeof(event.comm));
        // TRACEPOINT_PROBE exposes its context as `args`, not `ctx`
        events.perf_submit(args, &event, sizeof(event));
    }
    // Always store the latest break so the first call seeds the baseline
    brk_size.update(&pid, &curr_brk);
    return 0;
}
// Trace mmap syscall
TRACEPOINT_PROBE(syscalls, sys_enter_mmap) {
    struct event_t event = {};
    u32 flags = args->flags;
    // Only trace anonymous mappings
    if (flags & 0x20) { // MAP_ANONYMOUS
        u64 pid_tgid = bpf_get_current_pid_tgid();
        event.pid = pid_tgid >> 32;
        event.tid = (u32)pid_tgid;
        event.timestamp = bpf_ktime_get_ns();
        event.size = args->len;
        event.addr = 0; // Not known at entry; only available at sys_exit_mmap
        event.syscall = 2; // mmap
        bpf_get_current_comm(&event.comm, sizeof(event.comm));
        events.perf_submit(args, &event, sizeof(event));
    }
    return 0;
}
"""
class MemoryTracer:
    def __init__(self, pid=None):
        self.pid = pid
        self.bpf = BPF(text=bpf_source)
        self.syscall_names = {1: 'brk', 2: 'mmap', 3: 'munmap'}

    def handle_event(self, cpu, data, size):
        event = self.bpf["events"].event(data)
        # Apply the -p filter in user space; unfiltered events are dropped here
        if self.pid is not None and event.pid != self.pid:
            return
        syscall = self.syscall_names.get(event.syscall, 'unknown')
        comm = event.comm.decode(errors='replace')
        print(f"{event.timestamp/1e9:.6f} {comm:16} "
              f"{event.pid:8d} {syscall:8} {event.size:12d}")

    def run(self):
        print("Tracing memory allocation syscalls... Ctrl-C to exit")
        print(f"{'TIME':>16} {'COMM':16} {'PID':8} {'SYSCALL':8} {'SIZE':12}")
        self.bpf["events"].open_perf_buffer(self.handle_event)
        try:
            while True:
                self.bpf.perf_buffer_poll()
        except KeyboardInterrupt:
            pass


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Trace brk/mmap syscalls")
    parser.add_argument("-p", "--pid", type=int, help="Process ID to trace")
    args = parser.parse_args()
    tracer = MemoryTracer(pid=args.pid)
    tracer.run()
Growth Pattern Analysis Script
#!/bin/bash
# analyze-growth-patterns.sh - Analyze collected syscall data
# Run bpftrace and collect data
sudo bpftrace brk-mmap-monitor.bt > /tmp/syscall-trace.log &
TRACE_PID=$!
# Let it run for monitoring period
sleep 300 # 5 minutes
# Stop tracing and wait for the log to be flushed
kill $TRACE_PID
wait $TRACE_PID 2>/dev/null
# Analysis: match the SYSCALL column ($4) exactly so banner and header lines
# are skipped. Assumes process names contain no spaces.
echo "Top processes by total brk growth:"
awk '$4 == "brk" {brk[$3]+=$NF} END {for(pid in brk) print brk[pid], pid}' \
/tmp/syscall-trace.log | sort -nr | head -10
echo "Top processes by mmap count:"
awk '$4 == "mmap" {mmap[$3]++} END {for(pid in mmap) print mmap[pid], pid}' \
/tmp/syscall-trace.log | sort -nr | head -10
echo "Processes with unbalanced mmap/munmap:"
awk '$4 == "mmap" {mmap[$3]++} $4 == "munmap" {munmap[$3]++}
END {for(pid in mmap)
if(mmap[pid] - munmap[pid] > 5)
print pid, mmap[pid] - munmap[pid]}' \
/tmp/syscall-trace.log
Detection Patterns
Continuous brk() Increases
- Pattern: Steady, incremental brk() calls over time
- Indication: Traditional heap growth, possibly from malloc fragmentation
- Threshold: >10MB total growth without corresponding shrinkage
- Time Window: 5-minute intervals for trend detection
Large mmap() Allocations
- Pattern: Individual mmap() calls >1MB
- Indication: Large object allocations or buffer creation
- Threshold: Single allocations >128MB (configurable)
- Frequency: >100 large allocations per minute
Unmatched mmap/munmap Ratios
- Pattern: mmap calls significantly exceed munmap calls
- Calculation: mmap_count - munmap_count > threshold
- Threshold: >50% imbalance over 10-minute window
- Weighting: Consider allocation sizes, not just call counts
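Because call counts alone can mislead (one 1GB mapping outweighs a thousand 4KB ones), the imbalance is better computed on bytes. A minimal sketch of a size-weighted ratio follows; the function name and the choice to clamp at zero are assumptions for illustration.

```python
def mmap_imbalance(mmap_bytes, munmap_bytes):
    """Fraction of mapped bytes not yet unmapped, in the range 0.0 to 1.0."""
    if mmap_bytes <= 0:
        return 0.0
    # munmap can release mappings created before tracing started,
    # so clamp at zero instead of reporting a negative imbalance.
    return max(0.0, (mmap_bytes - munmap_bytes) / mmap_bytes)


if __name__ == "__main__":
    window = {"mmap_bytes": 100 * 1024 * 1024, "munmap_bytes": 40 * 1024 * 1024}
    ratio = mmap_imbalance(**window)
    print(f"imbalance: {ratio:.0%}, alert: {ratio > 0.5}")
```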
Frequency Anomalies
- Pattern: Unusual syscall frequency compared to baseline
- Detection: Statistical deviation from historical averages
- Baseline: 7-day moving average per process
- Threshold: >3 standard deviations from baseline
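The standard-deviation rule can be sketched with the standard library alone. This is an illustrative helper, not the agent's implementation; `is_frequency_anomaly` and the list-of-rates baseline format are assumptions, and how the 7-day baseline is stored is left open.

```python
from statistics import mean, stdev


def is_frequency_anomaly(baseline_rates, current_rate, max_sigma=3.0):
    """Flag current_rate when it lies more than max_sigma deviations from baseline."""
    if len(baseline_rates) < 2:
        return False  # not enough history to estimate a deviation
    mu = mean(baseline_rates)
    sigma = stdev(baseline_rates)
    if sigma == 0:
        return current_rate != mu  # perfectly flat baseline: any change stands out
    return abs(current_rate - mu) / sigma > max_sigma


if __name__ == "__main__":
    history = [10, 12, 11, 9, 10, 11, 12, 10]  # syscalls/second samples
    print(is_frequency_anomaly(history, 11))
    print(is_frequency_anomaly(history, 40))
```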
Growth Rate Analysis
# Calculate growth velocity
growth_rate = (current_size - previous_size) / time_delta
acceleration = (current_rate - previous_rate) / time_delta
# Alert conditions
if growth_rate > threshold_rate:
    alert("High growth rate detected")
if acceleration > threshold_acceleration:
    alert("Accelerating memory growth")
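As a worked example of these two formulas, the sketch below computes velocity and acceleration from three consecutive `(timestamp, size)` samples. The function names and sample format are illustrative assumptions.

```python
def growth_rate(prev, curr):
    """Bytes per second between two (timestamp_s, size_bytes) samples."""
    (t0, s0), (t1, s1) = prev, curr
    return (s1 - s0) / (t1 - t0)


def acceleration(a, b, c):
    """Change in growth rate per second across three consecutive samples."""
    previous_rate = growth_rate(a, b)
    current_rate = growth_rate(b, c)
    return (current_rate - previous_rate) / (c[0] - b[0])


if __name__ == "__main__":
    samples = [(0, 0), (10, 100), (20, 400)]
    print(growth_rate(samples[1], samples[2]))  # 30.0 bytes/s
    print(acceleration(*samples))               # 2.0 bytes/s^2
```

A positive acceleration means the process is not just growing but growing faster, which is usually a stronger signal than the rate alone.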
Monitoring & Alerting
Growth Rate Thresholds
Tier 1 - Information
- brk growth: >1MB/minute sustained for 5 minutes
- mmap growth: >10MB/minute sustained for 5 minutes
- Action: Log event, no immediate alert
Tier 2 - Warning
- brk growth: >5MB/minute sustained for 10 minutes
- mmap growth: >50MB/minute sustained for 10 minutes
- Unbalanced ratio: >70% mmap without corresponding munmap
- Action: Generate warning alert, tag for review
Tier 3 - Critical
- brk growth: >20MB/minute sustained for 15 minutes
- mmap growth: >200MB/minute sustained for 15 minutes
- Memory exhaustion risk: Growth rate projects to 90% memory usage
- Action: Critical alert, trigger detailed profiling
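The three tiers can be expressed as a small lookup, checked from most to least severe. This sketch covers only the rate and duration conditions and assumes a tier fires when either the brk or the mmap rate exceeds its limit; the source does not say whether the conditions combine with "or" or "and", so treat that as a guess. `classify_tier` is a hypothetical name.

```python
# (tier name, brk MB/min, mmap MB/min, minutes sustained), most severe first.
# Values are the thresholds listed in the tiers above.
TIERS = [
    ("critical", 20, 200, 15),
    ("warning", 5, 50, 10),
    ("info", 1, 10, 5),
]


def classify_tier(brk_mb_per_min, mmap_mb_per_min, sustained_min):
    """Return the highest matching tier name, or None if nothing matches."""
    for name, brk_limit, mmap_limit, minutes in TIERS:
        exceeds = brk_mb_per_min > brk_limit or mmap_mb_per_min > mmap_limit
        if exceeds and sustained_min >= minutes:
            return name
    return None


if __name__ == "__main__":
    print(classify_tier(25, 0, 15))   # fast and long enough for critical
    print(classify_tier(25, 0, 10))   # same rate, but only sustained 10 minutes
    print(classify_tier(0.5, 5, 60))  # below every threshold
```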
Allocation Size Limits
# Alert configuration
size_thresholds:
  single_mmap_warning: 128MB
  single_mmap_critical: 512MB
  cumulative_brk_warning: 100MB
  cumulative_brk_critical: 500MB
frequency_thresholds:
  mmap_per_second_warning: 10
  mmap_per_second_critical: 50
  brk_per_second_warning: 5
  brk_per_second_critical: 20
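Size strings such as `128MB` need converting to bytes before they can be compared with syscall arguments. The sketch below shows one way to do that, assuming binary units (128MB = 134217728 bytes, matching the value used in the dashboard query on this page). `parse_size` and `mmap_severity` are hypothetical helpers, and the configuration is passed as a plain dict so no YAML library is required.

```python
import re

_UNITS = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def parse_size(text):
    """Convert a string like '128MB' to bytes (binary units assumed)."""
    match = re.fullmatch(r"(\d+)\s*(KB|MB|GB)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def mmap_severity(size_bytes, size_thresholds):
    """Return 'critical', 'warning' or None for a single mmap of size_bytes."""
    if size_bytes >= parse_size(size_thresholds["single_mmap_critical"]):
        return "critical"
    if size_bytes >= parse_size(size_thresholds["single_mmap_warning"]):
        return "warning"
    return None


if __name__ == "__main__":
    config = {"single_mmap_warning": "128MB", "single_mmap_critical": "512MB"}
    print(parse_size("128MB"))
    print(mmap_severity(200 * 1024 * 1024, config))
```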
Pattern Matching Rules
# Alert rule examples
def evaluate_memory_patterns(process_data):
    alerts = []
    # Sustained growth pattern
    if detect_sustained_growth(process_data, window=300):
        alerts.append("sustained_memory_growth")
    # Allocation without deallocation
    if calculate_allocation_balance(process_data) < 0.3:
        alerts.append("poor_deallocation_ratio")
    # Frequency spike
    if detect_frequency_anomaly(process_data):
        alerts.append("syscall_frequency_anomaly")
    return alerts
Integration with Existing Systems
- Prometheus Metrics: Export syscall statistics as time series
- Grafana Dashboards: Visualize growth patterns and trends
- PagerDuty Integration: Route critical alerts to on-call teams
- Log Aggregation: Ship detailed events to centralized logging
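For the Prometheus path, per-process totals can be rendered in the text exposition format without any client library. The sketch below is illustrative only: it reuses the `brk_total_bytes` metric name and `process` label that appear in the dashboard queries on this page, and it does not escape special characters in label values, which a real exporter would need to do.

```python
def to_prometheus(metric, values_by_process, metric_type="counter"):
    """Render {process_name: value} as Prometheus text exposition lines."""
    lines = [f"# TYPE {metric} {metric_type}"]
    for process, value in sorted(values_by_process.items()):
        # Label values are written as-is; escaping is omitted in this sketch.
        lines.append(f'{metric}{{process="{process}"}} {value}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_prometheus("brk_total_bytes", {"nginx": 4096, "postgres": 1048576}),
          end="")
```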
Limitations
Very Coarse Granularity
- No Allocation Sites: Cannot identify specific code locations causing leaks
- No Call Stacks: Missing context about what triggered allocations
- Virtual vs Physical: Tracks virtual memory allocation, not actual usage
- Aggregated View: Cannot distinguish between many small vs few large allocations
High False Positive Rate
- Normal Growth: Applications legitimately growing memory usage
- Caching Behavior: Memory maps used for caching appear as leaks
- Batch Processing: Periodic large allocations appear anomalous
- Initialization Phase: Startup allocations trigger false alerts
Missing Critical Information
- No Leak Source Identification: Cannot pinpoint leaking functions
- No Object-Level Tracking: Cannot track specific data structures
- No Allocation Lifetime: Cannot determine how long allocations persist
- Limited Context: Missing application-level semantics
Platform Limitations
- Linux-Specific: eBPF implementation tied to Linux kernel
- Kernel Version: Requires modern kernel for tracepoint support
- Permission Requirements: Needs root, or CAP_BPF plus CAP_PERFMON on Linux 5.8+ (CAP_SYS_ADMIN on older kernels)
- Architecture Dependencies: Some features may vary by CPU architecture
Use Cases
Early Warning System
- Primary Role: First-line defense against memory exhaustion
- Integration: Trigger more expensive detailed profiling tools
- Baseline Establishment: Learn normal allocation patterns per service
- Capacity Planning: Track long-term memory growth trends
Heap Growth Monitoring
- Traditional Allocators: Monitor brk-based heap expansion
- Modern Allocators: Track mmap-based large allocations
- Fragmentation Detection: Identify inefficient heap usage patterns
- Allocator Performance: Compare allocation strategies across processes
Large Allocation Detection
- Buffer Management: Detect oversized buffer allocations
- Memory-Intensive Operations: Identify processes consuming large memory blocks
- Resource Planning: Understand peak memory requirements
- Anomaly Detection: Flag unusual large allocation patterns
Supplementary Signal for Comprehensive Monitoring
- Multi-Layer Approach: Combine with malloc tracing, PSI metrics, page fault analysis
- Correlation Analysis: Cross-reference with application metrics
- Root Cause Analysis: Provide high-level context for detailed investigations
- Historical Trends: Long-term memory usage pattern analysis
Comparison with Alternatives
vs malloc() Tracing (BCC memleak, etc.)
Aspect | System Call Tracing | malloc() Tracing |
---|---|---|
Overhead | <1% | 5-20% |
Granularity | Very coarse | Fine-grained |
Production Use | Always safe | Risky in high-throughput |
Call Stack | No | Yes |
Allocation Sites | No | Yes |
False Positives | High | Low |
Best Use Case | Early warning | Precise diagnosis |
vs Page Fault Analysis
Aspect | System Call Tracing | Page Fault Tracing |
---|---|---|
Signal Type | Virtual allocation | Physical access |
Timing | Allocation time | Access time |
Memory Pressure | Indirect | Direct |
Write vs Read | No distinction | Can distinguish |
Performance Impact | Very low | Low-medium |
Use Case | Growth patterns | Usage patterns |
vs PSI (Pressure Stall Information)
Aspect | System Call Tracing | PSI Metrics |
---|---|---|
Granularity | Per-process | System-wide or per-cgroup |
Real-time | Event-based | Polling |
Memory Pressure | Predictive | Current |
Overhead | Minimal | Near zero |
Actionability | High | Medium |
Complement | Yes | Yes |
Best Combined Strategy
System call tracing works best as part of a layered approach:
- Layer 1: System call tracing (continuous, low overhead)
- Layer 2: PSI metrics + page fault analysis (context)
- Layer 3: Detailed malloc tracing (triggered by Layer 1/2 alerts)
- Layer 4: Application profiling (when precise diagnosis needed)
eBPF Implementation
Tracepoint Attachment Strategy
// Modern approach using tracepoints (Linux 4.14+)
SEC("tracepoint/syscalls/sys_enter_brk")
int trace_brk_enter(struct trace_event_raw_sys_enter *ctx) {
u64 brk_addr = ctx->args[0];
u32 pid = bpf_get_current_pid_tgid() >> 32;
// Process brk syscall
return handle_brk_syscall(ctx, pid, brk_addr);
}
// Fallback for older kernels using kprobes
SEC("kprobe/sys_brk")
int kprobe_sys_brk(struct pt_regs *ctx) {
u64 brk_addr = PT_REGS_PARM1(ctx);
u32 pid = bpf_get_current_pid_tgid() >> 32;
return handle_brk_syscall(ctx, pid, brk_addr);
}
Syscall Arguments Extraction
// mmap syscall argument structure
struct mmap_args {
unsigned long addr;
unsigned long len;
unsigned long prot;
unsigned long flags;
unsigned long fd;
unsigned long offset;
};
SEC("tracepoint/syscalls/sys_enter_mmap")
int trace_mmap_enter(struct trace_event_raw_sys_enter *ctx) {
struct mmap_args *args = (struct mmap_args *)ctx->args;
// Filter for anonymous mappings
if (args->flags & MAP_ANONYMOUS) {
struct event_t event = {};
event.size = args->len;
event.flags = args->flags;
// ... populate event and submit
}
return 0;
}
Return Value Processing
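SEC("tracepoint/syscalls/sys_exit_mmap")
int trace_mmap_exit(struct trace_event_raw_sys_exit *ctx) {
    long ret = ctx->ret;
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    if (ret > 0) {
        // Successful allocation (failures return a negative errno)
        struct allocation_t alloc = {};
        alloc.addr = ret;
        alloc.timestamp = bpf_ktime_get_ns();
        // Store for tracking. libbpf-style programs use the helper call
        // rather than BCC's map.update() method. The map is keyed by PID,
        // so only the most recent mapping per process is kept.
        bpf_map_update_elem(&allocations, &pid, &alloc, BPF_ANY);
    }
    return 0;
}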
Error Handling and Edge Cases
// Handle edge cases and errors
static int handle_syscall_error(long ret_code) {
switch (ret_code) {
case -ENOMEM:
// Out of memory - important signal
increment_oom_counter();
return 1;
case -EINVAL:
// Invalid arguments - likely application bug
increment_invalid_args_counter();
return 0;
default:
return 0;
}
}
Memory Management for eBPF Maps
// Efficient map structures
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 10240);
__type(key, u32); // PID
__type(value, struct process_memory_stats);
} process_stats SEC(".maps");
// Cleanup old entries to prevent map overflow
static void cleanup_old_entries(void) {
u64 current_time = bpf_ktime_get_ns();
u64 cutoff_time = current_time - (5 * 60 * 1000000000ULL); // 5 minutes
// Iterate and cleanup (pseudo-code, actual implementation varies)
// bpf_for_each_map_elem(&process_stats, cleanup_callback, &cutoff_time, 0);
}
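An alternative to in-kernel cleanup is to evict stale entries from user space, where iterating a map is straightforward. The sketch below shows only the eviction logic on a plain dict standing in for the map contents; the `last_seen_ns` field and the function name are assumptions, and a real agent would apply the same deletes to the BPF map through its loader library.

```python
FIVE_MINUTES_NS = 5 * 60 * 10 ** 9


def evict_stale(stats, now_ns, max_age_ns=FIVE_MINUTES_NS):
    """Delete entries not updated within max_age_ns; return the evicted PIDs."""
    stale = [pid for pid, entry in stats.items()
             if now_ns - entry["last_seen_ns"] > max_age_ns]
    for pid in stale:  # delete after the scan to avoid mutating during iteration
        del stats[pid]
    return stale


if __name__ == "__main__":
    now = 10 * 60 * 10 ** 9  # ten minutes after an arbitrary start
    stats = {
        100: {"last_seen_ns": now - 60 * 10 ** 9},      # seen 1 minute ago
        200: {"last_seen_ns": now - 8 * 60 * 10 ** 9},  # seen 8 minutes ago
    }
    print(evict_stale(stats, now), sorted(stats))
```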
Integration Strategies
Combine with PSI (Pressure Stall Information)
# SyscallTracer, PSIMonitor and DetailedProfiler are placeholders for
# components defined elsewhere in the agent.
GROWTH_RATE_THRESHOLD = 10 * 1024 * 1024  # bytes/minute; illustrative, tune per service


class IntegratedMemoryMonitor:
    def __init__(self):
        self.syscall_tracer = SyscallTracer()
        self.psi_monitor = PSIMonitor()

    def analyze_memory_pressure(self):
        # Get current PSI metrics
        psi_data = self.psi_monitor.get_memory_pressure()
        # Get recent syscall activity
        syscall_data = self.syscall_tracer.get_recent_activity()
        # Correlate signals
        if (psi_data.memory_pressure > 0.1
                and syscall_data.growth_rate > GROWTH_RATE_THRESHOLD):
            return self.trigger_detailed_analysis()
        return None

    def trigger_detailed_analysis(self):
        """Launch more expensive profiling when signals align"""
        return DetailedProfiler().start_profiling()
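The PSI side of this integration reads `/proc/pressure/memory`, which contains one `some` line and one `full` line of `key=value` fields. A minimal parser might look like the sketch below; `parse_psi` is a hypothetical helper, and the sample text only mirrors the file's layout rather than real measurements. Note that the `avg*` fields are percentages of stall time, so any threshold should be chosen on that scale.

```python
def parse_psi(text):
    """Parse PSI file contents into {'some': {...}, 'full': {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        # Each field looks like 'avg10=0.00'; 'total' is stall time in microseconds.
        result[kind] = {key: float(value)
                        for key, value in (field.split("=") for field in fields)}
    return result


if __name__ == "__main__":
    sample = (
        "some avg10=1.50 avg60=0.75 avg300=0.10 total=12345\n"
        "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
    )
    psi = parse_psi(sample)
    print(psi["some"]["avg10"], psi["full"]["avg10"])
    # On a live system with PSI enabled:
    # with open("/proc/pressure/memory") as f:
    #     psi = parse_psi(f.read())
```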
Correlate with Process Metrics
#!/bin/bash
# Correlation script
# Collect syscall data
sudo bpftrace syscall-tracer.bt > /tmp/syscalls.log &
TRACER_PID=$!
# Collect process memory stats
while true; do
for pid in $(pgrep -f "target_process"); do
echo "$(date +%s) $pid $(awk '/VmRSS/ {print $2}' /proc/$pid/status)"
done
sleep 10
done > /tmp/process_memory.log &
PROC_PID=$!
# Run for collection period
sleep 300
# Cleanup
kill $TRACER_PID $PROC_PID
# Correlate data
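python3 << 'EOF'
import pandas as pd

# The trace log is whitespace-aligned: split on runs of whitespace and drop
# banner/header lines that do not parse as numeric rows
syscall_df = pd.read_csv('/tmp/syscalls.log', sep=r'\s+', on_bad_lines='skip',
                         names=['timestamp', 'comm', 'pid', 'syscall', 'size'])
for col in ('pid', 'size'):
    syscall_df[col] = pd.to_numeric(syscall_df[col], errors='coerce')
syscall_df = syscall_df.dropna(subset=['pid', 'size'])
syscall_df['pid'] = syscall_df['pid'].astype(int)
memory_df = pd.read_csv('/tmp/process_memory.log', sep=r'\s+',
                        names=['timestamp', 'pid', 'rss_kb'])

# Join per-PID totals so both series cover the same processes in the same order
per_pid = pd.concat([syscall_df.groupby('pid')['size'].sum(),
                     memory_df.groupby('pid')['rss_kb'].max()],
                    axis=1, join='inner')
# Find correlation between syscall activity and RSS growth
# (only meaningful when at least two processes were observed)
correlation = per_pid['size'].corr(per_pid['rss_kb'])
print(f"Correlation coefficient: {correlation:.3f}")
EOF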
Trigger Detailed Profiling
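import subprocess
import time


class AdaptiveProfiler:
    def __init__(self, target_pid):
        self.target_pid = target_pid
        self.syscall_monitor = SyscallMonitor()  # placeholder component
        self.profiler_proc = None
        self.profiler_active = False

    def monitor_loop(self):
        while True:
            metrics = self.syscall_monitor.get_metrics()
            if self.should_trigger_profiling(metrics):
                self.start_detailed_profiling()
            elif self.should_stop_profiling(metrics):
                self.stop_detailed_profiling()
            time.sleep(30)

    def should_trigger_profiling(self, metrics):
        """Decide when to start expensive profiling"""
        return (metrics.growth_rate > GROWTH_THRESHOLD and
                metrics.allocation_frequency > FREQ_THRESHOLD and
                not self.profiler_active)

    def should_stop_profiling(self, metrics):
        """Stop once growth drops back under the threshold"""
        return self.profiler_active and metrics.growth_rate <= GROWTH_THRESHOLD

    def start_detailed_profiling(self):
        """Start malloc-level tracing"""
        # The memleak binary name varies by distribution
        # (for example memleak-bpfcc on Debian/Ubuntu)
        self.profiler_proc = subprocess.Popen(
            ['bcc-memleak', '-p', str(self.target_pid)])
        self.profiler_active = True

    def stop_detailed_profiling(self):
        """Stop expensive profiling"""
        if self.profiler_proc is not None:
            self.profiler_proc.terminate()
            self.profiler_proc.wait()
            self.profiler_proc = None
        self.profiler_active = False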
Dashboard Integration
# Grafana dashboard configuration
dashboard:
  title: "Memory System Call Monitoring"
  panels:
    - title: "brk() Growth Rate by Process"
      type: "graph"
      targets:
        - expr: 'rate(brk_total_bytes[5m])'
          legendFormat: '{{process}}'
    - title: "mmap/munmap Balance"
      type: "graph"
      targets:
        - expr: 'mmap_total_count - munmap_total_count'
          legendFormat: 'Unbalanced {{process}}'
    - title: "Large Allocation Events"
      type: "table"
      targets:
        - expr: 'mmap_large_allocations > 134217728' # >128MB
brk/mmap system call tracing cannot pinpoint a leak on its own, but its low overhead makes it a practical first line of defense in a multi-layered memory monitoring strategy, with finer-grained tools started only when it raises a signal.