ByteHound

Overview

ByteHound is a sophisticated memory profiler specifically designed for Linux systems, written in Rust. It provides comprehensive memory allocation tracking and analysis capabilities for native applications, with a focus on detecting memory leaks and understanding allocation patterns.

Key Characteristics:

  • Rust-based memory profiler for Linux
  • Complete allocation tracking via LD_PRELOAD interception
  • Custom stack unwinding implementation for performance
  • Web-based visualization UI
  • Production-tested by the author in specific scenarios
  • Multi-architecture support (AMD64, ARM, AArch64, MIPS64)

Performance Characteristics

Overhead Analysis

  • Performance Impact: Variable depending on application - can be significant for performance-critical systems
  • Optimization Focus: Custom stack unwinding implementation that the author describes as potentially orders of magnitude faster than traditional unwinding approaches
  • Production Suitability: Mixed - suitable for investigation periods but may be prohibitive for always-on monitoring
  • Accuracy: High (100% allocation coverage, no sampling)
  • False Positives: Low due to complete tracking

Performance Comparison Context

  • vs Valgrind: Significantly lower overhead than Valgrind's memory tools
  • vs jemalloc profiling: Higher overhead than statistical profiling but provides complete coverage
  • vs TCMalloc profiling: Higher overhead than allocator-based profiling (~4% OPS drop, ~10% P99 latency increase for TCMalloc/jemalloc)
  • Production Reality: TiKV team found overhead too high for their production database workload

Key Features

Core Capabilities

  • Complete Allocation Tracking: Captures every malloc, free, mmap, munmap, and related calls
  • Full Stack Traces: Call stack for every allocation and deallocation
  • Dynamic Culling: Runtime filtering of temporary allocations to reduce data volume
  • Backtrace Deduplication: Runtime backtrace cache with deduplication for reduced data generation
  • No Sampling: 100% coverage of memory operations
  • Real-time Analysis: Can stream profiling data to another machine
  • Multiple Export Formats: JSON, Heaptrack, flamegraph formats

Advanced Features

  • Embedded Scripting: Supports Rhai DSL for custom analysis
  • jemalloc Integration: Special support for jemalloc profiling (AMD64 only)
  • Programmatic Control: Profiling can be started and stopped programmatically from within the application (for external control from an agent, see the hedged signal-based sketch after this list)
  • Shadow Stack Unwinding: Supported on stable Rust, enabled by default
  • Rust Symbol Demangling: Native support for Rust symbol demangling
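
For the system-agent, the simplest way to exercise start/stop control from outside the target is a signal-based toggle, assuming the profiler was launched with such a toggle enabled. The sketch below is hedged: the environment variable named in the comment (MEMORY_PROFILER_REGISTER_SIGUSR1) and the choice of SIGUSR1 are assumptions that should be verified against the upstream ByteHound documentation.

// toggle_profiling.go - hedged sketch: toggle ByteHound profiling via a signal.
// Assumes the target was started with ByteHound preloaded and a signal-based
// toggle enabled (e.g. MEMORY_PROFILER_REGISTER_SIGUSR1=1 -- verify the exact
// variable and signal against the upstream README for your ByteHound version).
package main

import (
    "fmt"
    "os"
    "strconv"
    "syscall"
)

func main() {
    if len(os.Args) != 2 {
        fmt.Fprintln(os.Stderr, "usage: toggle_profiling <pid>")
        os.Exit(1)
    }
    pid, err := strconv.Atoi(os.Args[1])
    if err != nil {
        fmt.Fprintf(os.Stderr, "invalid pid: %v\n", err)
        os.Exit(1)
    }
    // Send SIGUSR1 to the target process to toggle profiling on or off.
    if err := syscall.Kill(pid, syscall.SIGUSR1); err != nil {
        fmt.Fprintf(os.Stderr, "failed to signal process %d: %v\n", pid, err)
        os.Exit(1)
    }
    fmt.Printf("sent SIGUSR1 to pid %d\n", pid)
}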

System-Agent Implementation Plan

Layer 3 Integration Strategy

ByteHound is positioned as a Layer 3 deep analysis tool for critical memory leak investigations:

Triggered Profiling Approach

# 5-10 minute investigation window
export MEMORY_PROFILER_LOG=warn
export MEMORY_PROFILER_OUTPUT=/tmp/memory-profiling.dat
timeout 600 env LD_PRELOAD=./libbytehound.so target-process  # env scopes the preload to the target, not to timeout itself

Implementation Requirements

  • Process Restart: Applications must be restarted with LD_PRELOAD (a verification sketch follows this list)
  • Duration Limits: 5-10 minute profiling windows to manage overhead
  • Data Collection: Automated collection and transfer of profiling data
  • Analysis Pipeline: Automated report generation from profiling data
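
Because profiling only takes effect for processes launched with the library preloaded, the agent should verify that the restart actually picked up ByteHound. A minimal sketch (the PID and library filename are illustrative) that checks /proc/<pid>/maps:

// preload_check.go - sketch: confirm libbytehound.so is mapped into a process.
package main

import (
    "fmt"
    "os"
    "strings"
)

// byteHoundLoaded reports whether the shared library appears in the
// process's memory map, i.e. whether the LD_PRELOAD restart took effect.
func byteHoundLoaded(pid int) (bool, error) {
    data, err := os.ReadFile(fmt.Sprintf("/proc/%d/maps", pid))
    if err != nil {
        return false, err
    }
    return strings.Contains(string(data), "libbytehound.so"), nil
}

func main() {
    // Example: check PID 1234 (replace with the restarted process's PID).
    loaded, err := byteHoundLoaded(1234)
    if err != nil {
        fmt.Fprintln(os.Stderr, "check failed:", err)
        os.Exit(1)
    }
    fmt.Println("ByteHound preloaded:", loaded)
}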

Integration Workflow

  1. Trigger Detection: Memory growth alerts or leak suspicion (see the RSS-growth sketch after this list)
  2. Process Preparation: Restart target process with ByteHound attached
  3. Data Collection: Gather profiling data over investigation period
  4. Analysis: Generate comprehensive leak analysis reports
  5. Cleanup: Return process to normal operation
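
For step 1, one straightforward trigger the agent can implement is sustained RSS growth of the target process. The sketch below is illustrative only; the PID, polling interval, and growth threshold are placeholders:

// rss_trigger.go - sketch: flag sustained RSS growth as a profiling trigger.
package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
    "strings"
    "time"
)

// rssKB returns the resident set size of a process in kilobytes,
// parsed from the VmRSS line of /proc/<pid>/status.
func rssKB(pid int) (int64, error) {
    f, err := os.Open(fmt.Sprintf("/proc/%d/status", pid))
    if err != nil {
        return 0, err
    }
    defer f.Close()
    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        line := scanner.Text()
        if strings.HasPrefix(line, "VmRSS:") {
            fields := strings.Fields(line)
            if len(fields) >= 2 {
                return strconv.ParseInt(fields[1], 10, 64)
            }
        }
    }
    return 0, fmt.Errorf("VmRSS not found for pid %d", pid)
}

func main() {
    const pid = 1234                      // illustrative target PID
    const growthThresholdKB = 512 * 1024  // flag after ~512 MiB of growth
    baseline, err := rssKB(pid)
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    for {
        time.Sleep(30 * time.Second)
        current, err := rssKB(pid)
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        if current-baseline > growthThresholdKB {
            fmt.Printf("RSS grew from %d kB to %d kB: trigger ByteHound investigation\n", baseline, current)
            return
        }
    }
}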

Architecture

Interception Mechanism

ByteHound uses LD_PRELOAD to intercept memory allocation functions:

Intercepted Functions

  • malloc, calloc, realloc, free
  • mmap, munmap, mmap64
  • posix_memalign, aligned_alloc
  • memalign, valloc, pvalloc

Data Structures

  • Allocation Tracking: Hash maps for active allocations (a conceptual sketch follows this list)
  • Stack Trace Cache: Deduplicated backtrace storage
  • Temporal Filtering: Dynamic culling of short-lived allocations
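
ByteHound's internal data structures are an implementation detail; the following is only a conceptual sketch of the approach described above, keeping live allocations in a map keyed by address with each entry referencing a deduplicated backtrace ID:

// allocation_tracking.go - conceptual sketch only; not ByteHound's actual code.
package main

import "fmt"

// BacktraceID refers to an entry in a deduplicated backtrace cache.
type BacktraceID uint64

// Allocation records the metadata kept for each live allocation.
type Allocation struct {
    Size      uint64
    Backtrace BacktraceID
    Timestamp uint64 // monotonic time of the allocation
}

// Tracker holds live allocations keyed by address, mirroring the
// "hash maps for active allocations" idea described above.
type Tracker struct {
    live map[uintptr]Allocation
}

func NewTracker() *Tracker { return &Tracker{live: make(map[uintptr]Allocation)} }

// OnAlloc is called when an intercepted malloc/mmap returns.
func (t *Tracker) OnAlloc(addr uintptr, size uint64, bt BacktraceID, now uint64) {
    t.live[addr] = Allocation{Size: size, Backtrace: bt, Timestamp: now}
}

// OnFree is called when an intercepted free/munmap runs; anything still
// in the map at the end of the session is a leak candidate.
func (t *Tracker) OnFree(addr uintptr) {
    delete(t.live, addr)
}

func main() {
    t := NewTracker()
    t.OnAlloc(0x1000, 64, 1, 10)
    t.OnAlloc(0x2000, 128, 2, 12)
    t.OnFree(0x1000)
    fmt.Println("live allocations:", len(t.live)) // 1 -> leak candidate
}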

Custom Stack Unwinding

ByteHound's performance advantage comes from its custom stack unwinding implementation:

  • DWARF Preprocessing: DWARF debugging information is preprocessed at startup for faster lookups
  • Dynamic Patching: Return addresses are dynamically patched to reduce traversal overhead
  • Optimized Lookups: Stack trace generation optimized for minimal runtime impact

Output Format

  • Continuous Updates: Data file is continuously updated rather than generated at specific intervals (a size-watching sketch follows this list)
  • Streaming Support: Can stream data to remote machines for analysis
  • Multiple Formats: Native format plus exports to JSON, Heaptrack, and flamegraph
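
Since the data file grows for as long as the session runs, a supervising agent may want to watch its size and stop the investigation once a disk budget is exceeded. A small illustrative sketch (the path and budget are placeholders):

// profile_size_watch.go - sketch: watch a growing ByteHound data file.
package main

import (
    "fmt"
    "os"
    "time"
)

func main() {
    const profilePath = "/tmp/memory-profiling.dat" // illustrative path
    const maxBytes = 2 << 30                        // ~2 GiB budget
    for {
        info, err := os.Stat(profilePath)
        if err == nil && info.Size() > maxBytes {
            fmt.Printf("profile %s exceeded %d bytes; consider stopping the session\n",
                profilePath, maxBytes)
            return
        }
        time.Sleep(10 * time.Second)
    }
}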

Production Deployments

Author's Production Experience

The ByteHound author has used the tool in production environments, demonstrating its practical viability for specific use cases.

Success Stories

  • Memory Leak Resolution: Users report successfully identifying and fixing "nasty leaks" using ByteHound
  • Allocation Pattern Analysis: Effective for understanding complex allocation behaviors
  • Development Integration: Used during development cycles for memory optimization

Production Constraints

  • Performance-Critical Systems: May be unsuitable for high-throughput, latency-sensitive applications
  • Limited Always-On Use: Best suited for investigation periods rather than continuous monitoring
  • Resource Requirements: Additional RAM needed for profiling data storage

Layer 3 Tool Positioning

ByteHound is most effective as a specialized investigation tool:

  • Triggered Usage: Deploy when memory issues are suspected
  • Deep Analysis: Comprehensive data for complex leak scenarios
  • Root Cause Analysis: Complete allocation history for thorough investigation

Installation & Setup

Building from Source

# Requirements
# - Rust nightly (version 1.62+)
# - Full GCC toolchain
# - Yarn package manager

# Clone repository
git clone https://github.com/koute/bytehound.git
cd bytehound

# Build profiler
cargo build --release

# Build GUI (requires Yarn)
cd gui
yarn install
yarn build
cd ..

Prebuilt Binaries

Download from GitHub releases: https://github.com/koute/bytehound/releases

Configuration Setup

# Basic profiling setup
export MEMORY_PROFILER_LOG=info
export MEMORY_PROFILER_OUTPUT=memory-profiling.dat

# Advanced configuration
export MEMORY_PROFILER_CULL_TEMPORARY_ALLOCATIONS=1
export MEMORY_PROFILER_GRAB_BACKTRACES=1

Web UI Setup

# Start analysis server
./bytehound server memory-profiling_*.dat

# Access GUI at http://localhost:8080
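
For automation, the agent can start the analysis server itself and wait until the UI responds on the port noted above (8080 is assumed here to be the default; adjust if your build differs). A minimal sketch:

// serve_profile.go - sketch: start the ByteHound analysis server and wait for the UI.
package main

import (
    "fmt"
    "net/http"
    "os"
    "os/exec"
    "time"
)

func main() {
    if len(os.Args) != 2 {
        fmt.Fprintln(os.Stderr, "usage: serve_profile <profile.dat>")
        os.Exit(1)
    }
    // Start `bytehound server` on the captured profile.
    cmd := exec.Command("./bytehound", "server", os.Args[1])
    cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
    if err := cmd.Start(); err != nil {
        fmt.Fprintln(os.Stderr, "failed to start server:", err)
        os.Exit(1)
    }
    // Poll the web UI until it responds (assumes the default port 8080).
    for i := 0; i < 30; i++ {
        resp, err := http.Get("http://localhost:8080")
        if err == nil {
            resp.Body.Close()
            fmt.Println("ByteHound UI is up at http://localhost:8080")
            break
        }
        time.Sleep(time.Second)
    }
    // Keep serving until the process exits (Ctrl+C to stop).
    _ = cmd.Wait()
}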

Code Examples

Integration Wrapper Script

#!/bin/bash
# bytehound-profile.sh - Automated ByteHound profiling wrapper

TARGET_PROCESS=${1}
PROFILE_DURATION=${2:-300}  # Default 5 minutes
OUTPUT_DIR="/tmp/bytehound-profiles"

if [ -z "$TARGET_PROCESS" ]; then
    echo "Usage: $0 <target_process> [duration_seconds]"
    exit 1
fi

# Setup
mkdir -p "$OUTPUT_DIR"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
PROFILE_FILE="$OUTPUT_DIR/memory-profile_${TIMESTAMP}.dat"

# Configure ByteHound
export MEMORY_PROFILER_LOG=warn
export MEMORY_PROFILER_OUTPUT="$PROFILE_FILE"
export MEMORY_PROFILER_CULL_TEMPORARY_ALLOCATIONS=1

echo "Starting ByteHound profiling for $PROFILE_DURATION seconds..."

# Start profiling (env scopes LD_PRELOAD to the target process, not to timeout itself)
timeout "$PROFILE_DURATION" env LD_PRELOAD=./libbytehound.so "$TARGET_PROCESS" &
PROFILE_PID=$!

# Wait for completion
wait $PROFILE_PID

echo "Profiling complete. Data saved to: $PROFILE_FILE"
echo "Start analysis with: ./bytehound server $PROFILE_FILE"

Automated Analysis Setup

#!/usr/bin/env python3
# bytehound-analyzer.py - Automated ByteHound analysis

import subprocess
import json
import sys
from pathlib import Path

class ByteHoundAnalyzer:
    def __init__(self, profile_file):
        self.profile_file = Path(profile_file)
        
    def start_server(self):
        """Start ByteHound analysis server"""
        cmd = ["./bytehound", "server", str(self.profile_file)]
        self.server_proc = subprocess.Popen(cmd)
        
    def export_data(self, format_type="json"):
        """Export profiling data in specified format"""
        output_file = self.profile_file.with_suffix(f".{format_type}")
        cmd = ["./bytehound", "export", str(self.profile_file), "--format", format_type]
        
        with open(output_file, 'w') as f:
            subprocess.run(cmd, stdout=f, check=True)
            
        return output_file
        
    def generate_report(self):
        """Generate summary report from profiling data"""
        # Export as JSON for analysis
        json_file = self.export_data("json")
        
        # Parse and analyze (placeholder for actual analysis logic)
        with open(json_file) as f:
            data = json.load(f)
            
        # Generate summary report
        report = {
            "total_allocations": len(data.get("allocations", [])),
            "peak_memory": self.calculate_peak_memory(data),
            "potential_leaks": self.identify_leaks(data)
        }
        
        return report
        
    def calculate_peak_memory(self, data):
        # Placeholder implementation
        return "N/A"
        
    def identify_leaks(self, data):
        # Placeholder implementation
        return []

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python3 bytehound-analyzer.py <profile_file>")
        sys.exit(1)
        
    analyzer = ByteHoundAnalyzer(sys.argv[1])
    report = analyzer.generate_report()
    
    print(json.dumps(report, indent=2))

System Agent Integration Code

// bytehound-integration.go - System agent ByteHound integration
package main

import (
    "context"
    "fmt"
    "os"
    "os/exec"
    "path/filepath"
    "time"
)

type ByteHoundProfiler struct {
    ProfileDuration time.Duration
    OutputDir       string
    LibraryPath     string
}

func NewByteHoundProfiler() *ByteHoundProfiler {
    return &ByteHoundProfiler{
        ProfileDuration: 5 * time.Minute,
        OutputDir:       "/tmp/bytehound-profiles",
        LibraryPath:     "./libbytehound.so",
    }
}

func (bp *ByteHoundProfiler) ProfileProcess(ctx context.Context, processName string, args []string) error {
    // Ensure output directory exists
    if err := os.MkdirAll(bp.OutputDir, 0755); err != nil {
        return fmt.Errorf("failed to create output directory: %w", err)
    }
    
    // Generate unique profile filename
    timestamp := time.Now().Format("20060102_150405")
    profileFile := filepath.Join(bp.OutputDir, fmt.Sprintf("memory-profile_%s.dat", timestamp))
    
    // Setup environment
    env := os.Environ()
    env = append(env, "MEMORY_PROFILER_LOG=warn")
    env = append(env, fmt.Sprintf("MEMORY_PROFILER_OUTPUT=%s", profileFile))
    env = append(env, "MEMORY_PROFILER_CULL_TEMPORARY_ALLOCATIONS=1")
    env = append(env, fmt.Sprintf("LD_PRELOAD=%s", bp.LibraryPath))
    
    // Create command with timeout
    ctx, cancel := context.WithTimeout(ctx, bp.ProfileDuration)
    defer cancel()
    
    cmd := exec.CommandContext(ctx, processName, args...)
    cmd.Env = env
    
    // Start profiling
    if err := cmd.Run(); err != nil {
        if ctx.Err() == context.DeadlineExceeded {
            // Expected timeout - profiling complete
            fmt.Printf("ByteHound profiling completed. Data saved to: %s\n", profileFile)
            return nil
        }
        return fmt.Errorf("profiling failed: %w", err)
    }
    
    return nil
}

func (bp *ByteHoundProfiler) AnalyzeProfile(profileFile string) error {
    // Start analysis server
    cmd := exec.Command("./bytehound", "server", profileFile)
    return cmd.Start()
}
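
A usage sketch for the profiler type above, assuming it is wired into the agent's entry point (the target binary path is illustrative, and the imports are those already declared in the block above):

// Example invocation of the ByteHoundProfiler defined above (illustrative).
func main() {
    profiler := NewByteHoundProfiler()
    // Profile a hypothetical service binary for the configured duration.
    if err := profiler.ProfileProcess(context.Background(), "/usr/local/bin/target-service", nil); err != nil {
        fmt.Fprintf(os.Stderr, "bytehound profiling failed: %v\n", err)
        os.Exit(1)
    }
}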

Optimization Techniques

Backtrace Deduplication

ByteHound implements runtime backtrace caching to reduce data volume; a conceptual sketch follows the list below:

  • Hash-based Deduplication: Identical stack traces are stored once and referenced
  • Dynamic Cache Management: Cache is managed to balance memory usage and lookup performance
  • Reduced Storage: Significantly reduces profile data size for applications with repetitive allocation patterns
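
The following is only a conceptual sketch of hash-based backtrace interning, not ByteHound's actual implementation; a production version would also compare the frames themselves on hash collisions:

// backtrace_cache.go - conceptual sketch of backtrace deduplication.
package main

import (
    "fmt"
    "hash/fnv"
)

// BacktraceCache assigns a stable ID to each distinct backtrace so that
// repeated allocations from the same call site reference one stored trace.
type BacktraceCache struct {
    ids    map[uint64]uint32    // hash of frames -> backtrace ID
    traces map[uint32][]uintptr // backtrace ID -> stored frames
    next   uint32
}

func NewBacktraceCache() *BacktraceCache {
    return &BacktraceCache{ids: map[uint64]uint32{}, traces: map[uint32][]uintptr{}}
}

// Intern returns the ID for a backtrace, storing it only the first time it is seen.
func (c *BacktraceCache) Intern(frames []uintptr) uint32 {
    h := fnv.New64a()
    for _, pc := range frames {
        var buf [8]byte
        for i := 0; i < 8; i++ {
            buf[i] = byte(pc >> (8 * i))
        }
        h.Write(buf[:])
    }
    key := h.Sum64()
    if id, ok := c.ids[key]; ok {
        return id // already stored: only the small ID needs to be emitted
    }
    c.next++
    c.ids[key] = c.next
    c.traces[c.next] = append([]uintptr(nil), frames...)
    return c.next
}

func main() {
    cache := NewBacktraceCache()
    a := cache.Intern([]uintptr{0x401000, 0x402000})
    b := cache.Intern([]uintptr{0x401000, 0x402000}) // duplicate -> same ID
    fmt.Println(a == b, "distinct traces stored:", len(cache.traces))
}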

Temporary Allocation Filtering

Dynamic culling of short-lived allocations; a conceptual sketch follows the list below:

  • Lifetime Thresholds: Configurable thresholds for allocation lifetime
  • Runtime Decision Making: Decisions made during profiling to avoid post-processing overhead
  • Long-term Profiling: Enables extended profiling sessions without excessive data accumulation
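
Again as a conceptual sketch rather than ByteHound's actual logic, the culling rule can be pictured as: if an allocation is freed before a configurable lifetime threshold elapses, both events are dropped from the output:

// temporary_culling.go - conceptual sketch of lifetime-based culling.
package main

import (
    "fmt"
    "time"
)

// pendingAlloc holds allocations whose fate (temporary vs. long-lived) is unknown.
type pendingAlloc struct {
    allocatedAt time.Time
}

type Culler struct {
    threshold time.Duration // configurable lifetime threshold
    pending   map[uintptr]pendingAlloc
    recorded  int // allocations that outlived the threshold
    culled    int // short-lived allocations dropped from the output
}

func NewCuller(threshold time.Duration) *Culler {
    return &Culler{threshold: threshold, pending: map[uintptr]pendingAlloc{}}
}

func (c *Culler) OnAlloc(addr uintptr, now time.Time) {
    c.pending[addr] = pendingAlloc{allocatedAt: now}
}

// OnFree decides at runtime whether the allocation/free pair is worth keeping.
func (c *Culler) OnFree(addr uintptr, now time.Time) {
    p, ok := c.pending[addr]
    if !ok {
        return
    }
    delete(c.pending, addr)
    if now.Sub(p.allocatedAt) < c.threshold {
        c.culled++ // temporary allocation: drop both events
        return
    }
    c.recorded++ // long-lived allocation: keep it in the profile
}

func main() {
    c := NewCuller(100 * time.Millisecond)
    start := time.Now()
    c.OnAlloc(0x1000, start)
    c.OnFree(0x1000, start.Add(5*time.Millisecond)) // culled
    c.OnAlloc(0x2000, start)
    c.OnFree(0x2000, start.Add(2*time.Second)) // recorded
    fmt.Println("recorded:", c.recorded, "culled:", c.culled)
}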

Custom Stack Unwinding

The core performance optimization in ByteHound:

  • DWARF Preprocessing: Debug information is preprocessed at startup for O(1) lookups
  • Return Address Patching: Dynamic modification of return addresses for efficient stack traversal
  • Minimal Runtime Overhead: Stack trace generation optimized to minimize impact on target application

Data Compression Strategies

  • Incremental Updates: Continuous file updates rather than periodic dumps
  • Structured Storage: Efficient binary format for profiling data
  • Stream Processing: Real-time data streaming to reduce local storage requirements

Comparison with Alternatives

vs Valgrind Memcheck

Aspect             | ByteHound                      | Valgrind Memcheck
Overhead           | Moderate (custom unwinding)    | Very high (~20x slowdown)
Coverage           | 100% of allocations            | 100% of allocations + bounds checking
Production Use     | Limited investigation periods  | Development only
Setup Complexity   | LD_PRELOAD                     | Command wrapper
Real-time Analysis | Yes (web UI)                   | Post-mortem analysis

vs jemalloc Profiling

Aspect           | ByteHound               | jemalloc Profiling
Overhead         | Higher                  | Lower (~4% OPS impact)
Coverage         | Complete tracking       | Statistical sampling
Restart Required | Yes (LD_PRELOAD)        | No (runtime toggle)
Analysis Depth   | Full allocation history | Aggregate statistics
Integration      | External tool           | Built-in allocator feature

vs BCC memleak

Aspect              | ByteHound         | BCC memleak
Overhead            | Moderate          | Lower (eBPF)
Restart Required    | Yes               | No
Kernel Requirements | Standard Linux    | BPF-enabled kernel
Analysis UI         | Web-based GUI     | Command-line output
Historical Data     | Complete timeline | Point-in-time snapshots

vs TCMalloc Profiling

Aspect                | ByteHound             | TCMalloc Profiling
Overhead              | Higher                | Lower (~4% OPS, ~10% P99 latency)
Coverage              | All allocations       | Sampled allocations
Configuration         | Environment variables | Runtime heap profiling
Visualization         | Web UI                | pprof integration
Production Deployment | Investigation periods | Continuous profiling possible

Sweet Spot Analysis

ByteHound is optimal for:

  • Critical Leak Investigations: When complete allocation history is needed
  • Complex Memory Pattern Analysis: Understanding intricate allocation behaviors
  • Development Phase Profiling: Deep analysis during development cycles
  • One-off Investigations: Thorough analysis of specific memory issues

Not optimal for:

  • Always-on Production Monitoring: Overhead may be prohibitive
  • High-frequency Profiling: Better suited for focused investigation periods
  • Performance-critical Systems: May impact application performance too significantly

Web UI Features

Timeline Visualization

  • Memory Usage Timeline: Visual representation of memory usage over time
  • Allocation Rate Charts: Graphs showing allocation and deallocation rates
  • Interactive Timeline: Zoom and pan through profiling session timeline
  • Event Correlation: Correlate allocation events with application behavior

Allocation Site Ranking

  • Top Allocators: Ranked list of allocation sites by total memory allocated
  • Leak Suspects: Allocation sites with high net allocation (allocated minus deallocated)
  • Frequency Analysis: Most frequently called allocation sites
  • Size Distribution: Analysis of allocation size patterns

Memory Growth Analysis

  • Growth Trend Analysis: Identification of memory growth patterns
  • Leak Detection: Automatic identification of potential memory leaks
  • Allocation Lifetime Analysis: Understanding of allocation lifecycle patterns
  • Memory Map Visualization: Visual representation of memory layout

Interactive Exploration

  • Stack Trace Navigation: Click-through navigation of allocation call stacks
  • Source Code Integration: Jump to source code locations (when debug info available)
  • Filtering and Search: Advanced filtering capabilities for specific allocation patterns
  • Export Capabilities: Export analysis results in various formats

Advanced Analysis Features

  • Allocation Clustering: Group similar allocations for pattern analysis
  • Temporal Analysis: Understand how allocation patterns change over time
  • Memory Fragmentation Analysis: Identify memory fragmentation issues
  • Custom Queries: Use embedded scripting for custom analysis queries

Conclusion

ByteHound represents a sophisticated approach to memory profiling, offering complete allocation tracking with optimized performance characteristics. While its overhead makes it unsuitable for always-on production monitoring, it excels as a Layer 3 investigation tool for complex memory leak analysis.

The tool's custom stack unwinding implementation and comprehensive feature set make it particularly valuable for:

  • Deep memory leak investigations
  • Understanding complex allocation patterns
  • Development-phase memory optimization
  • Root cause analysis of memory-related performance issues

Organizations should evaluate ByteHound's overhead in their specific environment and use it strategically for focused investigation periods when comprehensive memory analysis is required.

See Also