Memory Technologies Production Limited SWAT Statistical - antimetal/system-agent GitHub Wiki
SWAT (Statistical Adaptive Profiling) is Microsoft Research's production-ready memory leak detector that achieves <5% runtime overhead with <10% false positives. Developed by Trishul M. Chilimbi (Microsoft Research) and Matthias Hauswirth (University of Colorado at Boulder), SWAT detects "stale" objects - memory allocations not accessed for a configurable time period - as potential memory leaks.
The tool has been successfully deployed across multiple Microsoft product groups for 18+ months (as of 2004), proving its effectiveness in production environments where traditional memory profilers would impose prohibitive overhead.
Metric | Performance |
---|---|
Runtime Overhead | <5% |
Space Overhead | <10% (often <5%) |
False Positive Rate | <10% |
Detection Accuracy | Up to 72% improvement over existing techniques |
Production Ready | Yes (18+ months at Microsoft) |
Platform Support | Windows, Linux |
- Production Feasible: Low enough overhead to run continuously in production systems
- Long-Running Leak Detection: Can track leaks that take days to manifest
- Complete Coverage: Finds all leaks that occur during program execution
- Debug Context: Provides last access location for leaked objects
SWAT's core innovation is adaptive sampling that addresses traditional profiling limitations:
Sampling Rate = 1 / Execution Frequency
Problem Solved: Traditional uniform sampling provides poor coverage of infrequently executed code paths where bugs often hide.
Solution: Sample code segments at rates inversely proportional to their execution frequency, ensuring comprehensive coverage across all execution paths.
1. Allocation Tracking:
- Hook malloc/calloc/realloc/free calls
- Build heap model of all active allocations
- Track allocation metadata (size, timestamp, call stack)
2. Access Monitoring:
- Use adaptive profiling to monitor load/store operations
- Record last access timestamp for each allocated object
- Maintain access history with minimal overhead
3. Staleness Analysis:
- Calculate staleness = current_time - last_access_time
- Compare against configurable staleness threshold
- Objects exceeding threshold flagged as potential leaks
4. Leak Reporting:
- Generate reports for stale objects
- Include allocation call stack and last access location
- Provide debugging context for leak resolution
SWAT maintains a lightweight heap model:
typedef struct heap_object {
void* address; // Object memory address
size_t size; // Allocation size
uint64_t alloc_time; // Allocation timestamp
uint64_t last_access; // Last access timestamp
call_stack_t alloc_stack; // Allocation call stack
call_stack_t access_stack; // Last access call stack
} heap_object_t;
SWAT employs statistical analysis to reduce false positives:
- Confidence Intervals: Calculate statistical confidence in staleness measurements
- Access Pattern Analysis: Consider historical access patterns
- Threshold Adaptation: Dynamically adjust staleness thresholds based on program behavior
// Adaptive sampling rate calculation
double calculate_sampling_rate(code_segment_t* segment) {
double execution_frequency = segment->execution_count / total_executions;
double base_rate = 0.01; // 1% base sampling rate
return base_rate / execution_frequency;
}
// Sampling decision
bool should_sample(code_segment_t* segment) {
double rate = calculate_sampling_rate(segment);
return (random_double() < rate);
}
// Object access hook
void on_memory_access(void* address, access_type_t type) {
heap_object_t* object = find_heap_object(address);
if (object && should_sample(current_segment)) {
object->last_access = get_timestamp();
if (type == ACCESS_WRITE) {
object->access_stack = capture_call_stack();
}
}
}
// Staleness detection
bool is_stale(heap_object_t* object, uint64_t threshold) {
uint64_t current_time = get_timestamp();
return (current_time - object->last_access) > threshold;
}
// Memory allocation hook
void* hooked_malloc(size_t size) {
void* ptr = original_malloc(size);
if (ptr) {
heap_object_t object = {
.address = ptr,
.size = size,
.alloc_time = get_timestamp(),
.last_access = get_timestamp(),
.alloc_stack = capture_call_stack()
};
register_heap_object(object);
}
return ptr;
}
// Memory deallocation hook
void hooked_free(void* ptr) {
unregister_heap_object(ptr);
original_free(ptr);
}
// Leak report structure
typedef struct leak_report {
size_t total_leaked_bytes;
size_t object_count;
leak_object_t* objects;
double confidence_score;
} leak_report_t;
// Generate leak report
leak_report_t* generate_leak_report(uint64_t staleness_threshold) {
leak_report_t* report = allocate_report();
for (heap_object_t* obj = heap_objects; obj; obj = obj->next) {
if (is_stale(obj, staleness_threshold)) {
add_to_report(report, obj);
}
}
calculate_confidence(report);
return report;
}
SWAT has been successfully deployed across multiple Microsoft product groups since 2004:
- Deployment Duration: 18+ months of continuous production use
- Product Coverage: Multiple Microsoft software products
-
Success Metrics:
- <10% false positive rate maintained across deployments
- Significant reduction in memory leak-related production issues
- Developer productivity improvements through actionable debugging information
- Long-Running Leaks: Successfully detected leaks that take days to manifest
- Production Viability: Proven feasible for continuous production monitoring
- Developer Adoption: High acceptance due to low false positive rates
- Debugging Efficiency: Last access location significantly reduces debugging time
"Low-overhead Memory Leak Detection Using Adaptive Statistical Profiling"
- Authors: Trishul M. Chilimbi (Microsoft Research), Matthias Hauswirth (University of Colorado at Boulder)
- Conference: ASPLOS XI (11th International Conference on Architectural Support for Programming Languages and Operating Systems)
- Date: October 2004, Boston, MA
- URL: Microsoft Research Publication
- Paper URL: https://people.cs.umass.edu/~emery/classes/cmpsci691s-fall2004/swat_asplos_final.pdf
- Academic Impact: Introduced the concept of "stale" objects for memory leak detection
- Follow-up Research: Influenced Bond and McKinley's work (2006, 2008, 2009) on leak pruning and memory leak tolerance
- Industry Adoption: Demonstrated viability of statistical profiling for production memory management
- GWP-ASan (2024): "Sampling-Based Detection of Memory-Safety Bugs in Production" - continues SWAT's low-overhead approach
- Adaptive Runtime Monitoring (2023): "Software runtime monitoring with adaptive sampling rate" - extends adaptive sampling concepts
- Container-Based Detection: "Precise memory leak detection for java software using container profiling" - applies similar principles to Java environments
#include <stdlib.h>
#include <time.h>
#include <stdint.h>
typedef struct {
uint64_t execution_count;
double last_sampling_rate;
uint64_t total_samples;
} code_segment_stats_t;
// Global execution counter
static uint64_t global_execution_count = 0;
// Calculate adaptive sampling rate
double adaptive_sampling_rate(code_segment_stats_t* segment) {
if (segment->execution_count == 0) {
return 1.0; // Sample first execution
}
double frequency = (double)segment->execution_count / global_execution_count;
double base_rate = 0.01; // 1% base rate
// Inverse proportional sampling
return fmin(base_rate / frequency, 1.0);
}
// Sampling decision
bool should_sample_execution(code_segment_stats_t* segment) {
segment->execution_count++;
global_execution_count++;
double rate = adaptive_sampling_rate(segment);
double random_val = (double)rand() / RAND_MAX;
if (random_val < rate) {
segment->total_samples++;
return true;
}
return false;
}
#include <pthread.h>
#include <sys/time.h>
typedef struct heap_object {
void* address;
size_t size;
uint64_t alloc_timestamp;
uint64_t last_access_timestamp;
pthread_mutex_t access_mutex;
struct heap_object* next;
} heap_object_t;
// Global heap tracking
static heap_object_t* heap_objects = NULL;
static pthread_rwlock_t heap_lock = PTHREAD_RWLOCK_INITIALIZER;
// Get high-resolution timestamp
uint64_t get_timestamp_us() {
struct timeval tv;
gettimeofday(&tv, NULL);
return tv.tv_sec * 1000000ULL + tv.tv_usec;
}
// Update object access time
void update_access_time(void* address) {
pthread_rwlock_rdlock(&heap_lock);
heap_object_t* obj = find_object_by_address(address);
if (obj) {
pthread_mutex_lock(&obj->access_mutex);
obj->last_access_timestamp = get_timestamp_us();
pthread_mutex_unlock(&obj->access_mutex);
}
pthread_rwlock_unlock(&heap_lock);
}
// Check if object is stale
bool is_object_stale(heap_object_t* obj, uint64_t threshold_us) {
uint64_t current_time = get_timestamp_us();
uint64_t staleness;
pthread_mutex_lock(&obj->access_mutex);
staleness = current_time - obj->last_access_timestamp;
pthread_mutex_unlock(&obj->access_mutex);
return staleness > threshold_us;
}
#include <math.h>
typedef struct {
double mean;
double std_dev;
size_t sample_count;
double confidence_level;
} staleness_stats_t;
// Calculate statistical confidence in staleness detection
double calculate_confidence(heap_object_t* objects, size_t count) {
if (count < 2) return 0.0;
// Calculate mean staleness
double sum = 0.0;
for (size_t i = 0; i < count; i++) {
uint64_t staleness = get_timestamp_us() - objects[i].last_access_timestamp;
sum += staleness;
}
double mean = sum / count;
// Calculate standard deviation
double variance_sum = 0.0;
for (size_t i = 0; i < count; i++) {
uint64_t staleness = get_timestamp_us() - objects[i].last_access_timestamp;
double diff = staleness - mean;
variance_sum += diff * diff;
}
double std_dev = sqrt(variance_sum / (count - 1));
// Calculate 95% confidence interval
double t_value = 1.96; // For large samples (normal approximation)
double margin_of_error = t_value * (std_dev / sqrt(count));
double confidence = 1.0 - (margin_of_error / mean);
return fmax(0.0, fmin(1.0, confidence));
}
// SWAT integration header
#ifndef SWAT_INTEGRATION_H
#define SWAT_INTEGRATION_H
// Configuration structure
typedef struct {
uint64_t staleness_threshold_us;
double base_sampling_rate;
size_t max_tracked_objects;
bool enable_statistical_analysis;
bool production_mode;
} swat_config_t;
// Initialize SWAT
int swat_init(swat_config_t* config);
// Memory allocation hooks
void* swat_malloc(size_t size);
void* swat_calloc(size_t nmemb, size_t size);
void* swat_realloc(void* ptr, size_t size);
void swat_free(void* ptr);
// Access monitoring
void swat_memory_access(void* address, size_t size, bool is_write);
// Leak detection and reporting
typedef struct {
size_t leak_count;
size_t total_leaked_bytes;
double confidence_score;
heap_object_t** leaked_objects;
} swat_leak_report_t;
swat_leak_report_t* swat_detect_leaks();
void swat_print_report(swat_leak_report_t* report);
void swat_free_report(swat_leak_report_t* report);
// Cleanup
void swat_cleanup();
#endif // SWAT_INTEGRATION_H
// Recommended staleness thresholds by application type
typedef enum {
SWAT_INTERACTIVE_APP, // GUI applications, web servers
SWAT_BATCH_PROCESSING, // Data processing, compilation
SWAT_LONG_RUNNING, // Services, daemons
SWAT_REAL_TIME // Embedded systems, games
} app_type_t;
uint64_t get_recommended_threshold(app_type_t type) {
switch (type) {
case SWAT_INTERACTIVE_APP:
return 30 * 1000000; // 30 seconds
case SWAT_BATCH_PROCESSING:
return 5 * 60 * 1000000; // 5 minutes
case SWAT_LONG_RUNNING:
return 60 * 60 * 1000000; // 1 hour
case SWAT_REAL_TIME:
return 5 * 1000000; // 5 seconds
default:
return 60 * 1000000; // 1 minute default
}
}
// Dynamic sampling rate adjustment
typedef struct {
double current_rate;
double target_overhead;
uint64_t measurement_window;
uint64_t last_adjustment;
} adaptive_config_t;
void adjust_sampling_rate(adaptive_config_t* config, double measured_overhead) {
if (measured_overhead > config->target_overhead * 1.2) {
// Reduce sampling rate
config->current_rate *= 0.8;
} else if (measured_overhead < config->target_overhead * 0.8) {
// Increase sampling rate
config->current_rate *= 1.1;
}
// Clamp to reasonable bounds
config->current_rate = fmax(0.001, fmin(0.1, config->current_rate));
config->last_adjustment = get_timestamp_us();
}
// Memory overhead management
typedef struct {
size_t max_objects;
size_t current_objects;
size_t memory_limit_bytes;
size_t current_memory_usage;
bool overflow_protection;
} memory_limits_t;
bool check_memory_limits(memory_limits_t* limits, size_t requested_size) {
if (limits->current_objects >= limits->max_objects) {
if (limits->overflow_protection) {
// Remove oldest objects to make space
cleanup_oldest_objects(limits->max_objects / 10);
} else {
return false;
}
}
if (limits->current_memory_usage + requested_size > limits->memory_limit_bytes) {
return false;
}
return true;
}
SWAT introduced the groundbreaking concept of using object "staleness" as a proxy for memory leaks. This heuristic is based on the observation that:
- Leaked Objects Are Typically Not Accessed: Memory leaks usually occur when programs lose references to allocated memory, making it impossible to access those objects
- Time-Based Detection: Objects that remain unaccessed for extended periods are strong candidates for memory leaks
- Low False Positive Rate: The staleness heuristic naturally filters out objects that are legitimately unused but will be accessed later
The effectiveness of staleness detection stems from fundamental memory usage patterns:
Legitimate Memory Usage:
- Allocate → Use Frequently → Eventually Free
- Allocate → Use Occasionally → Keep for Future Use
Memory Leak Patterns:
- Allocate → Use Initially → Lose Reference → Never Access Again
- Allocate → Never Use → Forget to Free
SWAT employs statistical analysis to quantify confidence in leak detection:
- Threshold Significance: Uses configurable confidence intervals (typically 95%)
- Pattern Recognition: Analyzes access patterns to distinguish normal usage from abandonment
- False Positive Minimization: Statistical validation reduces false alarms to <10%
Approach | Runtime Overhead | Space Overhead | Production Viable |
---|---|---|---|
SWAT (Statistical) | <5% | <10% | ✅ Yes |
Direct Tracing | 50-200% | 100-500% | ❌ No |
Valgrind | 1000-3000% | 300-1000% | ❌ No |
Advantage: SWAT's statistical sampling dramatically reduces overhead while maintaining detection effectiveness.
Approach | Signal Quality | Platform Support | Implementation Complexity |
---|---|---|---|
SWAT | Direct object access tracking | Cross-platform | Moderate |
Hardware Counters | Indirect cache/TLB signals | Limited to specific CPUs | High |
PMU-based | Statistical approximation | Intel/AMD specific | Very High |
Advantage: SWAT provides direct evidence of object abandonment rather than inferring from indirect hardware signals.
Approach | Complexity | Training Requirements | Production Readiness |
---|---|---|---|
SWAT | Rule-based heuristic | None | ✅ 18+ months proven |
Machine Learning | High algorithmic complexity | Requires training data | |
Neural Networks | Very high | Large labeled datasets | ❌ Not production-ready |
Advantage: SWAT's straightforward heuristic approach requires no training, has predictable behavior, and has extensive production validation.
Based on experimental results with SPEC CINT2000 benchmarks:
- Detection Improvement: Up to 72% more memory leaks found compared to existing data sampling techniques
- Overhead Comparison: Comparable overhead to traditional sampling while providing superior detection rates
- False Positive Rates: Consistently maintains <10% false positives across different workloads
- Start Conservative: Begin with higher staleness thresholds and lower sampling rates
- Gradual Tuning: Adjust parameters based on observed false positive rates and detection effectiveness
- Production Monitoring: Implement feedback loops to automatically tune parameters based on production workloads
- Debugging Integration: Provide rich debugging information including call stacks and access patterns
- Thread Safety: Implement proper synchronization for multi-threaded environments
- Memory Management: Use bounded data structures to prevent SWAT itself from consuming excessive memory
- Performance Monitoring: Continuously monitor SWAT's own overhead and adjust sampling accordingly
- Integration Testing: Thoroughly test with representative production workloads before deployment
The SWAT approach represents a mature, production-proven solution for memory leak detection that balances detection effectiveness with practical deployment constraints, making it an excellent foundation for system-agent implementations.