Pixie

Overview

Pixie is an open-source Kubernetes-native observability platform that provides instant, zero-instrumentation monitoring and debugging capabilities. Originally developed by Pixie Labs and later acquired by New Relic in 2020, Pixie was contributed to the Cloud Native Computing Foundation (CNCF) as a Sandbox project in June 2021.

Key characteristics:

  • Kubernetes-native observability platform built specifically for container environments
  • Automatic eBPF instrumentation with no code changes required
  • Zero manual instrumentation - automatically instruments applications as soon as they start
  • Language agnostic - works with any programming language or framework
  • Local data processing - all telemetry data remains within the cluster
  • Now part of New Relic with both open-source and managed editions available

Performance Characteristics

  • Overhead: typically 2-5% CPU, and often under 2% in practice
  • Memory Requirements: Minimum 1GiB per node, 2GiB recommended
  • Accuracy: High - captures full-body requests and responses
  • False Positives: Low - eBPF provides accurate kernel-level data
  • Production Ready: Yes - designed for production environments
  • Platform: Kubernetes only
  • Java Profiler Overhead: Ultra-low < 0.1% for continuous profiling

Architecture

Pixie employs a unique edge computing architecture that processes data locally within Kubernetes clusters:

Core Components

  1. Vizier (Control Plane)

    • Manages Pixie Edge Modules (PEMs)
    • Handles query orchestration and metadata management
    • Coordinates data collection and processing
  2. Pixie Edge Modules (PEMs)

    • Deployed as DaemonSet on each node
    • Collect telemetry data using eBPF probes
    • Process and store data locally in-memory
    • Default memory allocation: 2GiB per PEM
  3. eBPF Probes

    • Kernel-level instrumentation
    • Automatic application discovery and monitoring
    • Captures network traffic, system calls, and application metrics
    • No application code changes required

Data Processing Pipeline

  • Collection: eBPF probes capture telemetry at kernel level
  • Processing: Edge modules process data locally on each node
  • Storage: In-memory data tables with configurable retention
  • Querying: PxL scripts execute distributed queries across the cluster
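
As a minimal illustration of the query step, the following PxL sketch counts requests per service from the in-memory http_events table (table and column names follow the examples later on this page):

import px

# HTTP traffic captured by the PEMs over the last five minutes
df = px.DataFrame(table='http_events', start_time='-5m')
df.service = df.ctx['service']  # service name resolved from pod metadata

# The groupby/agg runs on each node and the partial results are merged by Vizier
per_service = df.groupby('service').agg(
    request_count=('latency_ns', px.count)
)

px.display(per_service)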

Data Retention Model

  • Local Storage: All data stored in-memory on cluster nodes
  • No External Dependencies: No data sent outside the cluster by default
  • Memory Allocation: 60% for data storage, 40% for collection
  • Retention Period: Configurable, typically minutes to hours
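
A rough, illustrative estimate of how long data can be retained per node, assuming the 60% storage share above and a hypothetical ingest rate (plain arithmetic, not a Pixie API):

def estimate_retention_minutes(pem_memory_gib=2.0,
                               storage_fraction=0.60,    # 60% of PEM memory holds data tables
                               ingest_mib_per_min=50.0): # assumed telemetry volume per node
    """Back-of-the-envelope retention estimate for a single PEM."""
    storage_mib = pem_memory_gib * 1024 * storage_fraction
    return storage_mib / ingest_mib_per_min

print(f"~{estimate_retention_minutes():.0f} minutes of data per node")  # ~25 minutes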

System-Agent Implementation Plan

Kubernetes Cluster Requirements

Minimum Requirements:

  • Kubernetes 1.16+
  • Linux kernel 4.14+ (for eBPF support)
  • At least 1GiB memory per node
  • CPU architecture: x86_64 or ARM64

Recommended Configuration:

  • Kubernetes 1.20+
  • Linux kernel 5.4+ (optimal eBPF features)
  • 2GiB+ memory per node
  • Nodes should have < 25% memory utilization before Pixie installation
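
A quick pre-install check against these requirements might look like the sketch below; it assumes kubectl is installed and pointed at the target cluster, and that nodes report allocatable memory in Ki (the usual kubelet format).

#!/usr/bin/env python3
"""Rough preflight check for the node requirements listed above."""
import json
import subprocess

MIN_KERNEL = (4, 14)      # minimum kernel for eBPF support
MIN_MEM_KI = 1024 * 1024  # 1GiB expressed in Ki

nodes = json.loads(subprocess.check_output(
    ["kubectl", "get", "nodes", "-o", "json"]))["items"]

for node in nodes:
    name = node["metadata"]["name"]
    kernel = node["status"]["nodeInfo"]["kernelVersion"]   # e.g. "5.15.0-91-generic"
    major_minor = tuple(int(x) for x in kernel.split("-")[0].split(".")[:2])
    mem_ki = int(node["status"]["allocatable"]["memory"].rstrip("Ki"))
    ok = major_minor >= MIN_KERNEL and mem_ki >= MIN_MEM_KI
    print(f"{name}: kernel {kernel}, allocatable {mem_ki // 1024}Mi -> {'OK' if ok else 'CHECK'}")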

Deployment Options

1. CLI Installation (Recommended)

# Install Pixie CLI
bash -c "$(curl -fsSL https://withpixie.ai/install.sh)"

# Deploy to cluster
px deploy --cluster_name=my-cluster

2. Helm Chart Deployment

# Add Pixie Helm repo
helm repo add pixie https://pixie-operator-charts.storage.googleapis.com

# Install Pixie
helm install pixie pixie/pixie-chart \
  --set deployKey=$PIXIE_DEPLOY_KEY \
  --set clusterName=my-cluster

3. Kubectl Manifest

# Apply Pixie operator
kubectl apply -f https://pixie-operator-charts.storage.googleapis.com/latest/pixie_operator.yaml

# Create Vizier custom resource
kubectl apply -f pixie-vizier.yaml

Resource Requirements

Per Node Requirements:

  • CPU: 100-200m reserved, up to 1000m limit
  • Memory: 1-2Gi limit (2Gi recommended for production)
  • Storage: Minimal - uses in-memory storage
  • Network: Access to Pixie cloud services (for managed version)

Cluster-Level Resources:

  • Vizier: 1Gi memory, 1000m CPU
  • PEMs: Scale with node count
  • Total Overhead: ~2-5% of cluster resources

Key Features

Automatic Service Mapping

  • Zero-configuration service discovery across the cluster
  • Real-time service topology visualization
  • Dependency mapping between microservices
  • Traffic flow analysis with request/response patterns
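
The topology can also be queried directly. The sketch below is loosely based on the bundled px/net_flow_graph script; the conn_stats columns and the px.ip_to_pod_id / px.pod_id_to_service_name helpers are assumptions that may differ between Pixie releases.

import px

# Connection-level statistics captured by the eBPF probes
df = px.DataFrame(table='conn_stats', start_time='-5m')

# Source service from pod metadata; destination resolved from the remote IP
df.src_service = df.ctx['service']
df.dst_service = px.pod_id_to_service_name(px.ip_to_pod_id(df.remote_addr))

edges = df.groupby(['src_service', 'dst_service']).agg(
    bytes_sent=('bytes_sent', px.sum),
    bytes_recv=('bytes_recv', px.sum)
)

px.display(edges)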

Request Tracing

  • Full-body request/response capture for supported protocols
  • Unsampled distributed tracing without instrumentation
  • Protocol support: HTTP/HTTPS, gRPC, DNS, MySQL, PostgreSQL, Redis, Kafka, Cassandra, AMQP
  • Real-time traffic inspection with filtering capabilities
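
For example, full request and response bodies for a single endpoint can be pulled straight from the http_events table (the service name and path below are hypothetical; column names follow the examples elsewhere on this page):

import px

# Inspect raw request/response payloads for one endpoint over the last minute
df = px.DataFrame(table='http_events', start_time='-1m')
df = df[df.ctx['service'] == 'checkout']   # hypothetical service
df = df[df.req_path == '/api/v1/orders']   # hypothetical endpoint

px.display(df[['time_', 'req_method', 'req_path', 'req_body',
               'resp_status', 'resp_body', 'latency_ns']])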

CPU and Memory Profiling

  • Continuous profiling with flame graphs
  • Zero-instrumentation profiling for all languages
  • CPU hotspot identification without recompilation
  • Memory usage tracking and leak detection capabilities
  • Call stack analysis with line-level precision
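
Profiling data is exposed through the same query interface. The sketch below aggregates sampled stack traces per pod; the table and column names are assumed from the bundled px/perf_flamegraph script (historically stack_traces.beta) and may vary by release.

import px

# Sampled stack traces collected by the continuous profiler
df = px.DataFrame(table='stack_traces.beta', start_time='-2m')
df.pod = df.ctx['pod']

profile = df.groupby(['pod', 'stack_trace']).agg(
    samples=('count', px.sum)   # sample counts feed the flame graph
)

px.display(profile)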

Network Monitoring

  • Layer 7 protocol analysis without sidecars
  • Network policy validation and traffic visualization
  • Ingress/egress traffic monitoring with full payload capture
  • DNS query analysis and resolution tracking
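
DNS activity, for instance, can be summarized per pod from the dns_events table (a hedged sketch; consult the bundled px/dns_data script for the exact schema in your release):

import px

# DNS lookups captured at Layer 7 by the eBPF probes
df = px.DataFrame(table='dns_events', start_time='-5m')
df.pod = df.ctx['pod']

dns_summary = df.groupby(['pod', 'req_body']).agg(
    lookups=('req_body', px.count)
)

px.display(dns_summary)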

No Sidecars Needed

  • eBPF-based collection eliminates sidecar containers
  • Reduced resource overhead compared to proxy-based solutions
  • Simplified deployment model with DaemonSet architecture
  • Automatic protocol detection and parsing

Production Deployments

Used by Major Companies

Pixie is deployed in production by organizations including:

  • Technology companies for microservices debugging
  • Financial services for real-time transaction monitoring
  • E-commerce platforms for performance optimization
  • Media companies for streaming service analysis

Kubernetes-Specific Advantages

  • Native Kubernetes integration with CRD-based management
  • Pod-aware monitoring with automatic service discovery
  • Namespace isolation and multi-tenancy support
  • RBAC integration for secure access control
  • Helm chart deployment for GitOps workflows

Scale Considerations

  • Linear scaling with node count
  • Memory-bound scaling based on traffic volume
  • Query performance optimized for distributed execution
  • Data locality ensures consistent performance

Success Stories

Outcomes commonly reported by adopters include:

  • 99.9% uptime achieved with proactive monitoring
  • 50% reduction in mean time to resolution (MTTR)
  • Zero application changes required for comprehensive observability
  • Cross-team collaboration improved with shared debugging interface

Installation

CLI Installation

# Install Pixie CLI
bash -c "$(curl -fsSL https://withpixie.ai/install.sh)"

# Authenticate (for managed version)
px auth login

# Deploy to current kubectl context
px deploy --cluster_name=production-cluster

# Verify installation
px get viziers

Helm Chart Deployment

# values.yaml
deployKey: "your-deploy-key-here"
clusterName: "production-cluster"

vizier:
  pemMemoryLimit: "2Gi"
  dataAccess: "Full"

operator:
  image:
    tag: "latest"

# Install with the custom values
helm install pixie pixie/pixie-chart -f values.yaml

Resource Requirements Planning

# Calculate memory requirements
# Formula: (Number of nodes) × (2Gi per PEM) + 1Gi (Vizier)
# Example 10-node cluster: (10 × 2Gi) + 1Gi = 21Gi total

# CPU requirements
# Formula: (Number of nodes) × (200m baseline) + spikes up to 1000m
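
The same arithmetic as a small helper, using the per-PEM and Vizier figures quoted above:

def pixie_memory_gib(node_count, pem_gib=2, vizier_gib=1):
    """Total memory Pixie reserves across the cluster."""
    return node_count * pem_gib + vizier_gib

def pixie_cpu_baseline_millicores(node_count, pem_millicores=200):
    """Steady-state CPU reservation; individual PEMs may spike to ~1000m."""
    return node_count * pem_millicores

print(pixie_memory_gib(10))              # 21 (Gi) for the 10-node example above
print(pixie_cpu_baseline_millicores(10)) # 2000 (m)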

Security Considerations

  • Network policies to restrict Pixie component communication
  • RBAC configuration for user access control
  • TLS encryption for all inter-component communication
  • Data residency - all telemetry stays within cluster
  • Audit logging integration with Kubernetes audit system

PxL Scripts

PxL (Pixie Language) is a domain-specific language based on Python/Pandas syntax for querying and analyzing telemetry data.

Memory Leak Detection Scripts

Basic Memory Usage Monitoring

import px

# Query memory usage over time
df = px.DataFrame(table='process_stats', start_time='-5m')

# Filter for a specific service and materialize the pod name from metadata
df = df[df.ctx['service'] == 'my-service']
df.pod = df.ctx['pod']

# Aggregate memory usage by pod
memory_stats = df.groupby(['pod']).agg(
    avg_memory_mb=('vsize_mb', px.mean),
    max_memory_mb=('vsize_mb', px.max),
    first_memory_mb=('vsize_mb', px.first),
    last_memory_mb=('vsize_mb', px.last)
)

# Growth over the window (aggregate expressions cannot be subtracted inline)
memory_stats.memory_growth = memory_stats.last_memory_mb - memory_stats.first_memory_mb

px.display(memory_stats)

Memory Growth Detection

import px

# Detect memory growth trends
df = px.DataFrame(table='process_stats', start_time='-30m')
df.service = df.ctx['service']
df.pod = df.ctx['pod']

# Compare memory at the start and end of the window for each pod
# (window growth via first/last aggregates instead of a pandas-style shift())
growth = df.groupby(['service', 'pod']).agg(
    first_memory_mb=('vsize_mb', px.first),
    last_memory_mb=('vsize_mb', px.last),
    peak_memory_mb=('vsize_mb', px.max)
)

# Relative growth over the window
growth.memory_growth_rate = (
    growth.last_memory_mb - growth.first_memory_mb
) / growth.first_memory_mb

# Flag potential memory leaks (>5% growth over the 30-minute window)
leak_summary = growth[growth.memory_growth_rate > 0.05]

px.display(leak_summary)

Custom Monitoring Scripts

Service Health Dashboard

import px

# Multi-dimensional service health
df = px.DataFrame(table='http_events', start_time='-10m')
df.service = df.ctx['service']
df.failure = df.resp_status >= 500   # server-side errors

health_metrics = df.groupby('service').agg(
    request_count=('latency_ns', px.count),
    avg_latency_ns=('latency_ns', px.mean),
    error_rate=('failure', px.mean)
)
health_metrics.avg_latency_ms = health_metrics.avg_latency_ns / 1000000

# Memory usage lives in a separate table and is joined on the service name
proc = px.DataFrame(table='process_stats', start_time='-10m')
proc.service = proc.ctx['service']
memory = proc.groupby('service').agg(memory_usage_mb=('vsize_mb', px.mean))

health_metrics = health_metrics.merge(memory, how='left',
                                      left_on='service', right_on='service',
                                      suffixes=['', '_mem'])

px.display(health_metrics)

Data Export Scripts

import px

# Export metrics for external analysis
def export_metrics(service_name, duration='-1h'):
    """Export comprehensive metrics for a service"""
    
    # HTTP metrics
    http_df = px.DataFrame(table='http_events', start_time=duration)
    http_df = http_df[http_df.ctx['service'] == service_name]
    
    # Process metrics  
    proc_df = px.DataFrame(table='process_stats', start_time=duration)
    proc_df = proc_df[proc_df.ctx['service'] == service_name]
    
    # Network metrics
    net_df = px.DataFrame(table='conn_stats', start_time=duration)
    net_df = net_df[net_df.ctx['service'] == service_name]
    
    return {
        'http_metrics': http_df,
        'process_metrics': proc_df,  
        'network_metrics': net_df
    }

Code Examples

API Usage

import os

import pxapi

# Connect to Pixie; the API key and cluster ID come from the Pixie UI
# (or `px auth login` / `px get viziers` with the CLI)
client = pxapi.Client(token=os.environ["PIXIE_API_KEY"])
conn = client.connect_to_cluster(os.environ["PIXIE_CLUSTER_ID"])

# PxL script; px.display() names the output table for the API consumer
script = """
import px
df = px.DataFrame(table='http_events', start_time='-5m')
df.service = df.ctx['service']
df = df.groupby('service').agg(
    request_count=('latency_ns', px.count),
    avg_latency=('latency_ns', px.mean)
)
px.display(df, 'service_stats')
"""

# Execute the script and stream the rows of the named table
for row in conn.prepare_script(script).results("service_stats"):
    print(row["service"], row["request_count"], row["avg_latency"])

Integration Patterns

# Kubernetes CronJob for periodic analysis
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pixie-memory-analysis
spec:
  schedule: "*/15 * * * *"  # Every 15 minutes
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: analyzer
            image: pixie-analyzer:latest
            command: ["python", "memory_leak_detector.py"]
            env:
            - name: PIXIE_API_KEY
              valueFrom:
                secretKeyRef:
                  name: pixie-credentials
                  key: api-key

Automated Monitoring

#!/usr/bin/env python3
"""
Automated Pixie memory monitoring with alerting
"""
import os
import time

import pxapi

# PxL query: pods using more than 80% of their memory limit.
# Column names (vsize_mb, memory_limit_bytes) follow the examples above.
MEMORY_PXL = """
import px
df = px.DataFrame(table='process_stats', start_time='-5m')
df.service = df.ctx['service']
df.pod = df.ctx['pod']

memory_stats = df.groupby(['service', 'pod']).agg(
    current_memory_mb=('vsize_mb', px.last),
    memory_limit_bytes=('memory_limit_bytes', px.last)
)
memory_stats.memory_limit_mb = memory_stats.memory_limit_bytes / 1024 / 1024
memory_stats.memory_usage_pct = (
    memory_stats.current_memory_mb / memory_stats.memory_limit_mb * 100
)

px.display(memory_stats[memory_stats.memory_usage_pct > 80], 'high_memory')
"""


class PixieMemoryMonitor:
    def __init__(self, cluster_id):
        client = pxapi.Client(token=os.environ["PIXIE_API_KEY"])
        self.conn = client.connect_to_cluster(cluster_id)
        self.memory_threshold = 90  # Percent

    def check_memory_usage(self):
        # Run the PxL query and collect the rows of the named output table
        script = self.conn.prepare_script(MEMORY_PXL)
        return list(script.results("high_memory"))

    def send_alert(self, high_memory_pods):
        # Alert logic here (email, Slack webhook, PagerDuty, ...)
        pass

    def run_monitoring_loop(self):
        while True:
            results = self.check_memory_usage()
            if len(results) > 0:
                self.send_alert(results)
            time.sleep(300)  # 5-minute intervals


if __name__ == "__main__":
    monitor = PixieMemoryMonitor("prod-cluster")
    monitor.run_monitoring_loop()

Monitoring & Alerting

Memory Growth Patterns

# PxL script for identifying memory growth patterns
import px

def detect_memory_patterns(service_name, lookback='-2h'):
    """Detect memory allocation patterns and potential leaks.

    PxL has no pandas-style rolling() helper, so samples are bucketed into
    5-minute windows with px.bin() and growth is measured across windows.
    """
    df = px.DataFrame(table='process_stats', start_time=lookback)
    df = df[df.ctx['service'] == service_name]
    df.pod = df.ctx['pod']

    # Bucket samples into 5-minute windows
    window_ns = 5 * 60 * 1000 * 1000 * 1000
    df.window = px.bin(df.time_, window_ns)

    per_window = df.groupby(['pod', 'window']).agg(
        window_memory_mb=('vsize_mb', px.mean)
    )

    # Memory leak indicators: growth between the first and last window
    leak_indicators = per_window.groupby('pod').agg(
        start_memory_mb=('window_memory_mb', px.first),
        end_memory_mb=('window_memory_mb', px.last),
        max_memory_mb=('window_memory_mb', px.max)
    )
    leak_indicators.growth_mb = (
        leak_indicators.end_memory_mb - leak_indicators.start_memory_mb
    )

    # Threshold is illustrative; tune it to the service's footprint
    return leak_indicators[leak_indicators.growth_mb > 50]

Service-Level Monitoring

# Comprehensive service health monitoring
import px

def service_health_check(service_filter=''):
    """Generate a comprehensive service health report."""

    # HTTP performance metrics
    http_df = px.DataFrame(table='http_events', start_time='-15m')
    http_df.service = http_df.ctx['service']
    if service_filter:
        http_df = http_df[px.contains(http_df.service, service_filter)]

    http_df.failure = http_df.resp_status >= 500
    http_metrics = http_df.groupby('service').agg(
        request_count=('latency_ns', px.count),
        avg_latency_ns=('latency_ns', px.mean),
        latency_quantiles=('latency_ns', px.quantiles),
        error_rate=('failure', px.mean)
    )
    http_metrics.avg_latency_ms = http_metrics.avg_latency_ns / 1000000
    http_metrics.p99_latency_ms = px.pluck_float64(
        http_metrics.latency_quantiles, 'p99') / 1000000

    # Memory metrics
    proc_df = px.DataFrame(table='process_stats', start_time='-15m')
    proc_df.service = proc_df.ctx['service']
    if service_filter:
        proc_df = proc_df[px.contains(proc_df.service, service_filter)]

    memory_metrics = proc_df.groupby('service').agg(
        avg_memory_mb=('vsize_mb', px.mean),
        max_memory_mb=('vsize_mb', px.max),
        cpu_usage_pct=('cpu_usage_pct', px.mean)
    )

    # Combine metrics
    combined = http_metrics.merge(memory_metrics, how='outer',
                                  left_on='service', right_on='service',
                                  suffixes=['', '_mem'])

    px.display(combined)
Alert Integration

Prometheus Integration

# ServiceMonitor for Pixie metrics export
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: pixie-metrics
spec:
  selector:
    matchLabels:
      app: pixie-prometheus-exporter
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
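
Pixie does not ship the exporter referenced by the app: pixie-prometheus-exporter label, so one has to be provided. The sketch below shows one possible shape, reusing the pxapi pattern from the API Usage example and the prometheus_client library; the metric name and scrape port are arbitrary choices.

#!/usr/bin/env python3
"""Minimal sketch of a Pixie-to-Prometheus exporter (not an official component)."""
import os
import time

import pxapi
from prometheus_client import Gauge, start_http_server

# PxL query; column names follow the examples earlier on this page
PXL = """
import px
df = px.DataFrame(table='process_stats', start_time='-1m')
df.service = df.ctx['service']
df = df.groupby('service').agg(avg_memory_mb=('vsize_mb', px.mean))
px.display(df, 'memory')
"""

memory_gauge = Gauge('pixie_service_memory_mb',
                     'Average memory per service reported by Pixie', ['service'])

client = pxapi.Client(token=os.environ["PIXIE_API_KEY"])
conn = client.connect_to_cluster(os.environ["PIXIE_CLUSTER_ID"])

start_http_server(9090)  # exposes /metrics for the ServiceMonitor above
while True:
    for row in conn.prepare_script(PXL).results("memory"):
        memory_gauge.labels(service=row["service"]).set(row["avg_memory_mb"])
    time.sleep(30)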

Custom Webhook Alerts

import requests
import json

def send_pixie_alert(service, memory_usage, threshold):
    """Send alert to webhook endpoint"""
    
    alert_data = {
        "text": f"Memory Alert: {service}",
        "attachments": [{
            "color": "danger",
            "fields": [{
                "title": "Service",
                "value": service,
                "short": True
            }, {
                "title": "Memory Usage",
                "value": f"{memory_usage}MB ({memory_usage/threshold*100:.1f}%)",
                "short": True
            }]
        }]
    }
    
    requests.post(
        "https://hooks.slack.com/services/YOUR/WEBHOOK/URL",
        json=alert_data
    )
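
Wired to the monitoring loop above, a call might look like this (the values are hypothetical):

# Hypothetical values: ~1.8GiB in use against a 2GiB limit
send_pixie_alert("checkout", memory_usage=1843, threshold=2048)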

Baseline Establishment

# Establish performance baselines
import px

def establish_baseline(service_name, days_back=7):
    """Establish a performance baseline over a historical period.

    Note: PEMs keep data in memory for a short window, so multi-day
    baselines require long-term export (e.g. via a Pixie export plugin).
    """
    # Query historical data
    lookback = f'-{days_back}d'
    df = px.DataFrame(table='process_stats', start_time=lookback)
    df.service = df.ctx['service']
    df = df[df.service == service_name]

    # Bucket samples into one-hour windows
    hour_ns = 60 * 60 * 1000 * 1000 * 1000
    df.hour = px.bin(df.time_, hour_ns)

    # Calculate baseline metrics (px.quantiles exposes p50/p90/p99,
    # so p99 stands in for the original p95/stddev figures)
    baseline = df.groupby(['service', 'hour']).agg(
        baseline_memory_mb=('vsize_mb', px.mean),
        memory_quantiles=('vsize_mb', px.quantiles),
        baseline_cpu_pct=('cpu_usage_pct', px.mean)
    )
    baseline.memory_p99 = px.pluck_float64(baseline.memory_quantiles, 'p99')

    return baseline

Comparison with Alternatives

vs Parca: Kubernetes-Specific Features

| Feature           | Pixie                           | Parca                        |
|-------------------|---------------------------------|------------------------------|
| Scope             | Full observability platform     | Continuous profiling focused |
| Data Coverage     | Metrics, traces, logs, profiles | CPU/memory profiling only    |
| Protocol Support  | HTTP, gRPC, DNS, MySQL, etc.    | Not applicable               |
| Service Discovery | Automatic Kubernetes-native     | Manual configuration         |
| Query Language    | PxL (Pythonic)                  | PromQL-style queries         |
| Storage           | In-memory, ephemeral            | Persistent storage           |
| Deployment        | DaemonSet + Operator            | Single binary deployment     |

When to Choose Pixie:

  • Need comprehensive observability beyond profiling
  • Want zero-instrumentation application monitoring
  • Require real-time debugging capabilities
  • Need service topology and dependency mapping

When to Choose Parca:

  • Focus specifically on continuous profiling
  • Need long-term profile data retention
  • Want lightweight profiling-only solution
  • Require detailed code-level analysis

vs Traditional APM: No Instrumentation Advantage

Pixie Advantages:

  • Zero code changes required for deployment
  • Language agnostic - works with any runtime
  • Real-time insights without sampling
  • Full request/response capture for debugging
  • No performance impact from instrumentation libraries

Traditional APM Limitations:

  • Requires SDK integration and code changes
  • Language-specific instrumentation overhead
  • Sampling can miss critical events
  • Limited visibility into system-level interactions
  • Deployment complexity with legacy applications

Advantages in Kubernetes Environments

  1. Native Integration

    • Built specifically for Kubernetes architecture
    • Understands pods, services, and namespaces natively
    • Automatic service discovery and mapping
  2. eBPF Capabilities

    • Kernel-level visibility without application changes
    • Network traffic analysis at Layer 7
    • System call and resource monitoring
  3. Edge Computing Architecture

    • Data processing happens locally in cluster
    • No external dependencies for basic functionality
    • Reduced latency and improved security
  4. Developer Experience

    • Instant debugging without redeployment
    • Interactive query interface (Live UI)
    • Collaborative debugging with team access

Repository & Documentation

Primary Resources

  • GitHub repository: https://github.com/pixie-io/pixie
  • Documentation: https://docs.px.dev
  • Project website: https://px.dev

New Relic Integration

Community Resources

Available Editions

  1. Pixie Core (Open Source)

    • Self-managed deployment
    • Full observability capabilities
    • Community support
  2. Pixie by New Relic (Managed)

    • Fully managed service
    • New Relic One integration
    • Enterprise support
  3. Pixie Enterprise Edition

    • Industry-specific compliance features
    • Advanced security controls
    • Professional services support

Getting Started Resources


Last updated: 2024