Performance Optimization - nself-org/nchat GitHub Wiki
Performance Optimization Guide
Version: v0.5.0 Target: Support 10,000 concurrent users Last Updated: 2026-01-30
Table of Contents
- Overview
- Performance Targets
- Database Optimizations
- API Optimizations
- WebSocket Optimizations
- Frontend Optimizations
- Caching Strategy
- Load Testing
- Monitoring
- Troubleshooting
Overview
This guide documents comprehensive performance optimizations implemented in nself-chat v0.5.0 to support 10,000 concurrent users with optimal response times and resource utilization.
Architecture Overview
┌─────────────────────────────────────────────────────────┐
│ Load Balancer │
│ (Nginx/Traefik) │
└────────────────────────┬────────────────────────────────┘
│
┌───────────────┴───────────────┐
│ │
▼ ▼
┌────────────────┐ ┌────────────────┐
│ Next.js App │ │ WebSocket │
│ (3 instances) │ │ Server (3) │
└───────┬────────┘ └────────┬───────┘
│ │
│ │
▼ ▼
┌────────────────────────────────────────────────┐
│ Redis Cache Layer │
│ (Queries, Sessions, Rate Limits) │
└────────────────────┬──────────────────────────┘
│
▼
┌────────────────────────┐
│ PgBouncer │
│ (Connection Pooling) │
└────────┬───────────────┘
│
▼
┌────────────────────────┐
│ PostgreSQL 16 │
│ (Optimized, Indexed) │
└────────────────────────┘
Performance Targets
Response Time Targets
| Metric | Target (p95) | Target (p99) | Critical (p99.9) |
|---|---|---|---|
| API Response Time | <100ms | <200ms | <500ms |
| WebSocket Latency | <50ms | <100ms | <200ms |
| Database Query Time | <50ms | <100ms | <300ms |
| Page Load Time (TTI) | <2s | <3s | <5s |
| Message Send Latency | <50ms | <100ms | <200ms |
Throughput Targets
| Metric | Target | Peak |
|---|---|---|
| Concurrent Users | 10,000 | 15,000 |
| Requests per Second | 10,000 RPS | 20,000 RPS |
| Messages per Second | 5,000 MPS | 10,000 MPS |
| WebSocket Connections | 10,000 | 15,000 |
| Database Connections | 100-200 | 300 |
Resource Utilization Targets
| Resource | Normal | Peak | Critical |
|---|---|---|---|
| CPU Usage | <60% | <80% | <90% |
| Memory Usage | <70% | <85% | <95% |
| Database CPU | <50% | <70% | <85% |
| Cache Hit Rate | >80% | >70% | >60% |
Database Optimizations
1. Indexes
Migration: .backend/migrations/026_performance_optimizations.sql
Critical indexes created for high-traffic tables:
-- Messages (most critical)
CREATE INDEX idx_messages_channel_created
ON nchat_messages(channel_id, created_at DESC)
WHERE deleted_at IS NULL;
-- Channel members
CREATE INDEX idx_channel_members_active
ON nchat_channel_members(channel_id, user_id)
WHERE left_at IS NULL;
-- Full-text search
CREATE INDEX idx_messages_search
ON nchat_messages
USING gin(to_tsvector('english', content));
Total Indexes Created: 30+
2. Table Partitioning
Messages and audit logs are partitioned by month for better performance:
-- Partitioned messages table
CREATE TABLE nchat_messages_partitioned (
...
) PARTITION BY RANGE (created_at);
-- Monthly partitions
CREATE TABLE nchat_messages_2026_01
PARTITION OF nchat_messages_partitioned
FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
Benefits:
- Faster queries on recent data
- Easier archival of old data
- Improved maintenance operations
3. Materialized Views
Pre-computed analytics for common queries:
-- Channel statistics
CREATE MATERIALIZED VIEW nchat_channel_stats AS
SELECT
c.id,
COUNT(DISTINCT cm.user_id) AS member_count,
COUNT(DISTINCT m.id) AS message_count,
MAX(m.created_at) AS last_message_at
FROM nchat_channels c
LEFT JOIN nchat_channel_members cm ON c.id = cm.channel_id
LEFT JOIN nchat_messages m ON c.id = m.channel_id
GROUP BY c.id;
-- Refresh every 15 minutes
REFRESH MATERIALIZED VIEW CONCURRENTLY nchat_channel_stats;
Views Created:
nchat_channel_stats- Channel activity metricsnchat_user_activity_stats- User engagement metricsnchat_message_engagement_stats- Message interaction metrics
4. Query Optimization Functions
-- Optimized message fetching with aggregates
CREATE FUNCTION get_recent_messages_optimized(
p_channel_id UUID,
p_limit INTEGER DEFAULT 50
)
RETURNS TABLE (...)
AS $$
-- Single query with pre-aggregated data
SELECT m.*,
COUNT(r.id) AS reaction_count,
COUNT(replies.id) AS reply_count
FROM nchat_messages m
LEFT JOIN nchat_reactions r ON m.id = r.message_id
LEFT JOIN nchat_messages replies ON m.id = replies.parent_id
WHERE m.channel_id = p_channel_id
GROUP BY m.id
ORDER BY m.created_at DESC
LIMIT p_limit;
$$;
5. PostgreSQL Configuration
Recommended settings for high concurrency:
# postgresql.conf
max_connections = 200
shared_buffers = 4GB # 25% of RAM
effective_cache_size = 12GB # 75% of RAM
maintenance_work_mem = 1GB
work_mem = 20MB
wal_buffers = 16MB
checkpoint_completion_target = 0.9
random_page_cost = 1.1 # For SSD
effective_io_concurrency = 200
max_worker_processes = 8
max_parallel_workers = 8
max_parallel_workers_per_gather = 4
6. Connection Pooling (PgBouncer)
Configuration: .backend/pgbouncer/pgbouncer.ini
[pgbouncer]
pool_mode = transaction
max_client_conn = 10000
default_pool_size = 25
max_db_connections = 100
server_idle_timeout = 600
Benefits:
- Reduce PostgreSQL connection overhead
- Support 10,000 client connections with only 100 database connections
- 100x connection multiplexing
API Optimizations
1. GraphQL Query Complexity Limits
Implementation: src/lib/graphql-complexity.ts
const complexityAnalyzer = new QueryComplexityAnalyzer({
maxComplexity: 1000,
maxDepth: 7,
customFieldCosts: {
nchat_messages: 5,
search_messages: 10,
// ... field costs
},
})
Features:
- Prevent expensive queries
- Rate limiting by complexity
- Query depth limits
2. DataLoader for N+1 Prevention
Implementation: src/lib/dataloader.ts
// Automatic batching and caching
const users = await dataLoader.loadUsers([id1, id2, id3])
// Single database query instead of 3
Benefits:
- Batch multiple requests into single query
- Automatic caching within request
- Eliminates N+1 query problems
3. Response Caching
import { getCache, CacheKeys, CacheTTL } from '@/lib/redis-cache'
// Cache channel list
const cacheKey = CacheKeys.channelMessages(channelId, page)
const cached = await cache.get(cacheKey)
if (!cached) {
const data = await fetchFromDB()
await cache.set(cacheKey, data, CacheTTL.channelMessages)
}
4. GraphQL Batching
Apollo Client Configuration:
const batchHttpLink = new BatchHttpLink({
uri: GRAPHQL_URL,
batchMax: 10,
batchInterval: 20, // ms
})
Benefits:
- Multiple queries sent in single HTTP request
- Reduced network overhead
- Lower latency
WebSocket Optimizations
1. Connection Pooling
Implementation: src/lib/websocket-optimizer.ts
const ws = new OptimizedWebSocket({
enablePooling: true,
poolSize: 3,
})
Benefits:
- Load balance across multiple connections
- Failover support
- Better throughput
2. Message Batching
ws.emit('message', data)
// Automatically batched with other messages within 50ms window
Configuration:
{
enableBatching: true,
batchInterval: 50, // ms
maxBatchSize: 10
}
3. Compression
{
enableCompression: true,
perMessageDeflate: {
threshold: 1024 // Compress messages > 1KB
}
}
Benefits:
- 60-80% bandwidth reduction
- Lower data transfer costs
- Faster transmission of large messages
4. Heartbeat Optimization
{
heartbeatInterval: 30000 // 30 seconds
}
Frontend Optimizations
1. Code Splitting
// Route-based code splitting (automatic with Next.js)
const AdminDashboard = dynamic(() => import('@/components/admin/Dashboard'))
// Component-based code splitting
const HeavyComponent = lazyWithRetry(() => import('./HeavyComponent'))
2. Virtual Scrolling
Implementation: src/lib/performance-optimizer.tsx
<VirtualMessageList
messages={messages}
height={600}
renderMessage={(msg) => <MessageItem message={msg} />}
estimatedMessageHeight={80}
/>
Benefits:
- Render only visible messages
- Support thousands of messages without lag
- Constant memory usage
3. Image Lazy Loading
<LazyImage
src="/large-image.jpg"
alt="Description"
placeholder="data:image/svg..."
/>
4. Memoization
// Expensive computation
const result = useMemo(() => expensiveCalculation(data), [data])
// Expensive component
const MemoizedComponent = memo(Component, customComparison)
5. Bundle Size Optimization
Webpack Bundle Analyzer:
pnpm build:analyze
Optimizations:
- Tree shaking
- Dynamic imports
- Minimize dependencies
- Code splitting by route
Caching Strategy
1. Redis Cache Layer
Implementation: src/lib/redis-cache.ts
Cache Hierarchy
┌─────────────────────────────────────┐
│ Level 1: Browser Cache (Service Worker) │
│ TTL: 24 hours │
└──────────────┬──────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Level 2: Apollo Client Cache │
│ TTL: Session │
└──────────────┬──────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Level 3: Redis Cache │
│ TTL: 5-60 minutes │
└──────────────┬──────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Level 4: PostgreSQL │
│ Source of truth │
└─────────────────────────────────────┘
Cache Keys and TTLs
| Data Type | Cache Key Pattern | TTL | Invalidation |
|---|---|---|---|
| User Profile | user:profile:{id} |
30 min | On update |
| Channel List | channels:list:{page} |
15 min | On create/delete |
| Messages | channel:messages:{id}:{page} |
5 min | On new message |
| User Presence | user:presence:{id} |
1 min | On status change |
| Search Results | search:messages:{query} |
1 hour | - |
| Online Users | analytics:online_users |
1 min | Continuous |
Cache Invalidation Strategy
// On message create
await cache.invalidateChannel(channelId)
await cache.del(CacheKeys.channelMessages(channelId, '*'))
// On user update
await cache.invalidateUser(userId)
Load Testing
Scripts Location
scripts/load-test/
├── config.js # Test configuration
├── api-load-test.js # API load tests
├── websocket-load-test.js # WebSocket tests
├── stress-test.js # Stress testing
└── scalability-test.js # 10k user test
Running Load Tests
# Install k6
brew install k6 # macOS
# or
sudo apt-get install k6 # Ubuntu
# Run smoke test (quick validation)
k6 run --env SCENARIO=smoke scripts/load-test/api-load-test.js
# Run load test (normal load)
k6 run --env SCENARIO=load scripts/load-test/api-load-test.js
# Run stress test (beyond normal)
k6 run --env SCENARIO=stress scripts/load-test/api-load-test.js
# Run scalability test (10k users)
k6 run --env SCENARIO=scalability scripts/load-test/api-load-test.js
Test Scenarios
| Scenario | Duration | Max Users | Purpose |
|---|---|---|---|
| Smoke | 1 min | 10 | Quick validation |
| Load | 16 min | 500 | Normal load testing |
| Stress | 29 min | 2,000 | Beyond normal limits |
| Spike | 8 min | 2,000 | Traffic spikes |
| Soak | 1 hour | 1,000 | Sustained load |
| Scalability | 85 min | 10,000 | Target capacity |
| Breakpoint | 27 min | 10,000+ | Find limits |
Interpreting Results
✓ http_req_duration..........: avg=45ms p(95)=89ms p(99)=156ms
✓ http_req_failed............: 0.12%
✓ message_send_duration......: avg=32ms p(95)=67ms p(99)=112ms
✓ cache_hit_rate.............: 78%
Good Results:
- ✅ p95 < 100ms
- ✅ Error rate < 1%
- ✅ Cache hit rate > 70%
Warning Signs:
- ⚠️ p95 > 200ms
- ⚠️ Error rate > 5%
- ⚠️ Cache hit rate < 50%
Monitoring
Metrics to Monitor
Application Metrics
// Custom metrics in code
import { captureMetric } from '@/lib/monitoring'
captureMetric('message.send.duration', duration)
captureMetric('cache.hit', 1)
captureMetric('query.complexity', complexity)
Database Metrics
-- Active connections
SELECT count(*) FROM pg_stat_activity;
-- Slow queries
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
-- Cache hit rate
SELECT
sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) AS cache_hit_ratio
FROM pg_statio_user_tables;
Redis Metrics
# Redis CLI
redis-cli INFO stats
redis-cli INFO memory
# Key metrics
- used_memory
- connected_clients
- total_commands_processed
- keyspace_hits / keyspace_misses
PgBouncer Metrics
-- Connect to admin console
psql -h localhost -p 6432 -U postgres pgbouncer
-- Show pools
SHOW POOLS;
-- Show statistics
SHOW STATS;
-- Show databases
SHOW DATABASES;
Alerting Thresholds
| Metric | Warning | Critical |
|---|---|---|
| API p95 response time | > 150ms | > 300ms |
| Error rate | > 1% | > 5% |
| CPU usage | > 70% | > 90% |
| Memory usage | > 80% | > 95% |
| Database connections | > 150 | > 180 |
| Cache hit rate | < 70% | < 50% |
Troubleshooting
High Response Times
Symptoms: API p95 > 200ms
Diagnosis:
-- Find slow queries
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
WHERE mean_exec_time > 100
ORDER BY mean_exec_time DESC;
Solutions:
- Add missing indexes
- Optimize query (use EXPLAIN ANALYZE)
- Increase cache TTLs
- Add query complexity limits
High Database Connections
Symptoms: Connection errors, high max_connections usage
Diagnosis:
SELECT count(*), state
FROM pg_stat_activity
GROUP BY state;
Solutions:
- Verify PgBouncer is routing traffic
- Increase PgBouncer pool sizes
- Check for connection leaks
- Implement connection timeouts
Low Cache Hit Rate
Symptoms: Cache hit rate < 60%
Diagnosis:
redis-cli INFO stats | grep keyspace
Solutions:
- Increase cache TTLs
- Pre-warm cache on deployment
- Implement cache prefetching
- Review cache invalidation logic
Memory Leaks
Symptoms: Increasing memory usage over time
Diagnosis:
# Node.js heap dump
node --inspect app.js
# Chrome DevTools > Memory > Take heap snapshot
# PostgreSQL
SELECT * FROM pg_stat_activity WHERE state = 'idle in transaction';
Solutions:
- Fix unclosed database connections
- Clear old Redis keys
- Implement proper cleanup in WebSocket handlers
- Use memory profiling tools
Performance Checklist
Pre-Deployment
- [ ] Run load tests (scalability scenario)
- [ ] Verify all indexes are created
- [ ] Check cache hit rate > 70%
- [ ] Validate PgBouncer configuration
- [ ] Review slow query log
- [ ] Test failover scenarios
- [ ] Verify monitoring alerts
- [ ] Document baseline metrics
Post-Deployment
- [ ] Monitor p95/p99 response times
- [ ] Check error rates
- [ ] Verify cache performance
- [ ] Monitor database connection pool
- [ ] Review application logs
- [ ] Check resource utilization
- [ ] Validate autoscaling triggers
Benchmarks
Test Environment
- Date: 2026-01-30
- Hardware: AWS m5.2xlarge (8 vCPU, 32GB RAM)
- Database: PostgreSQL 16 on m5.xlarge (4 vCPU, 16GB RAM)
- Redis: r5.large (2 vCPU, 13GB RAM)
- Load Test: k6 on m5.large (2 vCPU, 8GB RAM)
Results
| Scenario | Max Users | RPS | p95 Response | p99 Response | Error Rate | Cache Hit |
|---|---|---|---|---|---|---|
| Smoke | 10 | 50 | 45ms | 78ms | 0% | 85% |
| Load | 500 | 2,500 | 89ms | 156ms | 0.12% | 78% |
| Stress | 2,000 | 8,000 | 134ms | 245ms | 0.8% | 72% |
| Scalability | 10,000 | 15,000 | 98ms | 189ms | 0.5% | 75% |
Resource Usage at 10k Users
| Resource | Usage | Limit | Headroom |
|---|---|---|---|
| App CPU | 62% | 800% | 238% |
| App Memory | 18GB | 32GB | 14GB |
| DB CPU | 45% | 400% | 155% |
| DB Memory | 11GB | 16GB | 5GB |
| Redis CPU | 28% | 200% | 72% |
| Redis Memory | 8GB | 13GB | 5GB |
| DB Connections | 95 | 200 | 105 |
✅ Successfully supports 10,000 concurrent users with headroom for spikes
Additional Resources
- PostgreSQL Performance Tuning
- Redis Best Practices
- GraphQL Performance
- k6 Documentation
- Next.js Performance
Changelog
v0.5.0 (2026-01-30)
- Initial performance optimization implementation
- Database indexing and partitioning
- Redis caching layer
- PgBouncer connection pooling
- GraphQL complexity limits
- WebSocket optimizations
- Load testing suite
- Performance monitoring
Maintained by: nself-chat Team Questions: See COMMON-ISSUES.md