Analytics Architecture - bcgov/eagle-dev-guides GitHub Wiki

Analytics Architecture - Penguin Analytics Integration

Overview

The EPIC platform integrates with Penguin Analytics, a dedicated microservice that captures user interaction events and generates insights through time-series analysis. This page covers EPIC-specific integration patterns. For the complete Penguin Analytics architecture, see the Penguin Analytics wiki.

Integration Overview

graph TB
    User["User Interactions<br/>eagle-admin / eagle-public"]
    
    Rproxy["rproxy (nginx)<br/>Routing layer"]
    
    EagleAPI["eagle-api<br/>Port 3000<br/>MongoDB"]
    
    Analytics["penguin-analytics-api<br/>Port 3000<br/>TimescaleDB"]
    
    Metabase["Metabase<br/>Analytics dashboards"]
    
    User -->|"/api/*"| Rproxy
    User -->|"/analytics"| Rproxy
    Rproxy -->|proxy_pass| EagleAPI
    Rproxy -->|proxy_pass| Analytics
    Analytics -->|SQL queries| Metabase
    
    style User fill:#e3f2fd
    style Rproxy fill:#f3e5f5
    style EagleAPI fill:#fff4e6
    style Analytics fill:#e8f5e9
    style Metabase fill:#fce4ec

Local Dev Routing

In local development, proxy.conf.js routes both /api and /analytics to eagle-api. In turn, eagle-api exposes an /analytics Express route that proxies requests to penguin-analytics (via ANALYTICS_SERVICE_URL, default http://localhost:3001):

graph LR
    Browser["Browser :4200"] -->|"/api/*"| DevServer["Angular Dev Server<br/>proxy.conf.js"]
    Browser -->|"/analytics"| DevServer
    DevServer -->|proxy| EagleAPI["eagle-api :3000"]
    EagleAPI -->|"/analytics proxy"| PA["penguin-analytics :3001"]
    style Browser fill:#e3f2fd
    style DevServer fill:#f3e5f5
    style EagleAPI fill:#fff4e6
    style PA fill:#e8f5e9
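The two hops in the diagram above can be modelled as a small pure function. This is a minimal sketch, not the actual proxy.conf.js or eagle-api code: `resolveLocalHops`, `EAGLE_API_URL` and `DEFAULT_ANALYTICS_URL` are hypothetical names, and only the ports and the ANALYTICS_SERVICE_URL default are taken from this page.

```typescript
// Hypothetical model of local dev routing; not the real proxy.conf.js or eagle-api code.
const EAGLE_API_URL = "http://localhost:3000";         // eagle-api port from the diagram above
const DEFAULT_ANALYTICS_URL = "http://localhost:3001"; // documented default for ANALYTICS_SERVICE_URL

type Env = { ANALYTICS_SERVICE_URL?: string };

/** Returns the ordered list of services a browser request passes through in local dev. */
function resolveLocalHops(path: string, env: Env = {}): string[] {
  const isApi = path === "/api" || path.startsWith("/api/");
  const isAnalytics = path === "/analytics" || path.startsWith("/analytics/");
  if (!isApi && !isAnalytics) {
    return []; // not proxied: handled by the Angular dev server itself
  }
  // Hop 1: the dev server proxies both prefixes to eagle-api.
  const hops = [EAGLE_API_URL];
  // Hop 2: eagle-api's /analytics route forwards to penguin-analytics.
  if (isAnalytics) {
    hops.push(env.ANALYTICS_SERVICE_URL ?? DEFAULT_ANALYTICS_URL);
  }
  return hops;
}
```

Under these assumptions, a request to /analytics passes through eagle-api and then penguin-analytics, while a request under /api/ stops at eagle-api.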

Why /analytics is Separate from /api

Key Reasons

1. Microservice Independence

  • Penguin Analytics is a completely separate service:
    • Different repository: digitalspace/penguin-analytics
    • Different database: TimescaleDB (not MongoDB)
    • Independent deployment lifecycle and versioning
    • Independent scaling characteristics

2. No Authentication Required

  • Analytics endpoints are anonymous and don't require Keycloak JWT
  • eagle-public: Fully anonymous tracking (no user identification)
  • eagle-admin: Tracks authenticated users but stores anonymized GUIDs
  • Separating from /api clarifies that no auth headers are needed

3. Performance Isolation

  • Analytics is write-heavy with high-volume event ingestion
  • Time-series database optimized for inserts (not MongoDB's use case)
  • Analytics failures should not impact core API operations
  • Independent scaling based on event volume, not API traffic

4. Technology Choice

  • TimescaleDB: Purpose-built for time-series data (auto partitioning, compression, fast aggregations)
  • MongoDB: Document database optimized for EPIC project data (not time-series)

5. Same-Origin for Ad Blocker Bypass

  • Serving /analytics from the same domain as the application avoids CORS preflight requests
  • Same-origin requests are less likely to be blocked by ad blockers
  • Browser security features work without cross-origin configuration

Legacy /api/analytics Path

Prior to v2.4.1, env.js defaulted to ANALYTICS_API_URL = '/api/analytics'. The default was corrected to /analytics, but some clients cached the old env.js for up to 1 year. As of rproxy v1.0.5, both /analytics and /api/analytics route to penguin-analytics so that those clients keep working.
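The routing rule above depends on match order: the legacy /api/analytics prefix has to be tested before the general /api prefix. The sketch below illustrates that ordering only. It is not the rproxy nginx configuration, `ROUTES`, `matchesPrefix` and `routeFor` are hypothetical names, and it follows the integration diagram in assuming /api traffic also passes through rproxy.

```typescript
type Upstream = "penguin-analytics-api" | "eagle-api" | "none";

// Ordered rules: the legacy /api/analytics prefix is listed before /api,
// otherwise legacy requests would fall through to eagle-api.
const ROUTES: ReadonlyArray<{ prefix: string; upstream: Upstream }> = [
  { prefix: "/api/analytics", upstream: "penguin-analytics-api" }, // legacy path from cached env.js
  { prefix: "/analytics", upstream: "penguin-analytics-api" },     // current path
  { prefix: "/api", upstream: "eagle-api" },
];

/** True when path equals the prefix or continues it at a "/" boundary. */
function matchesPrefix(path: string, prefix: string): boolean {
  return path === prefix || path.startsWith(prefix + "/");
}

/** Returns the upstream selected by the first matching rule. */
function routeFor(path: string): Upstream {
  const rule = ROUTES.find((r) => matchesPrefix(path, r.prefix));
  return rule ? rule.upstream : "none";
}
```

Matching on a "/" boundary keeps an unrelated path that merely starts with the same characters from being captured by a rule.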

Comparison: /api vs /analytics

| Aspect | /api | /analytics |
|---|---|---|
| Routing | Direct OpenShift route | Through rproxy |
| Authentication | Keycloak JWT required | No authentication |
| Service | eagle-api (Node.js) | penguin-analytics-api (Node.js) |
| Database | MongoDB | TimescaleDB (PostgreSQL) |
| Request Pattern | Read/write CRUD operations | Write-heavy event ingestion |
| Response Time | 100-500ms (database queries) | < 50ms (fire-and-forget) |
| Data Retention | Permanent (project records) | Time-series (compressed after 30 days) |

EPIC Application Integration

eagle-public Configuration

Frontend configuration in src/env.js:

window.__env = {
  // ... other config
  ANALYTICS_API_URL: '/analytics',  // Same-origin path
  ANALYTICS_TRAFFIC_TRACKING: true, // UTM params, referrer, traffic channel
  ANALYTICS_ENHANCED_TRACKING: false // Browser fingerprinting (prod default)
};

Traffic Source Tracking (ANALYTICS_TRAFFIC_TRACKING):

  • Captures UTM parameters and referrer on first visit
  • Persists in localStorage via @analytics/original-source-plugin
  • Determines traffic channel: chatbot, direct, email, internal, referral, search, social, other
  • Data sent with Page Viewed events: traffic_channel, traffic_source, traffic_medium, etc.
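Channel derivation of the kind listed above can be sketched as a classifier over UTM parameters and the referrer. The actual rules used by EPIC are not documented on this page, so everything below is an assumption for illustration: the function and type names, the precedence of UTM medium over referrer, and the example domain lists are all hypothetical.

```typescript
type TrafficChannel =
  | "chatbot" | "direct" | "email" | "internal"
  | "referral" | "search" | "social" | "other";

interface TrafficInput {
  utmSource?: string;
  utmMedium?: string;
  referrer?: string; // e.g. document.referrer; may be empty
  siteHost: string;  // host of the current site, used to detect internal navigation
}

// Illustrative domain lists only; a real deployment would maintain its own.
const SEARCH_DOMAINS = ["google.com", "bing.com", "duckduckgo.com"];
const SOCIAL_DOMAINS = ["facebook.com", "linkedin.com"];
const CHATBOT_DOMAINS = ["chatgpt.com"];

/** True when host is one of the domains or a subdomain of one. */
function hostMatches(host: string, domains: string[]): boolean {
  return domains.some((d) => host === d || host.endsWith("." + d));
}

function classifyChannel(input: TrafficInput): TrafficChannel {
  // Assumption: an explicit UTM medium takes precedence over the referrer.
  const medium = (input.utmMedium ?? "").toLowerCase();
  if (medium === "email") return "email";
  if (medium === "cpc" || medium === "organic") return "search";
  if (medium === "social") return "social";

  if (!input.referrer) {
    // No referrer: tagged traffic we cannot categorise is "other", untagged is "direct".
    return input.utmSource ? "other" : "direct";
  }
  let host: string;
  try {
    host = new URL(input.referrer).hostname.toLowerCase();
  } catch {
    return "other"; // referrer is not a parseable absolute URL
  }
  if (host === input.siteHost.toLowerCase()) return "internal";
  if (hostMatches(host, CHATBOT_DOMAINS)) return "chatbot";
  if (hostMatches(host, SEARCH_DOMAINS)) return "search";
  if (hostMatches(host, SOCIAL_DOMAINS)) return "social";
  return "referral";
}
```

Keeping the classifier a pure function of its inputs makes it straightforward to unit test each channel without a browser.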

Analytics service usage:

// Auto-tracked events:
// - Page Viewed (route changes)
// - Button Clicked (button elements)
// - Link Clicked (anchor links)
// - User Active (30-second heartbeat)

// Manual tracking for custom events:
this.analyticsService.track('Comment Submitted', { 
  projectId: 'abc123',
  commentLength: 250
});

eagle-admin Configuration

Additional step for user identification after login:

// After Keycloak authentication
this.analyticsService.identify(user.guid, { 
  username: user.username,
  roles: user.roles
});

// On logout
this.analyticsService.reset();  // Tracks "Session Ended" event

Important: eagle-admin uses Keycloak for authentication, but analytics events are still sent to the unauthenticated /analytics endpoint. User context is included in event properties, not in JWT headers.

Event Types

Standard events tracked across EPIC applications:

| Event | Triggered By | Key Properties |
|---|---|---|
| User Identified | Login (eagle-admin only) | traits.username, traits.roles[] |
| Page Viewed | Route navigation | page_name, path, url |
| Button Clicked | Button clicks | button_text, section |
| Link Clicked | Link clicks | link_url, link_text |
| User Active | 30-second heartbeat | is_active, seconds_since_activity |
| Session Ended | Logout (eagle-admin only) | session_end |
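The key properties in the table above can be captured as a lookup that a test or lint step uses to flag incomplete events. This is a sketch under stated assumptions: `KEY_PROPERTIES`, `getPath` and `missingProperties` are hypothetical helpers, not part of the EPIC analytics service, and the exact nesting of properties on the wire is defined by the Penguin Analytics event schema rather than by this example.

```typescript
type EventName =
  | "User Identified" | "Page Viewed" | "Button Clicked"
  | "Link Clicked" | "User Active" | "Session Ended";

// Key properties per event, copied from the table above (dotted paths denote nesting).
const KEY_PROPERTIES: Record<EventName, readonly string[]> = {
  "User Identified": ["traits.username", "traits.roles"],
  "Page Viewed": ["page_name", "path", "url"],
  "Button Clicked": ["button_text", "section"],
  "Link Clicked": ["link_url", "link_text"],
  "User Active": ["is_active", "seconds_since_activity"],
  "Session Ended": ["session_end"],
};

/** Reads a dotted path such as "traits.username"; returns undefined if any step is missing. */
function getPath(obj: unknown, path: string): unknown {
  return path.split(".").reduce<unknown>((cur, key) => {
    if (cur !== null && typeof cur === "object" && key in (cur as object)) {
      return (cur as Record<string, unknown>)[key];
    }
    return undefined;
  }, obj);
}

/** Lists the key properties that are absent from an event's properties object. */
function missingProperties(name: EventName, properties: Record<string, unknown>): string[] {
  return KEY_PROPERTIES[name].filter((p) => getPath(properties, p) === undefined);
}
```

For example, a User Identified event whose traits carry a username but no roles would be reported as missing traits.roles.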

Traffic source properties (included in Page Viewed when ANALYTICS_TRAFFIC_TRACKING=true):

| Property | Description | Example |
|---|---|---|
| traffic_channel | Derived channel category | search, social, direct |
| traffic_source | UTM source or referrer | google, facebook |
| traffic_medium | UTM medium | cpc, email, organic |
| traffic_campaign | UTM campaign | spring_sale |
| traffic_content | UTM content | banner_a |
| traffic_term | UTM term | environmental assessment |

For complete event schema documentation, see Penguin Analytics Event Schema.

Metabase Dashboards

Analytics data is visualized through Metabase dashboards configured per application:

  • eagle-admin: Staff usage patterns, feature adoption, admin activity
  • eagle-public: Public traffic, popular projects, search trends

Configuration Files

Dashboards are defined in YAML configuration files:

  • scripts/configs/eagle-admin.yaml
  • scripts/configs/eagle-public.yaml

For dashboard configuration patterns and Metabase setup, see Penguin Analytics Metabase Configuration.

Two-Tier Privacy System

EPIC analytics uses a two-tier privacy model that gives operators granular control over data collection:

Architecture

flowchart TB
    subgraph Client["Client Tier (Browser)"]
        User[User visits eagle-public]
        Config[Fetch /api/config]
        Flag1{ANALYTICS_ENHANCED_TRACKING?}
        Enhanced[Send full browser context]
        Minimal[Send minimal data only]
    end
    
    subgraph Server["Server Tier (penguin-analytics)"]
        Receive[Receive event]
        Flag2{GEO_ENRICHMENT_ENABLED?}
        HasEnhanced{Has screen_width?}
        Enrich[Enrich with country/city/ISP]
        Store[Store in TimescaleDB]
    end
    
    User --> Config
    Config --> Flag1
    Flag1 -->|true| Enhanced
    Flag1 -->|false| Minimal
    Enhanced --> Receive
    Minimal --> Receive
    Receive --> Flag2
    Flag2 -->|true| HasEnhanced
    Flag2 -->|false| Store
    HasEnhanced -->|yes| Enrich
    HasEnhanced -->|no| Store
    Enrich --> Store
    
    style Client fill:#e3f2fd
    style Server fill:#e8f5e9
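The two decision points in the flowchart above can be expressed as two small functions, one per tier. This is a minimal sketch rather than the eagle-public or penguin-analytics implementation: `EventContext`, `buildClientContext` and `shouldGeoEnrich` are hypothetical names, and reducing the URL to its path in minimal mode is an assumption based on the privacy-mode example elsewhere on this page.

```typescript
interface EventContext {
  url: string;
  title: string;
  screen_width?: number;
  [key: string]: unknown; // other enhanced fields (viewport, timezone, ...)
}

/** Reduces an absolute URL to its path; leaves relative or unparseable values unchanged. */
function toPath(url: string): string {
  if (url.startsWith("/")) return url;
  try {
    return new URL(url).pathname;
  } catch {
    return url;
  }
}

/** Client tier: send full context only when ANALYTICS_ENHANCED_TRACKING is true. */
function buildClientContext(full: EventContext, enhancedTracking: boolean): EventContext {
  if (enhancedTracking) return { ...full };
  return { url: toPath(full.url), title: full.title };
}

/** Server tier: enrich only when GEO_ENRICHMENT_ENABLED is true AND the event has screen_width. */
function shouldGeoEnrich(event: EventContext, geoEnrichmentEnabled: boolean): boolean {
  return geoEnrichmentEnabled && typeof event.screen_width === "number";
}
```

Because the server check keys off screen_width, an event sent in minimal mode is stored without geo enrichment even in an environment where GEO_ENRICHMENT_ENABLED is true.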

Privacy Flags

| Tier | Flag | Service | Controls |
|---|---|---|---|
| Client | ANALYTICS_TRAFFIC_TRACKING | eagle-public | UTM params, referrer, traffic channel |
| Client | ANALYTICS_ENHANCED_TRACKING | eagle-api | Browser fingerprinting: screen, device, network, timezone |
| Server | GEO_ENRICHMENT_ENABLED | penguin-analytics | IP geolocation: country, city, ISP, ASN |

Environment Configuration

| Environment | ANALYTICS_TRAFFIC_TRACKING | ANALYTICS_ENHANCED_TRACKING | GEO_ENRICHMENT_ENABLED |
|---|---|---|---|
| Dev | true | true | true |
| Test | true | true | true |
| Prod | true | false | false |

Important: Production defaults are privacy-safe. Enabling enhanced tracking requires explicit approval.

Data Collected

When ANALYTICS_ENHANCED_TRACKING=true (client-side):

{
  "url": "https://projects.eao.gov.bc.ca/p/abc123",
  "title": "Project Name",
  "referrer": "https://google.com",
  "screen_width": 3440, "screen_height": 1440,
  "viewport_width": 1720, "viewport_height": 900,
  "pixel_ratio": 2, "color_depth": 24,
  "platform": "Linux", "browser": "Brave", "browser_version": "120",
  "mobile": false, "touch_points": 0,
  "timezone": "America/Vancouver",
  "language": "en-CA",
  "connection_type": "4g", "connection_downlink": 10
}

When GEO_ENRICHMENT_ENABLED=true (server-side, added by penguin-analytics):

{
  "country": "CA", "country_name": "Canada",
  "region": "BC", "city": "Victoria",
  "isp": "TELUS Communications Inc.", "asn": 852,
  "client_ip_hash": "a1b2c3d4..."
}

Privacy mode (ANALYTICS_ENHANCED_TRACKING=false):

{
  "url": "/p/abc123",
  "title": "Project Name"
}

GeoIP Implementation

penguin-analytics uses MaxMind GeoLite2 databases for IP geolocation:

Architecture: InitContainer downloads databases on pod startup

  • Databases: GeoLite2-City.mmdb (54MB) + GeoLite2-ASN.mmdb (11MB)
  • Startup: ~60 seconds to download and extract
  • Updates: Automatic monthly (1st of month at 3am UTC)
  • Privacy: Raw IPs are hashed before storage, and private IPs (e.g. 192.168.x.x) are skipped
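Skipping private addresses before a GeoIP lookup can be sketched as a range check. This page only names 192.168.x.x, so treating the other RFC 1918 ranges and loopback as private is an assumption here, and `isPrivateIPv4` is a hypothetical helper rather than penguin-analytics code. IPv6 is not covered.

```typescript
/** Returns true for IPv4 loopback and RFC 1918 private addresses; false for anything else. */
function isPrivateIPv4(ip: string): boolean {
  const parts = ip.split(".");
  if (parts.length !== 4) return false;
  // Accept only 1-3 digit decimal octets in the range 0-255.
  const octets = parts.map((p) => (/^\d{1,3}$/.test(p) ? Number(p) : NaN));
  if (octets.some((o) => Number.isNaN(o) || o > 255)) return false;
  const [a, b] = octets;
  return (
    a === 10 ||                          // 10.0.0.0/8
    (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12
    (a === 192 && b === 168) ||          // 192.168.0.0/16
    a === 127                            // 127.0.0.0/8 loopback
  );
}
```

A caller would run this check first and perform the lookup only when it returns false, since private addresses cannot be geolocated.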

Database Update Workflow:

# Manual trigger
gh workflow run update-geoip-databases.yml -f environment=dev --repo digitalspace/penguin-analytics

# Or restart pods to download fresh databases
oc rollout restart deployment/penguin-analytics-api -n 6cdc9e-dev

Application Behavior

| Application | ANALYTICS_ENHANCED_TRACKING | Behavior |
|---|---|---|
| eagle-public | Respected | Follows flag - privacy mode in prod |
| eagle-admin | Ignored | Always sends full context (staff app) |

eagle-admin always sends full browser context because it's an authenticated staff application where usage tracking is expected.

Deployment

Penguin Analytics is deployed as a separate OpenShift application in the same namespace as EPIC services:

Pods:

  • penguin-analytics-api (2 replicas, with GeoIP initContainer)
  • penguin-analytics-database (TimescaleDB)
  • penguin-analytics-metabase (Metabase)

Routes:

  • /analytics → penguin-analytics-api (via rproxy)
  • Metabase accessible at dedicated route for authorized users

For deployment procedures, see Penguin Analytics Deployment.
