
Data Strategy

Community Event Platform — bounswe2026group9

1. Overall Approach

Our test data strategy follows an integration-first philosophy: tests run against a real Supabase PostgreSQL database rather than isolated mocks. This ensures that schema constraints (NOT NULL, FK, CHECK), triggers, and pg_cron jobs (e.g. the automatic transition to ended) are validated under real behavior. Only external side-effects (SMTP, OAuth callbacks, Supabase storage uploads) are mocked for determinism.
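
For example, a constraint violation is expected to come from the real schema rather than from a mocked error. A minimal sketch of such a test, assuming a psycopg2 connection fixture named `db_conn` and a `CHECK (capacity >= 0)` constraint on the events table (both names are assumptions, not the project's actual fixtures):

```python
# Hedged sketch: the real database, not a mock, rejects the invalid row.
# `db_conn` and the exact column names are assumptions about the test harness.
import psycopg2
import pytest


def test_negative_capacity_rejected_by_schema(db_conn):
    with pytest.raises(psycopg2.IntegrityError):
        with db_conn.cursor() as cur:
            cur.execute(
                "INSERT INTO events (title, capacity) VALUES (%s, %s)",
                ("Broken event", -5),
            )
    db_conn.rollback()  # keep the connection usable for subsequent tests
```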

The strategy is defined across three layers:

| Layer | Purpose | Volume |
| --- | --- | --- |
| L1 — Unit fixtures | Single endpoint / function | 1–3 rows, single user |
| L2 — Integration suite | Multi-step API journeys | 5–20 rows, isolated run |
| L3 — E2E / Scenario seed | Discovery, Suggested, Notification scenarios, NFR validation | ~10,000 events, ~500 users |

2. Test Data Accumulation Strategy

2.1 Run-scoped, ephemeral data (L1 + L2)

Each test run produces its own isolated data space. From backend/tests_support.py:

```python
TEST_RUN_ID = _normalized_run_id()        # CI run id or uuid hex[:8]
USERNAME_RUN_ID = TEST_RUN_ID[:6]

def build_test_identity(prefix):
    unique = uuid.uuid4().hex[:8]
    username = f"{prefix}_{USERNAME_RUN_ID}_{unique}"
    email    = f"{prefix}_{TEST_RUN_ID}_{unique}@example.com"
    return username, email

def cleanup_email_pattern():
    return f"%_{TEST_RUN_ID}_%@example.com"
```

  • Prefix convention: testuser_, eventtest_, imgtest_, notiftest_ — every domain uses its own prefix.
  • Cleanup: the autouse=True fixture in conftest.py deletes test data via a LIKE pattern after each test; FK cascades remove related events / comments / notifications (a sketch of this fixture follows this list).
  • No collisions: parallel CI jobs receive different TEST_RUN_IDs and therefore never interfere even when sharing tables.
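
A minimal sketch of the cleanup fixture referenced above, assuming a supabase-py client fixture named `db` (the table name and client wiring are assumptions; the actual fixture lives in conftest.py):

```python
# Hedged sketch of the autouse cleanup; the table name and `db` fixture are assumptions.
import pytest

from tests_support import cleanup_email_pattern


@pytest.fixture(autouse=True)
def cleanup_run_scoped_data(db):
    yield  # run the test body first
    # Deleting this run's users is enough: FK cascades remove their
    # events, comments, and notifications.
    db.table("users").delete().like("email", cleanup_email_pattern()).execute()
```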

2.2 Scenario seed corpus (L3)

Suggested events, recommendations, and notification flows cannot be tested without a meaningful history. A new backend/seed/scenario_seed.py provides:

| Set | Content | Purpose |
| --- | --- | --- |
| `users_attendance_clusters` | 500 users grouped into 4 interest clusters (sport, music, tech, food) | Suggested filter & "based on attended events" recommendation tests |
| `events_temporal_spread` | 10,000 events — past/present/future mix, 81 city coordinates | NFR-Scalability (10K events, 2 s search) and map pin density |
| `attendance_history` | ~8 going records per user, distributed within their cluster | Deterministic recommendation algorithm tests |
| `comments_threaded` | 30% of events with 2–10 comments + 1 reply | Comment section + parent_id flow (migration 011) |
| `notifications_backlog` | Mix of bookmark / going / cancellation events | Notification list, mark-as-read, recommendations via notifications |

The seed is gated by a SEED_SCENARIO=1 env variable; it runs only against staging / test Supabase projects to prevent any leakage into production.
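
The gate itself can be a few lines at the top of the seed script; a hedged sketch in which everything except `SEED_SCENARIO` (the `SUPABASE_PROJECT_REF` variable and the allow-list) is an assumption:

```python
# Hedged sketch of the SEED_SCENARIO gate; only SEED_SCENARIO comes from the strategy,
# the project-ref variable and allow-list are illustrative assumptions.
import os
import sys


def ensure_seeding_allowed() -> None:
    if os.environ.get("SEED_SCENARIO") != "1":
        sys.exit("Refusing to seed: set SEED_SCENARIO=1 explicitly.")
    project_ref = os.environ.get("SUPABASE_PROJECT_REF", "")
    if project_ref not in {"staging", "test"}:  # illustrative allow-list
        sys.exit(f"Refusing to seed: '{project_ref}' is not a staging/test project.")


if __name__ == "__main__":
    ensure_seeding_allowed()
    # ... generate users_attendance_clusters, events_temporal_spread, and the other sets
```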

2.3 Snapshot & restore

Because the L3 seed is expensive to generate, a versioned snapshot is produced via pg_dump (backend/seed/snapshots/<date>.sql.gz). E2E jobs restore this snapshot before running, so every PR is tested against the same baseline and the flaky-test surface shrinks.


3. Synthetic vs. Existing Data

| Data Type | Source | Rationale |
| --- | --- | --- |
| User profiles | Synthetic (Faker — faker.providers.person, address) | Real user data carries privacy risk; synthetic data is sufficient |
| Event titles / descriptions | Semi-synthetic: 81 cities × category templates ("Istanbul Yoga Meetup", "Ankara Jazz Night"); Lorem ipsum is forbidden | Realistic appearance in UI and demos |
| Coordinates | Turkish city centers ± Gaussian noise (~3 km std) | Realistic clustering for map density tests |
| Date / time | Programmatic (now ± [-180, +180] days; 20% in the past) | Required to test the automatic ended transition driven by pg_cron |
| Images | In-memory JPEG via PIL (`PILImage.new("RGB", (100,100), "red")`) + MOCK_STORAGE_URL patch | Zero storage cost, deterministic |
| Categories | The 15 existing predefined categories (migration 002) + 5 user-created custom ones | No need to re-create the seed |
| Comments | Faker `text()` plus a small Turkish keyword pool | Realistic UI content |
| Email / auth secrets | Synthetic, @example.com domain | Prevents accidental real SMTP delivery |
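
To make the coordinate and date rows of the table concrete, a hedged sketch of one synthetic event (the city list, templates, and field names are illustrative placeholders, not the actual seed schema):

```python
# Hedged sketch of a synthetic event row; city/category lists and field names are
# placeholders. The real seed uses all 81 city centers and the category templates.
import random
from datetime import datetime, timedelta, timezone

from faker import Faker

fake = Faker("tr_TR")
CITY_CENTERS = {"Istanbul": (41.01, 28.98), "Ankara": (39.93, 32.86)}  # 81 in the real seed
TEMPLATES = ["{city} Yoga Meetup", "{city} Jazz Night"]


def synthetic_event() -> dict:
    city, (lat, lon) = random.choice(list(CITY_CENTERS.items()))
    jitter = 3 / 111  # ~3 km standard deviation expressed in degrees
    # 20% of events in the past, the rest up to 180 days ahead
    days = random.randint(-180, -1) if random.random() < 0.2 else random.randint(0, 180)
    start = datetime.now(timezone.utc) + timedelta(days=days)
    return {
        "title": random.choice(TEMPLATES).format(city=city),
        "description": fake.paragraph(),
        "latitude": random.gauss(lat, jitter),
        "longitude": random.gauss(lon, jitter),
        "start_datetime": start.isoformat(),
        "end_datetime": (start + timedelta(hours=2)).isoformat(),
    }
```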

Mocked components: SMTP (send_verification_email), OAuth callback, Supabase storage upload, and the pg_cron scheduler (fast-forwarded in E2E).
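
A hedged example of patching one of these side effects in a test, assuming the mail helper is importable as app.email.send_verification_email and that registration returns 201 (the module path, route, and status code are assumptions):

```python
# Hedged sketch: patch the SMTP helper so no real mail leaves the test run.
# The dotted path, route, and expected status code are assumptions.
def test_register_does_not_send_real_mail(client, monkeypatch):
    sent = []
    monkeypatch.setattr(
        "app.email.send_verification_email",
        lambda to, token: sent.append(to),
    )
    resp = client.post("/auth/register", json={
        "username": "testuser_abc123_deadbeef",
        "email": "testuser_run_deadbeef@example.com",
        "password": "S3curePass!",
    })
    assert resp.status_code == 201
    assert sent == ["testuser_run_deadbeef@example.com"]
```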


4. Methodology

4.1 Backend — pytest + integration-first

  • Fixture pyramid: session-scoped DB client → function-scoped user/event factories → autouse cleanup.
  • Builder helpers: _valid_event_body(), _create_published_event() — expose only overridable fields, with sane defaults for the rest.
  • Property-based smoke: hypothesis checks invariants such as start_datetime < end_datetime and capacity ≥ 0 against random inputs.
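
A minimal property-based sketch along these lines, assuming a session-scoped API client fixture named `client` and the `_valid_event_body()` builder mentioned above (the route and status codes are assumptions):

```python
# Hedged sketch of a property-based smoke test; the route and status codes are assumptions.
# _valid_event_body() is the project's builder helper (import path omitted here).
from datetime import datetime, timedelta, timezone

from hypothesis import given, strategies as st


@given(offset_hours=st.integers(min_value=-48, max_value=0))
def test_end_not_after_start_is_rejected(client, offset_hours):
    start = datetime.now(timezone.utc) + timedelta(days=7)
    body = _valid_event_body()
    body["start_datetime"] = start.isoformat()
    body["end_datetime"] = (start + timedelta(hours=offset_hours)).isoformat()
    resp = client.post("/events", json=body)
    # Invariant under test: start_datetime < end_datetime must hold for every accepted event.
    assert resp.status_code in (400, 422)
```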

4.2 Frontend — Jest + RTL + MSW

  • API calls are intercepted with MSW handlers; their responses are derived from the same seed JSON snapshots, guaranteeing schema parity with the backend.
  • Component tests rely on canned fixtures (frontend/tests/fixtures/events.json) instead of Faker so changes are easy to track.

4.3 Mobile — JUnit + MockK + Compose Testing

  • Repositories are tested with MockK; Retrofit responses come from the same seed JSON.
  • UI tests use Compose createComposeRule(), with the same fixture set powering navigation and deep-link checks.

4.4 E2E / Scenario testing

  • Tools: Playwright (web) + Maestro (Android).
  • Scope: not unit-level checks but full user journeys (a Playwright sketch of the first journey follows this list):
    1. Community Loop: register → going on three sports events → "Suggested" filter on Discovery surfaces sports events first → user receives a similar-event notification.
    2. Host Lifecycle: create draft → upload image → publish → reply to a comment → cancel → notification fan-out.
    3. Private Access: join via invite token → access request → host approval.
    4. Capacity & Age Gate: "Going" blocked on a full event + 18+ underage block.
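
A condensed sketch of the first journey using Playwright's Python API (the team may drive this from the Node runner instead; the base URL, routes, labels, and test IDs below are all assumptions):

```python
# Hedged sketch of the "Community Loop" journey; every selector, route, and the
# base URL are assumptions about the web app, not its actual markup.
from playwright.sync_api import expect, sync_playwright


def run_community_loop(base_url: str, username: str, password: str) -> None:
    with sync_playwright() as p:
        page = p.chromium.launch(headless=True).new_page()

        page.goto(f"{base_url}/register")
        page.get_by_label("Username").fill(username)
        page.get_by_label("Password").fill(password)
        page.get_by_role("button", name="Register").click()

        # Mark "Going" on three seeded sports events.
        for slug in ("sports-event-1", "sports-event-2", "sports-event-3"):
            page.goto(f"{base_url}/events/{slug}")
            page.get_by_role("button", name="Going").click()

        # The Suggested filter should now surface sports events first.
        page.goto(f"{base_url}/discovery?filter=suggested")
        expect(page.get_by_test_id("event-card").first).to_contain_text("Sports")
```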

5. Validating Test Data Realism

Test data is not treated as valid merely because it was generated; it must demonstrate production-like behavior. Validation is performed in three layers:

5.1 Statistical validation

| Metric | Target distribution | Validation method |
| --- | --- | --- |
| Event / category distribution | Long-tail (top 5 categories ≈ 60% of total) | `SELECT category, COUNT(*)` + Gini coefficient |
| Going / event ratio | Mean 8, std 5, max ≤ capacity | Histogram + capacity-violation assertion |
| Geographic distribution | 81 cities, top 10 ≈ 70% of total | KDE plot + sanity check |
| Comment density | Comments on 30% of events, average 4 | `EXISTS` + `AVG` queries |

A run that deviates from the target by more than ±20% is treated as a "stale seed" and the snapshot is regenerated.
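
The Gini check in the first row, for instance, can be computed directly from the category counts; a hedged sketch in which the target value is an illustrative assumption and the tolerance mirrors the ±20% rule above:

```python
# Hedged sketch of the category long-tail check; the 0.45 target is an
# illustrative assumption, and the ±20% tolerance matches the stale-seed rule.
def gini(counts: list[int]) -> float:
    """Gini coefficient of non-negative counts (0 = uniform, 1 = fully concentrated)."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


def test_category_long_tail(db_conn):
    with db_conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM events GROUP BY category")
        counts = [row[0] for row in cur.fetchall()]
    target = 0.45
    assert abs(gini(counts) - target) <= 0.20 * target, "stale seed: regenerate the snapshot"
```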

5.2 Behavioral validation (NFR)

  • Performance: with 10K events seeded → discovery search p95 < 2 s (k6 load test).
  • Reliability: under 500 concurrent simulated users, HTTP 5xx rate < 1%.
  • Visibility: cancelling an event → it disappears from Discovery within 60 s (polling test).
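
The visibility check is a plain polling loop; a hedged sketch assuming a REST listing endpoint at /events that returns an `items` array (the URL and payload shape are assumptions):

```python
# Hedged sketch of the 60 s visibility check; endpoint path and JSON shape are assumptions.
import time

import requests


def assert_event_hidden_within(base_url: str, event_id: str, timeout_s: int = 60) -> None:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        resp = requests.get(f"{base_url}/events", params={"page_size": 100})
        visible_ids = {event["id"] for event in resp.json().get("items", [])}
        if event_id not in visible_ids:
            return  # the cancelled event is no longer listed
        time.sleep(5)
    raise AssertionError(f"event {event_id} still visible after {timeout_s} s")
```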

5.3 Scenario walkthrough

Each named user scenario is walked through against the seed:

| Scenario | Validation evidence |
| --- | --- |
| User who moves to a new city receives recommendations based on past attendance | E2E: user_42 → relocates to Istanbul → "Suggested" tab returns events matching their historical cluster |
| Event recommendation through notifications | Cron job → user receives a notification for a new event matching their interests, visible in the notification feed |
| Similar event recommendations on the event detail page | Event detail page → "You may also like" block returns ≥3 results from the same category |
| Discovery — "Suggested" filter | Filter panel "Suggested" → backend ?suggested=true → results sorted by the user's attendance vector |
| Host leaderboard | Top hosts query → ordered by ratings × event count desc |

Each scenario is captured as assertion + screenshot in the Playwright report; the demo flow is reproducible without manual UAT.

5.4 Schema drift protection

  • backend/sql/0**.sql migrations are run against a clean DB on every CI build.
  • The seed file is validated with pydantic models, so any backend response-model change makes the seed generator fail fast.
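
A hedged sketch of that fail-fast validation (the field names are illustrative; in practice the backend's actual pydantic response models would be imported rather than redefined):

```python
# Hedged sketch of seed validation; SeedEvent's fields are illustrative stand-ins
# for the backend's real pydantic response models.
import json
from datetime import datetime

from pydantic import BaseModel, NonNegativeInt, model_validator


class SeedEvent(BaseModel):
    title: str
    category: str
    capacity: NonNegativeInt
    start_datetime: datetime
    end_datetime: datetime

    @model_validator(mode="after")
    def check_times(self):
        if self.start_datetime >= self.end_datetime:
            raise ValueError("start_datetime must precede end_datetime")
        return self


def validate_seed(path: str) -> None:
    with open(path, encoding="utf-8") as fh:
        for row in json.load(fh):
            SeedEvent.model_validate(row)  # raises immediately on schema drift
```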

6. Risks and Mitigations

| Risk | Impact | Mitigation |
| --- | --- | --- |
| Seed corpus leaking into production | High | SEED_SCENARIO env gate + a different RLS role on the production Supabase |
| Recommendation algorithm performs well on synthetic data but poorly on real data | Medium | Validate against anonymized attendance patterns from a ~20-person beta test group |
| 10K-event snapshot inflating CI time | Medium | Snapshot restore runs only in the nightly E2E job; PR jobs are limited to L1+L2 |
| pg_cron jobs causing race conditions in the test environment | Low | Cron disabled in tests; the same behavior is exercised by calling the underlying RPC directly |

7. Responsibilities

| Area | Owner |
| --- | --- |
| Scenario seed implementation | Backend team |
| Playwright E2E (web) | Frontend team |
| Maestro E2E (mobile) | Mobile team |
| NFR load testing (k6) | Backend team + DevOps |
| Validation report (Section 5) — per milestone | QA lead |