
Data Strategy

Community Event Platform — bounswe2026group9

1. Overall Approach

Our test data strategy follows an integration-first philosophy: tests run against a real Supabase PostgreSQL database rather than isolated mocks. This ensures that schema constraints (NOT NULL, FK, CHECK), triggers, and pg_cron jobs (e.g. the automatic transition to ended) are validated under real behavior. Only external side-effects (SMTP, OAuth callbacks, Supabase storage uploads) are mocked for determinism.
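
For example, a constraint violation is expected to come from the real schema rather than from a mocked error. A minimal sketch of such a test, assuming a psycopg2 connection fixture named `db_conn` and a `CHECK (capacity >= 0)` constraint on the events table (both names are assumptions, not the project's actual fixtures):

```python
# Hedged sketch: the real database, not a mock, rejects the invalid row.
# `db_conn` and the exact column names are assumptions about the test harness.
import psycopg2
import pytest


def test_negative_capacity_rejected_by_schema(db_conn):
    with pytest.raises(psycopg2.IntegrityError):
        with db_conn.cursor() as cur:
            cur.execute(
                "INSERT INTO events (title, capacity) VALUES (%s, %s)",
                ("Broken event", -5),
            )
    db_conn.rollback()  # keep the connection usable for subsequent tests
```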

The strategy is defined across three layers:

| Layer | Purpose | Volume |
| --- | --- | --- |
| L1 — Unit fixtures | Single endpoint / function | 1–3 rows, single user |
| L2 — Integration suite | Multi-step API journeys | 5–20 rows, isolated run |
| L3 — E2E / Scenario seed | Discovery, Suggested, Notification scenarios, NFR validation | ~10,000 events, ~500 users |

2. Test Data Accumulation Strategy

2.1 Run-scoped, ephemeral data (L1 + L2)

Each test run produces its own isolated data space. From backend/tests_support.py:

```python
TEST_RUN_ID = _normalized_run_id()        # CI run id or uuid hex[:8]
USERNAME_RUN_ID = TEST_RUN_ID[:6]

def build_test_identity(prefix):
    unique = uuid.uuid4().hex[:8]
    username = f"{prefix}_{USERNAME_RUN_ID}_{unique}"
    email    = f"{prefix}_{TEST_RUN_ID}_{unique}@example.com"
    return username, email

def cleanup_email_pattern():
    return f"%_{TEST_RUN_ID}_%@example.com"
```

  • Prefix convention: testuser_, eventtest_, imgtest_, notiftest_ — every domain uses its own prefix.
  • Cleanup: the autouse=True fixture in conftest.py deletes test data via a LIKE pattern after each test; FK cascades remove related events / comments / notifications (a sketch of this fixture follows this list).
  • No collisions: parallel CI jobs receive different TEST_RUN_IDs and therefore never interfere even when sharing tables.
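
A minimal sketch of the cleanup fixture referenced above, assuming a supabase-py client fixture named `db` (the table name and client wiring are assumptions; the actual fixture lives in conftest.py):

```python
# Hedged sketch of the autouse cleanup; the table name and `db` fixture are assumptions.
import pytest

from tests_support import cleanup_email_pattern


@pytest.fixture(autouse=True)
def cleanup_run_scoped_data(db):
    yield  # run the test body first
    # Deleting this run's users is enough: FK cascades remove their
    # events, comments, and notifications.
    db.table("users").delete().like("email", cleanup_email_pattern()).execute()
```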

2.2 Scenario seed corpus (L3)

Suggested events, recommendations, and notification flows cannot be tested without a meaningful history. A new backend/seed/scenario_seed.py provides:

| Set | Content | Purpose |
| --- | --- | --- |
| `users_attendance_clusters` | 500 users grouped into 4 interest clusters (sport, music, tech, food) | Suggested filter & "based on attended events" recommendation tests |
| `events_temporal_spread` | 10,000 events — past/present/future mix, 81 city coordinates | NFR-Scalability (10K events, 2 s search) and map pin density |
| `attendance_history` | ~8 going records per user, distributed within their cluster | Deterministic recommendation algorithm tests |
| `comments_threaded` | 30% of events with 2–10 comments + 1 reply | Comment section + parent_id flow (migration 011) |
| `notifications_backlog` | Mix of bookmark / going / cancellation events | Notification list, mark-as-read, recommendations via notifications |

The seed is gated by a SEED_SCENARIO=1 env variable; it runs only against staging / test Supabase projects to prevent any leakage into production.
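
The gate itself can be a few lines at the top of the seed script; a hedged sketch in which everything except `SEED_SCENARIO` (the `SUPABASE_PROJECT_REF` variable and the allow-list) is an assumption:

```python
# Hedged sketch of the SEED_SCENARIO gate; only SEED_SCENARIO comes from the strategy,
# the project-ref variable and allow-list are illustrative assumptions.
import os
import sys


def ensure_seeding_allowed() -> None:
    if os.environ.get("SEED_SCENARIO") != "1":
        sys.exit("Refusing to seed: set SEED_SCENARIO=1 explicitly.")
    project_ref = os.environ.get("SUPABASE_PROJECT_REF", "")
    if project_ref not in {"staging", "test"}:  # illustrative allow-list
        sys.exit(f"Refusing to seed: '{project_ref}' is not a staging/test project.")


if __name__ == "__main__":
    ensure_seeding_allowed()
    # ... generate users_attendance_clusters, events_temporal_spread, and the other sets
```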

2.3 Snapshot & restore

Because the L3 seed is expensive to generate, a versioned snapshot is produced via pg_dump (backend/seed/snapshots/<date>.sql.gz). E2E jobs restore this snapshot before running, so every PR is tested against the same baseline and the flaky-test surface shrinks.


3. Synthetic vs. Existing Data

| Data Type | Source | Rationale |
| --- | --- | --- |
| User profiles | Synthetic (Faker — faker.providers.person, address) | Real user data carries privacy risk; synthetic data is sufficient |
| Event titles / descriptions | Semi-synthetic: 81 cities × category templates ("Istanbul Yoga Meetup", "Ankara Jazz Night"); Lorem ipsum is forbidden | Realistic appearance in UI and demos |
| Coordinates | Turkish city centers ± Gaussian noise (~3 km std) | Realistic clustering for map density tests |
| Date / time | Programmatic (now ± [-180, +180] days; 20% in the past) | Required to test the automatic ended transition driven by pg_cron |
| Images | In-memory JPEG via PIL (`PILImage.new("RGB", (100,100), "red")`) + MOCK_STORAGE_URL patch | Zero storage cost, deterministic |
| Categories | The 15 existing predefined categories (migration 002) + 5 user-created custom ones | No need to re-create the seed |
| Comments | Faker `text()` plus a small Turkish keyword pool | Realistic UI content |
| Email / auth secrets | Synthetic, @example.com domain | Prevents accidental real SMTP delivery |
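
To make the coordinate and date rows of the table concrete, a hedged sketch of one synthetic event (the city list, templates, and field names are illustrative placeholders, not the actual seed schema):

```python
# Hedged sketch of a synthetic event row; city/category lists and field names are
# placeholders. The real seed uses all 81 city centers and the category templates.
import random
from datetime import datetime, timedelta, timezone

from faker import Faker

fake = Faker("tr_TR")
CITY_CENTERS = {"Istanbul": (41.01, 28.98), "Ankara": (39.93, 32.86)}  # 81 in the real seed
TEMPLATES = ["{city} Yoga Meetup", "{city} Jazz Night"]


def synthetic_event() -> dict:
    city, (lat, lon) = random.choice(list(CITY_CENTERS.items()))
    jitter = 3 / 111  # ~3 km standard deviation expressed in degrees
    # 20% of events in the past, the rest up to 180 days ahead
    days = random.randint(-180, -1) if random.random() < 0.2 else random.randint(0, 180)
    start = datetime.now(timezone.utc) + timedelta(days=days)
    return {
        "title": random.choice(TEMPLATES).format(city=city),
        "description": fake.paragraph(),
        "latitude": random.gauss(lat, jitter),
        "longitude": random.gauss(lon, jitter),
        "start_datetime": start.isoformat(),
        "end_datetime": (start + timedelta(hours=2)).isoformat(),
    }
```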

Mocked components: SMTP (send_verification_email), OAuth callback, Supabase storage upload, and the pg_cron scheduler (fast-forwarded in E2E).
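
A hedged example of patching one of these side effects in a test, assuming the mail helper is importable as app.email.send_verification_email and that registration returns 201 (the module path, route, and status code are assumptions):

```python
# Hedged sketch: patch the SMTP helper so no real mail leaves the test run.
# The dotted path, route, and expected status code are assumptions.
def test_register_does_not_send_real_mail(client, monkeypatch):
    sent = []
    monkeypatch.setattr(
        "app.email.send_verification_email",
        lambda to, token: sent.append(to),
    )
    resp = client.post("/auth/register", json={
        "username": "testuser_abc123_deadbeef",
        "email": "testuser_run_deadbeef@example.com",
        "password": "S3curePass!",
    })
    assert resp.status_code == 201
    assert sent == ["testuser_run_deadbeef@example.com"]
```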


4. Methodology

4.1 Backend — pytest + integration-first

  • Fixture pyramid: session-scoped DB client → function-scoped user/event factories → autouse cleanup.
  • Builder helpers: _valid_event_body(), _create_published_event() — expose only overridable fields, with sane defaults for the rest.
  • Property-based smoke: hypothesis checks invariants such as start_datetime < end_datetime and capacity ≥ 0 against random inputs.
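
A minimal property-based sketch along these lines, assuming a session-scoped API client fixture named `client` and the `_valid_event_body()` builder mentioned above (the route and status codes are assumptions):

```python
# Hedged sketch of a property-based smoke test; the route and status codes are assumptions.
# _valid_event_body() is the project's builder helper (import path omitted here).
from datetime import datetime, timedelta, timezone

from hypothesis import given, strategies as st


@given(offset_hours=st.integers(min_value=-48, max_value=0))
def test_end_not_after_start_is_rejected(client, offset_hours):
    start = datetime.now(timezone.utc) + timedelta(days=7)
    body = _valid_event_body()
    body["start_datetime"] = start.isoformat()
    body["end_datetime"] = (start + timedelta(hours=offset_hours)).isoformat()
    resp = client.post("/events", json=body)
    # Invariant under test: start_datetime < end_datetime must hold for every accepted event.
    assert resp.status_code in (400, 422)
```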

4.2 Frontend — Jest + RTL + MSW

  • API calls are intercepted with MSW handlers; their responses are derived from the same seed JSON snapshots, guaranteeing schema parity with the backend.
  • Component tests rely on canned fixtures (frontend/tests/fixtures/events.json) instead of Faker so changes are easy to track.

4.3 Mobile — JUnit + MockK + Compose Testing

  • Repositories are tested with MockK; Retrofit responses come from the same seed JSON.
  • UI tests use Compose createComposeRule(), with the same fixture set powering navigation and deep-link checks.

4.4 E2E / Scenario testing

  • Tools: Playwright (web) + Maestro (Android).
  • Scope: not unit-level checks but full user journeys (a Playwright sketch of the first journey follows this list):
    1. Community Loop: register → going on three sports events → "Suggested" filter on Discovery surfaces sports events first → user receives a similar-event notification.
    2. Host Lifecycle: create draft → upload image → publish → reply to a comment → cancel → notification fan-out.
    3. Private Access: join via invite token → access request → host approval.
    4. Capacity & Age Gate: "Going" blocked on a full event + 18+ underage block.
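
A condensed sketch of the first journey using Playwright's Python API (the team may drive this from the Node runner instead; the base URL, routes, labels, and test IDs below are all assumptions):

```python
# Hedged sketch of the "Community Loop" journey; every selector, route, and the
# base URL are assumptions about the web app, not its actual markup.
from playwright.sync_api import expect, sync_playwright


def run_community_loop(base_url: str, username: str, password: str) -> None:
    with sync_playwright() as p:
        page = p.chromium.launch(headless=True).new_page()

        page.goto(f"{base_url}/register")
        page.get_by_label("Username").fill(username)
        page.get_by_label("Password").fill(password)
        page.get_by_role("button", name="Register").click()

        # Mark "Going" on three seeded sports events.
        for slug in ("sports-event-1", "sports-event-2", "sports-event-3"):
            page.goto(f"{base_url}/events/{slug}")
            page.get_by_role("button", name="Going").click()

        # The Suggested filter should now surface sports events first.
        page.goto(f"{base_url}/discovery?filter=suggested")
        expect(page.get_by_test_id("event-card").first).to_contain_text("Sports")
```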

5. Validating Test Data Realism

Test data is not treated as valid merely because it was generated; it must demonstrate production-like behavior. Validation is performed in three layers:

5.1 Statistical validation

| Metric | Target distribution | Validation method |
| --- | --- | --- |
| Event / category distribution | Long-tail (top 5 categories ≈ 60% of total) | `SELECT category, COUNT(*)` + Gini coefficient |
| Going / event ratio | Mean 8, std 5, max ≤ capacity | Histogram + capacity-violation assertion |
| Geographic distribution | 81 cities, top 10 ≈ 70% of total | KDE plot + sanity check |
| Comment density | Comments on 30% of events, average 4 | `EXISTS` + `AVG` queries |

A run that deviates from the target by more than ±20% is treated as a "stale seed" and the snapshot is regenerated.
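
The Gini check in the first row, for instance, can be computed directly from the category counts; a hedged sketch in which the target value is an illustrative assumption and the tolerance mirrors the ±20% rule above:

```python
# Hedged sketch of the category long-tail check; the 0.45 target is an
# illustrative assumption, and the ±20% tolerance matches the stale-seed rule.
def gini(counts: list[int]) -> float:
    """Gini coefficient of non-negative counts (0 = uniform, 1 = fully concentrated)."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


def test_category_long_tail(db_conn):
    with db_conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM events GROUP BY category")
        counts = [row[0] for row in cur.fetchall()]
    target = 0.45
    assert abs(gini(counts) - target) <= 0.20 * target, "stale seed: regenerate the snapshot"
```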

5.2 Behavioral validation (NFR)

  • Performance: with 10K events seeded → discovery search p95 < 2 s (k6 load test).
  • Reliability: under 500 concurrent simulated users, HTTP 5xx rate < 1%.
  • Visibility: cancelling an event → it disappears from Discovery within 60 s (polling test).
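
The visibility check is a plain polling loop; a hedged sketch assuming a REST listing endpoint at /events that returns an `items` array (the URL and payload shape are assumptions):

```python
# Hedged sketch of the 60 s visibility check; endpoint path and JSON shape are assumptions.
import time

import requests


def assert_event_hidden_within(base_url: str, event_id: str, timeout_s: int = 60) -> None:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        resp = requests.get(f"{base_url}/events", params={"page_size": 100})
        visible_ids = {event["id"] for event in resp.json().get("items", [])}
        if event_id not in visible_ids:
            return  # the cancelled event is no longer listed
        time.sleep(5)
    raise AssertionError(f"event {event_id} still visible after {timeout_s} s")
```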

5.3 Scenario walkthrough

Each named user scenario is walked through against the seed:

| Scenario | Validation evidence |
| --- | --- |
| User who moves to a new city receives recommendations based on past attendance | E2E: user_42 → relocates to Istanbul → "Suggested" tab returns events matching their historical cluster |
| Event recommendation through notifications | Cron job → user receives a notification for a new event matching their interests, visible in the notification feed |
| Similar event recommendations on the event detail page | Event detail page → "You may also like" block returns ≥3 results from the same category |
| Discovery — "Suggested" filter | Filter panel "Suggested" → backend ?suggested=true → results sorted by the user's attendance vector |
| Host leaderboard | Top hosts query → ordered by ratings × event count desc |

Each scenario is captured as assertion + screenshot in the Playwright report; the demo flow is reproducible without manual UAT.

5.4 Schema drift protection

  • backend/sql/0**.sql migrations are run against a clean DB on every CI build.
  • The seed file is validated with pydantic models, so any backend response-model change makes the seed generator fail fast.
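
A hedged sketch of that fail-fast validation (the field names are illustrative; in practice the backend's actual pydantic response models would be imported rather than redefined):

```python
# Hedged sketch of seed validation; SeedEvent's fields are illustrative stand-ins
# for the backend's real pydantic response models.
import json
from datetime import datetime

from pydantic import BaseModel, NonNegativeInt, model_validator


class SeedEvent(BaseModel):
    title: str
    category: str
    capacity: NonNegativeInt
    start_datetime: datetime
    end_datetime: datetime

    @model_validator(mode="after")
    def check_times(self):
        if self.start_datetime >= self.end_datetime:
            raise ValueError("start_datetime must precede end_datetime")
        return self


def validate_seed(path: str) -> None:
    with open(path, encoding="utf-8") as fh:
        for row in json.load(fh):
            SeedEvent.model_validate(row)  # raises immediately on schema drift
```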

6. Risks and Mitigations

| Risk | Impact | Mitigation |
| --- | --- | --- |
| Seed corpus leaking into production | High | SEED_SCENARIO env gate + a different RLS role on the production Supabase |
| Recommendation algorithm performs well on synthetic data but poorly on real data | Medium | Validate against anonymized attendance patterns from a ~20-person beta test group |
| 10K-event snapshot inflating CI time | Medium | Snapshot restore runs only in the nightly E2E job; PR jobs are limited to L1+L2 |
| pg_cron jobs causing race conditions in the test environment | Low | Cron disabled in tests; the same behavior is exercised by calling the underlying RPC directly |

7. Responsibilities

| Area | Owner |
| --- | --- |
| Scenario seed implementation | Backend team |
| Playwright E2E (web) | Frontend team |
| Maestro E2E (mobile) | Mobile team |
| NFR load testing (k6) | Backend team + DevOps |
| Validation report (Section 5) — per milestone | QA lead |