Operations + reliability
The deploy cadence
Two cadences, depending on where the change lives:
| Surface | Where it deploys | Trigger |
|---|---|---|
| apps/storefront/, apps/company-admin/, apps/internal-admin/, apps/docs/ | Vercel | Auto-deploy on every push to main. Vercel’s turbo-ignore (per app’s vercel.json) skips rebuilds when nothing in the app workspace changed. |
| apps/api/ (the NestJS API) + migrations | Fly.io (merchos-api app, yyz region) | Manual via flyctl deploy --strategy=immediate --config apps/api/fly.toml --dockerfile apps/api/Dockerfile. |
The deploy-per-commit rule (CLAUDE.md §1.4.1)
Never stack multiple unshipped commits when any of them touch prisma/migrations/, apps/api/, or any service the API runs.
This rule was forced by two production outages:
- Phase 0124 (2026-05-04): 10.5h outage. An auto-resolve migration tripped on 41 phantom colors + 1 NULL canonicalStyleCode that hadn’t been caught pre-deploy. Recovery required flyctl mpg proxy + manual data scrub + prisma migrate resolve --rolled-back.
- Phase 0125 (2026-05-06): 14 stacked commits, deploy attempted only at the end. Three SQL bugs surfaced sequentially: apparel-only filter, gen_random_uuid inside SELECT DISTINCT, UNIQUE constraint on a many-to-one mapping. Each was a 5-min fix individually but compounded into multiple failed deploy cycles.
The lesson: the deploy is the verification. prisma validate + pnpm typecheck locally don’t catch SQL semantics, pgbouncer transaction boundaries, or prod-shape edge cases. Stacking commits compounds risk — a single broken migration blocks every subsequent deploy via _prisma_migrations failure tracking.
Cadence per CLAUDE.md §1.4.1:
- Commit.
- Push.
- If the commit touched migrations / API / backend services: flyctl deploy and verify release_command ... completed successfully before the next commit.
- Pure-frontend / pure-docs commits can stack; Vercel auto-deploys handle them.
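The cadence above boils down to one decision per commit: does it touch anything the API deploy covers? A minimal sketch of that decision, assuming paths are repo-root-relative; the function and constant names are hypothetical, and the prefix list only covers the two paths the rule names explicitly.

```typescript
// Path prefixes that force a Fly deploy before the next commit.
// Only the two paths named in the rule are listed; "any service the API
// runs" is not enumerated in the rule, so it is left out of this sketch.
const DEPLOY_GATED_PREFIXES: readonly string[] = [
  "prisma/migrations/",
  "apps/api/",
];

type NextStep = "fly-deploy-then-verify" | "safe-to-stack";

// Classify a commit by the files it touched
// (for example the output of `git diff --name-only HEAD~1`).
function nextStepForCommit(changedPaths: readonly string[]): NextStep {
  const touchesBackend = changedPaths.some((changedPath) =>
    DEPLOY_GATED_PREFIXES.some((prefix) => changedPath.startsWith(prefix))
  );
  return touchesBackend ? "fly-deploy-then-verify" : "safe-to-stack";
}

console.log(nextStepForCommit(["apps/storefront/app/page.tsx"]));
// prints "safe-to-stack"
console.log(nextStepForCommit(["apps/docs/index.md", "apps/api/src/main.ts"]));
// prints "fly-deploy-then-verify"
```

A single backend file in an otherwise frontend-only commit is enough to require the deploy, which matches the "any of them touch" wording of the rule.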
Exception: when a long-running operator job (e.g. a 21h backfill) is in flight, deploys can be batched at the end to avoid killing the worker — per the 2026-05-19 batch-deploy strategy that kept Phase 0141 (split-tender) un-deployed for ~18h.
The deploy-strategy quirk
--strategy=immediate was forced by the Fly orchestration timeout lesson (docs/operations/fly-deploy-strategy.md):
auto_stop_machines = "stop" + min_machines_running = 1 races the rolling-restart health-check polling on a 2-machine fleet with low pre-launch traffic. release_command always succeeded; the timeout was on flyctl’s polling, not the actual deploy.
Switch back to --strategy=rolling when real customer traffic arrives post-launch.
The min-machines-running quirk
apps/api/fly.toml now has min_machines_running = 2 (bumped 2026-05-19). Was 1; that killed detached operator jobs (Phase 0117b refetch, Phase 0155 Gelato CA backfill) when Fly’s autoscaler saw no HTTP traffic on the worker machine. The Phase 0155 incident lost ~1.5h of backfill before we bumped this. Costs ~$10/mo more; eliminates operator-babysit overhead.
Post-deploy verification
Per docs/operations/migration-lock-recovery.md:
Post-deploy verification = curl /health AFTER rolling restart, not release_command success.
Forced by the 2026-05-18 Phase 0154 two-bug incident — release_command succeeded but the API failed to boot on a Nest DI lazy-type-import bug. Health check is the canonical “yes the system is alive.”
LAUNCH_CHECKLIST.md item: every Fly deploy ends with curl https://api.yourcustommerch.ca/api/v1/health returning 200.
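To make the distinction concrete, here is a small sketch of how the two signals combine. The type and function names are hypothetical; the only facts taken from this chapter are that release_command covers migrations and that a 200 from /health is the canonical liveness signal.

```typescript
type DeployObservation = {
  // Did the ephemeral release_command machine finish its migrations?
  releaseCommandSucceeded: boolean;
  // HTTP status returned by GET /health, or null if the request failed.
  healthStatus: number | null;
};

type DeployVerdict = "verified" | "migration-failed" | "boot-failed";

// A deploy counts as verified only when /health answers 200.
// release_command success alone is necessary, not sufficient.
function classifyDeploy(obs: DeployObservation): DeployVerdict {
  if (!obs.releaseCommandSucceeded) return "migration-failed";
  return obs.healthStatus === 200 ? "verified" : "boot-failed";
}

// The Phase 0154 shape: migrations fine, API never came up.
console.log(classifyDeploy({ releaseCommandSucceeded: true, healthStatus: null }));
// prints "boot-failed"
console.log(classifyDeploy({ releaseCommandSucceeded: true, healthStatus: 200 }));
// prints "verified"
```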
The CI gate (CLAUDE.md §1.4.2)
Every push to main runs .github/workflows/ci.yml:
- Typecheck across all packages (pnpm typecheck)
- Lint (currency-literal, i18n parity, rules-of-hooks, typedRoutes)
- Unit tests
- Build
- Integration tests (Postgres-backed, since 2026-05-19): Postgres 15 service container + Prisma db push + pnpm --filter @merchos/api test:integration
The rule: after every push, check the CI run. A red CI badge that gets ignored is no signal at all. Forced by the 2026-05-17 incident where CI was red for 25 commits over ~17 hours because a currency-literal violation went unnoticed.
gh run list --workflow=ci.yml --limit 1 is the one-liner check. CI typically finishes in ~5 min.
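If the check is ever scripted instead of eyeballed, the one-liner can emit JSON (gh run list --workflow=ci.yml --limit 1 --json status,conclusion,headSha) and a few lines can turn that into a verdict. This is a sketch, not something that exists in the repo; the function name and verdict labels are hypothetical.

```typescript
// One entry from `gh run list --json status,conclusion,headSha`.
type WorkflowRun = {
  status: string;     // e.g. "queued", "in_progress", "completed"
  conclusion: string; // e.g. "success", "failure"; only meaningful once completed
  headSha: string;
};

type CiVerdict = "green" | "red" | "pending" | "no-runs";

// Reduce the JSON printed by gh to a single verdict for the latest run.
function ciVerdict(ghJson: string): CiVerdict {
  const runs = JSON.parse(ghJson) as WorkflowRun[];
  if (runs.length === 0) return "no-runs";
  const latest = runs[0];
  if (latest.status !== "completed") return "pending";
  // Anything other than success (failure, cancelled, timed out) is red.
  return latest.conclusion === "success" ? "green" : "red";
}

const sample = '[{"status":"completed","conclusion":"failure","headSha":"abc123"}]';
console.log(ciVerdict(sample));
// prints "red"
```

Treating every non-success conclusion as red is deliberate: a cancelled or timed-out run gives no more assurance than a failed one.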
Migrations + the CTE-shape rule
Prisma migrations under prisma/migrations/. Naming convention: YYYYMMDD[a-z]_phase_name.
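The naming convention can be expressed as a regular expression. This sketch assumes the letter suffix is optional (the migration names quoted in this chapter carry none) and that phase names are lowercase snake_case; both are readings of the convention, not a checked rule.

```typescript
// YYYYMMDD, optional same-day ordering letter, underscore, snake_case phase name.
const MIGRATION_NAME = /^\d{8}[a-z]?_[a-z0-9_]+$/;

function isValidMigrationName(dirName: string): boolean {
  return MIGRATION_NAME.test(dirName);
}

console.log(isValidMigrationName("20260421_vertical_gated_pricing_model"));
// prints true
console.log(isValidMigrationName("2026-04-21_vertical_gated_pricing_model"));
// prints false
```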
Migrations run via release_command on every Fly deploy — a separate ephemeral machine spins up, runs prisma migrate deploy, then terminates. The API machines only start serving traffic after migrations complete.
The CTE-shape rule (2026-05-18 incident lesson):
Migrations that backfill via multi-table JOIN must use CTE shape (resolve target rows in WITH SELECT, then UPDATE FROM cte). Direct UPDATE T t SET … FROM X x INNER JOIN J j ON j.col = t.col is illegal in Postgres (E42P01) but passes SELECT-preview gates.
Forced by Phase 0154 Sub-phase 0116b: an UPDATE-FROM-INNER-JOIN backfill failed at deploy time, costing ~22 min of downtime. Recovery: rewrote the SQL as a CTE, redeployed, clean.
scripts/migration-dry-run.sh (LAUNCH_CHECKLIST 🔴 gate) flags this pattern + recommends the CTE rewrite. Run against prod via flyctl mpg proxy for any destructive auto-resolve migration.
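The two shapes side by side, plus a deliberately naive detector. This is an illustration only and not the logic of scripts/migration-dry-run.sh, which is not shown in this chapter; the table, column, and function names are placeholders. Postgres rejects the direct form because the JOIN’s ON clause references the UPDATE target, which is not visible inside the FROM list.

```typescript
// Rejected by Postgres: the ON clause references the UPDATE target `t`.
const directJoin = `
  UPDATE "T" t SET val = x.val
  FROM "X" x INNER JOIN "J" j ON j.col = t.col
  WHERE x.id = j.x_id`;

// CTE shape: resolve the target rows first, then UPDATE ... FROM the CTE.
const cteShape = `
  WITH resolved AS (
    SELECT t.id, x.val
    FROM "T" t
    INNER JOIN "J" j ON j.col = t.col
    INNER JOIN "X" x ON x.id = j.x_id
  )
  UPDATE "T" t SET val = r.val
  FROM resolved r
  WHERE t.id = r.id`;

// Flags UPDATE followed by FROM followed by JOIN within one statement.
const UPDATE_FROM_JOIN = /\bUPDATE\b[\s\S]*?\bFROM\b[\s\S]*?\bJOIN\b/i;

// Returns the statements worth a manual look before deploying.
function findUpdateFromJoin(sql: string): string[] {
  return sql
    .split(";")
    .map((statement) => statement.trim())
    .filter((statement) => UPDATE_FROM_JOIN.test(statement));
}

console.log(findUpdateFromJoin(directJoin).length); // prints 1
console.log(findUpdateFromJoin(cteShape).length);   // prints 0
```

The detector over-flags: an UPDATE whose JOIN condition never touches the target table is legal but still matches. For a pre-deploy gate that direction of error is the cheaper one.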
The migration replay-from-scratch caveat (POST_DEPLOY_BACKLOG.md “Migration history can’t be replayed from scratch”):
A fresh database applying all migrations in lexicographic order hits ERROR: type "PricingModel" does not exist because 20260421_billing_discretion_and_payment_flow sorts BEFORE the enum-creating 20260421_vertical_gated_pricing_model (same day; underscore-after-letter ordering). Production is fine because migrations applied in author-time order. Both the CI e2e workflow and integration tests use prisma db push (applies current schema directly) instead of migrate deploy (replays history). Fix is low-priority — disaster recovery restores from pg_dump, not from migration replay.
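The ordering is easy to reproduce with the two migration names exactly as written above. A plain string sort, which is how a lexicographic replay would order them, puts the billing migration first:

```typescript
const migrationDirs = [
  "20260421_vertical_gated_pricing_model",        // creates the PricingModel enum
  "20260421_billing_discretion_and_payment_flow", // fails without that enum
];

// Default Array.prototype.sort compares strings code unit by code unit.
const replayOrder = [...migrationDirs].sort();

console.log(replayOrder[0]);
// prints "20260421_billing_discretion_and_payment_flow"
```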
Migration lock recovery
docs/operations/migration-lock-recovery.md. Prisma’s migration runner takes Postgres advisory lock 72707369 for the duration of migrate deploy. If the runner is killed mid-execution (Fly machine destroyed, network blip, OOM), the lock can stay held by an orphan backend → next deploy hangs forever.
The pre-flight check in apps/api/entrypoint.sh queries pg_locks WHERE objid=72707369; on a stuck lock it attempts pg_terminate_backend and prints actionable recovery SQL if that fails.
Operator recovery:
- flyctl mpg proxy 9g6y30w99wlov5ml opens a local tunnel to prod Postgres.
- PGPASSWORD=<...> psql -h localhost -p 16380 -U fly-user -d fly-db
- SELECT pid, granted FROM pg_locks WHERE objid = 72707369;
- SELECT pg_terminate_backend(<pid>); for the granted holder.
- Re-deploy.
Hit this during Phase 0155 (2026-05-19) when the release_command machine got killed by the flyctl orchestration timeout while still holding the lock; documented above.
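The "granted holder" step in the recovery above is the part that is easy to get wrong under pressure: rows with granted = false are sessions waiting for the lock (typically the hung deploy), not the orphan holding it. A sketch of that selection, with hypothetical names, operating on rows shaped like the pg_locks query result:

```typescript
// One row from: SELECT pid, granted FROM pg_locks WHERE objid = 72707369;
type LockRow = { pid: number; granted: boolean };

// Only backends that actually hold the lock are candidates for termination.
function lockHolderPids(rows: readonly LockRow[]): number[] {
  return rows.filter((row) => row.granted).map((row) => row.pid);
}

// Build the statement an operator would paste into psql.
function terminationSql(pid: number): string {
  return `SELECT pg_terminate_backend(${pid});`;
}

const observed: LockRow[] = [
  { pid: 4121, granted: true },  // orphaned migration runner (example pid)
  { pid: 5230, granted: false }, // new deploy, waiting on the lock
];

console.log(lockHolderPids(observed).map(terminationSql));
```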
Postgres + pgbouncer
Fly Managed Postgres (flympg) cluster 9g6y30w99wlov5ml in yyz. App connects via pgbouncer (pgbouncer.9g6y30w99wlov5ml.flympg.net:5432) — session mode, not transaction mode (per docs/operations/db-connection.md).
Session mode is required because Prisma’s $transaction({ isolationLevel }) + FOR UPDATE locks need a single backend across the whole transaction. Transaction-mode pgbouncer would hand the next statement to a different backend mid-transaction — silently breaking the FOR UPDATE lock on SpendRecord.
This is the most subtle operational fact in MerchOS. Switching pgbouncer modes silently breaks the spend-limit invariant; tests would still pass (they go direct to Postgres in CI); production would have intermittent over-cap orders that no log would explain.
Backups + PITR
docs/operations/db-recovery.md. Fly Managed Postgres has built-in point-in-time recovery with daily snapshots + 7-day retention. PITR window: 5 min granularity within the last 7 days.
Operator drill: flyctl postgres restore against a target timestamp. The restore creates a NEW Postgres cluster; the operator swaps the DATABASE_URL secret + restarts the API.
Pre-launch we ran one DR drill (Phase 9). Post-launch the POST_DEPLOY_BACKLOG.md “Pre-revenue reliability hardening” item plans a quarterly drill.
Integration tests
docs/operations/integration-tests.md. Phase 0155 (2026-05-19) added Postgres-backed integration tests.
When to write one vs a unit test:
- Unit test (matches *.spec.ts, mocks Prisma): pure transformation, validators, formatters, predicates.
- Integration test (matches *.int.spec.ts, real Prisma): anything where correctness depends on database semantics, such as FK CASCADE / NoAction, SQL JOINs through Prisma include, multi-row state diffs, materialized views, multi-table semantics.
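One wrinkle in the naming split: every *.int.spec.ts file also ends in .spec.ts, so whatever separates the two suites has to test the longer suffix first (or exclude it from the unit pattern). A sketch of that ordering, with a hypothetical helper name; the actual Jest configuration is not shown here.

```typescript
type TestKind = "integration" | "unit" | "not-a-test";

// Check the more specific suffix first: "x.int.spec.ts" also ends in ".spec.ts".
function testKind(fileName: string): TestKind {
  if (fileName.endsWith(".int.spec.ts")) return "integration";
  if (fileName.endsWith(".spec.ts")) return "unit";
  return "not-a-test";
}

console.log(testKind("spend-settings.int.spec.ts")); // prints "integration"
console.log(testKind("currency-format.spec.ts"));    // prints "unit"
```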
Running locally:
```bash
# Start a Postgres container (one-time)
docker run -d --name merchos-test-pg \
  -e POSTGRES_USER=merchos -e POSTGRES_PASSWORD=dev \
  -e POSTGRES_DB=merchos_test -p 5432:5432 postgres:15

# Run the suite
TEST_DATABASE_URL=postgresql://merchos:dev@localhost:5432/merchos_test \
  pnpm --filter @merchos/api test:integration
```

TEST_DATABASE_URL is required: globalSetup refuses to run without it (prevents accidentally pointing at prod). setupIntegrationDb() registers a beforeEach that TRUNCATEs every non-_prisma_migrations table; maxWorkers: 1 serializes runs.
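A rough sketch of the two safety mechanisms described above, written as pure functions so the behaviour is easy to see. These are not the real globalSetup or setupIntegrationDb() implementations; the function names are hypothetical, and the RESTART IDENTITY CASCADE options are an assumption about how such a TRUNCATE is usually written.

```typescript
// Refuse to proceed unless the test database URL was set explicitly.
function requireTestDatabaseUrl(env: Record<string, string | undefined>): string {
  const url = env["TEST_DATABASE_URL"];
  if (url === undefined || url.trim() === "") {
    throw new Error("TEST_DATABASE_URL is required; refusing to run integration tests");
  }
  return url;
}

// Build one TRUNCATE covering every table except Prisma's migration ledger.
// Returns null when there is nothing to truncate.
function buildTruncateSql(tableNames: readonly string[]): string | null {
  const targets = tableNames.filter((table) => table !== "_prisma_migrations");
  if (targets.length === 0) return null;
  const quoted = targets
    .map((table) => `"${table.replace(/"/g, '""')}"`) // quote identifiers safely
    .join(", ");
  // RESTART IDENTITY CASCADE is an assumption, not confirmed by the helper.
  return `TRUNCATE TABLE ${quoted} RESTART IDENTITY CASCADE;`;
}

console.log(buildTruncateSql(["Order", "_prisma_migrations", "SpendRecord"]));
// prints: TRUNCATE TABLE "Order", "SpendRecord" RESTART IDENTITY CASCADE;
```

Truncating before each test rather than after means a failed test leaves its rows behind for inspection, and running with a single worker keeps one test’s TRUNCATE from wiping another’s fixtures.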
Today’s coverage: 32 tests across 5 suites (master.service.int.spec.ts, sku-variant-cell-state.service.int.spec.ts, spend-settings.int.spec.ts, split-tender-webhook.int.spec.ts, split-tender-refund.int.spec.ts). More to add — see POST_DEPLOY_BACKLOG.md “Test framework rollout.”
Observability — current state
Honest answer: partial.
| Layer | Current | Gap |
|---|---|---|
| Application logs | Fly’s stdout (flyctl logs) — JSON-line | No log aggregation; grep-on-Fly only |
| Application error tracking | Sentry wired into the API (docs/operations/sentry-setup.md) | Frontends use Sentry too; integration is solid |
| Uptime monitoring | docs/operations/uptime-monitoring-setup.md | External monitor pinging /health |
| Stripe webhook delivery | Stripe dashboard | No proactive alert on delivery failures |
| Cron job execution | @Cron logs in stdout | No “this cron didn’t run today” alert |
| Catalog health (sync staleness, unresolvable CuratedStyles) | PlatformInvariantService boot-time check | Runtime drift goes undetected until next boot |
| Margin drift (Gelato cost increases) | Not implemented | Post-launch — see POST_DEPLOY_BACKLOG.md |
| Performance / latency | Vercel Analytics on frontends; nothing on API | No APM |
Pre-launch we’re accepting these gaps. Post-launch the POST_DEPLOY_BACKLOG.md “Catalog health observability dashboard” + “Margin-erosion alert layer” items address the most operationally critical ones.
Runbooks
The docs/operations/ directory is the on-call reference:
| Runbook | When you read this |
|---|---|
| db-connection.md | Touching pgbouncer config or debugging transaction-mode behavior |
| db-recovery.md | PITR restore; data-loss event |
| fly-deploy-strategy.md | Deploy hangs / health-check polling timeout |
| integration-tests.md | Adding a *.int.spec.ts suite or debugging an integration test |
| migration-lock-recovery.md | Stuck migration lock (pg_locks objid=72707369) |
| r2-cors-config.md | Cloudflare R2 bucket CORS issues |
| sentry-setup.md | Sentry configuration / DSN rotation |
| tier-swap-compensation.md | Tier-swap rolled back; need to manually compensate Stripe |
| uptime-monitoring-setup.md | Uptime monitor not paging on outage |
Plus FIRST_ORDER_RUNBOOK.md at repo root for the first-real-customer end-to-end walk + the 168-line Printful supplement (added 2026-05-19).
What’s next
- Conventions — CLAUDE.md rules + the ADR pattern.
- Reference index — links to every ADR, service note, runbook, phase.
Canonical sources
- CLAUDE.md §1.4.1, §1.4.2: deploy + CI cadence rules.
- apps/api/fly.toml: Fly config (min_machines_running, memory, release_command).
- apps/api/entrypoint.sh: pre-flight migration-lock check + entrypoint.
- apps/api/jest.integration.config.js: integration-test runner config.
- apps/api/src/__test-helpers__/: setupIntegrationDb, fixture builders.
- scripts/migration-dry-run.sh: destructive-migration gate.
- scripts/check-migration-lock.cjs: pre-flight lock check called from entrypoint.sh.
- docs/operations/: every runbook listed above.
- FIRST_ORDER_RUNBOOK.md: first-real-customer walk.
- LAUNCH_CHECKLIST.md: pre-launch operator checklist.
Triggers for update
Update this chapter if you:
- Add or remove a CLAUDE.md operational rule (today: §1.4.1 deploy-per-commit, §1.4.2 CI verification).
- Change Fly fly.toml topology (min_machines_running, region, machine size).
- Change pgbouncer mode (session → transaction or back). This would be a regression on the spend-limit invariant; flag it.
- Add or change the deploy strategy (immediate ↔ rolling).
- Add a new operational runbook to docs/operations/.
- Land a new observability layer (APM, log aggregation, cron-execution alerting).
- Change the CTE-shape migration rule.
- Add or remove a CI gate.
- Change the integration-test pattern (today: *.int.spec.ts + prisma db push + maxWorkers=1).