Operations + reliability

The deploy cadence

Two cadences, depending on where the change lives:

Surface	Where it deploys	Trigger
`apps/storefront/`, `apps/company-admin/`, `apps/internal-admin/`, `apps/docs/`	Vercel	Auto-deploy every push to `main`. Vercel’s `turbo-ignore` (per app’s `vercel.json`) skips rebuilds when nothing in the app workspace changed.
`apps/api/` (the NestJS API) + migrations	Fly.io (`merchos-api` app, `yyz` region)	Manual via `flyctl deploy --strategy=immediate --config apps/api/fly.toml --dockerfile apps/api/Dockerfile`.

The deploy-per-commit rule (CLAUDE.md §1.4.1)

Never stack multiple unshipped commits when any of them touch prisma/migrations/, apps/api/, or any service the API runs.

This rule was forced by two production outages:

Phase 0124 (2026-05-04) — 10.5h outage. An auto-resolve migration tripped on 41 phantom colors + 1 NULL canonicalStyleCode that hadn’t been caught pre-deploy. Recovery required flyctl mpg proxy + manual data scrub + prisma migrate resolve --rolled-back.
Phase 0125 (2026-05-06) — 14 stacked commits, deploy attempted only at the end. Three SQL bugs surfaced sequentially: apparel-only filter, gen_random_uuid inside SELECT DISTINCT, UNIQUE constraint on a many-to-one mapping. Each was a 5-min fix individually but compounded into multiple failed deploy cycles.

The lesson: the deploy is the verification. prisma validate + pnpm typecheck locally don’t catch SQL semantics, pgbouncer transaction boundaries, or prod-shape edge cases. Stacking commits compounds risk — a single broken migration blocks every subsequent deploy via _prisma_migrations failure tracking.

Cadence per CLAUDE.md §1.4.1:

Commit.
Push.
If the commit touched migrations / API / backend services: flyctl deploy and verify release_command ... completed successfully before the next commit.
Pure-frontend / pure-docs commits can stack — Vercel auto-deploys handle them.

Exception: when a long-running operator job (e.g. a 21h backfill) is in flight, deploys can be batched at the end to avoid killing the worker — per the 2026-05-19 batch-deploy strategy that kept Phase 0141 (split-tender) un-deployed for ~18h.

The deploy-strategy quirk

--strategy=immediate was forced by the Fly orchestration timeout lesson (docs/operations/fly-deploy-strategy.md):

auto_stop_machines = "stop" + min_machines_running = 1 races the rolling-restart health-check polling on a 2-machine fleet with low pre-launch traffic. release_command always succeeded; the timeout was on flyctl’s polling, not the actual deploy.

Switch back to --strategy=rolling when real customer traffic arrives post-launch.

The min-machines-running quirk

apps/api/fly.toml now has min_machines_running = 2 (bumped 2026-05-19). Was 1; that killed detached operator jobs (Phase 0117b refetch, Phase 0155 Gelato CA backfill) when Fly’s autoscaler saw no HTTP traffic on the worker machine. The Phase 0155 incident lost ~1.5h of backfill before we bumped this. Costs ~$10/mo more; eliminates operator-babysit overhead.

Post-deploy verification

Per docs/operations/migration-lock-recovery.md:

Post-deploy verification = curl /health AFTER rolling restart, not release_command success.

Forced by the 2026-05-18 Phase 0154 two-bug incident — release_command succeeded but the API failed to boot on a Nest DI lazy-type-import bug. Health check is the canonical “yes the system is alive.”

LAUNCH_CHECKLIST.md item: every Fly deploy ends with curl https://api.yourcustommerch.ca/api/v1/health returning 200.

The CI gate (CLAUDE.md §1.4.2)

Every push to main runs .github/workflows/ci.yml:

Typecheck across all packages (pnpm typecheck)
Lint (currency-literal, i18n parity, rules-of-hooks, typedRoutes)
Unit tests
Build
Integration tests (Postgres-backed, since 2026-05-19) — Postgres 15 service container + Prisma db push + pnpm --filter @merchos/api test:integration

The rule: after every push, check the CI run. A red CI badge that gets ignored is no signal at all. Forced by the 2026-05-17 incident where CI was red for 25 commits over ~17 hours because a currency-literal violation went unnoticed.

gh run list --workflow=ci.yml --limit 1 is the one-liner check. CI typically finishes in ~5 min.

Migrations + the CTE-shape rule

Prisma migrations under prisma/migrations/. Naming convention: YYYYMMDD[a-z]_phase_name.

Migrations run via release_command on every Fly deploy — a separate ephemeral machine spins up, runs prisma migrate deploy, then terminates. The API machines only start serving traffic after migrations complete.

The CTE-shape rule (2026-05-18 incident lesson):

Migrations that backfill via multi-table JOIN must use CTE shape (resolve target rows in WITH SELECT, then UPDATE FROM cte). Direct UPDATE T t SET … FROM X x INNER JOIN J j ON j.col = t.col is illegal in Postgres (E42P01) but passes SELECT-preview gates.

Forced by Phase 0154 Sub-phase 0116b — a UPDATE-FROM-INNER-JOIN backfill failed at deploy time, costing ~22 min of downtime. Recovery: rewrote the SQL as a CTE; redeployed; clean.

scripts/migration-dry-run.sh (LAUNCH_CHECKLIST 🔴 gate) flags this pattern + recommends the CTE rewrite. Run against prod via flyctl mpg proxy for any destructive auto-resolve migration.

The migration replay-from-scratch caveat (POST_DEPLOY_BACKLOG.md “Migration history can’t be replayed from scratch”):

A fresh database applying all migrations in lexicographic order hits ERROR: type "PricingModel" does not exist because 20260421_billing_discretion_and_payment_flow sorts BEFORE the enum-creating 20260421_vertical_gated_pricing_model (same day; underscore-after-letter ordering). Production is fine because migrations applied in author-time order. Both the CI e2e workflow and integration tests use prisma db push (applies current schema directly) instead of migrate deploy (replays history). Fix is low-priority — disaster recovery restores from pg_dump, not from migration replay.

Migration lock recovery

docs/operations/migration-lock-recovery.md. Prisma’s migration runner takes Postgres advisory lock 72707369 for the duration of migrate deploy. If the runner is killed mid-execution (Fly machine destroyed, network blip, OOM), the lock can stay held by an orphan backend → next deploy hangs forever.

apps/api/entrypoint.sh pre-flight check queries pg_locks WHERE objid=72707369; on stuck-lock, attempts pg_terminate_backend and prints actionable recovery SQL on fail.

Operator recovery:

flyctl mpg proxy 9g6y30w99wlov5ml opens a local tunnel to prod Postgres.
PGPASSWORD=<...> psql -h localhost -p 16380 -U fly-user -d fly-db
SELECT pid, granted FROM pg_locks WHERE objid = 72707369;
SELECT pg_terminate_backend(<pid>); for the granted holder.
Re-deploy.

Hit this during Phase 0155 (2026-05-19) when the release_command machine got killed by the flyctl orchestration timeout while still holding the lock; documented above.

Postgres + pgbouncer

Fly Managed Postgres (flympg) cluster 9g6y30w99wlov5ml in yyz. App connects via pgbouncer (pgbouncer.9g6y30w99wlov5ml.flympg.net:5432) — session mode, not transaction mode (per docs/operations/db-connection.md).

Session mode is required because Prisma’s $transaction({ isolationLevel }) + FOR UPDATE locks need a single backend across the whole transaction. Transaction-mode pgbouncer would hand the next statement to a different backend mid-transaction — silently breaking the FOR UPDATE lock on SpendRecord.

This is the most subtle operational fact in MerchOS. Switching pgbouncer modes silently breaks the spend-limit invariant; tests would still pass (they go direct to Postgres in CI); production would have intermittent over-cap orders that no log would explain.

Backups + PITR

docs/operations/db-recovery.md. Fly Managed Postgres has built-in point-in-time recovery with daily snapshots + 7-day retention. PITR window: 5 min granularity within the last 7 days.

Operator drill: flyctl postgres restore against a target timestamp. The restore creates a NEW Postgres cluster; the operator swaps the DATABASE_URL secret + restarts the API.

Pre-launch we ran one DR drill (Phase 9). Post-launch the POST_DEPLOY_BACKLOG.md “Pre-revenue reliability hardening” item plans a quarterly drill.

Integration tests

docs/operations/integration-tests.md. Phase 0155 (2026-05-19) added Postgres-backed integration tests.

When to write one vs a unit test:

Unit test (matches *.spec.ts, mocks Prisma) — pure transformation, validators, formatters, predicates.
Integration test (matches *.int.spec.ts, real Prisma) — anything where correctness depends on database semantics: FK CASCADE / NoAction, SQL JOINs through Prisma include, multi-row state diffs, materialized views, multi-table semantics.

Running locally:

# Start a Postgres container (one-time)
docker run -d --name merchos-test-pg \
  -e POSTGRES_USER=merchos -e POSTGRES_PASSWORD=dev \
  -e POSTGRES_DB=merchos_test -p 5432:5432 postgres:15

# Run the suite
TEST_DATABASE_URL=postgresql://merchos:dev@localhost:5432/merchos_test \
  pnpm --filter @merchos/api test:integration

TEST_DATABASE_URL is required — globalSetup refuses to run without it (prevents accidentally pointing at prod). setupIntegrationDb() registers a beforeEach that TRUNCATEs every non-_prisma_migrations table; maxWorkers: 1 serializes runs.

Today’s coverage: 32 tests across 5 suites (master.service.int.spec.ts, sku-variant-cell-state.service.int.spec.ts, spend-settings.int.spec.ts, split-tender-webhook.int.spec.ts, split-tender-refund.int.spec.ts). More to add — see POST_DEPLOY_BACKLOG.md “Test framework rollout.”

Observability — current state

Honest answer: partial.

Layer	Current	Gap
Application logs	Fly’s stdout (`flyctl logs`) — JSON-line	No log aggregation; grep-on-Fly only
Application error tracking	Sentry wired into the API (`docs/operations/sentry-setup.md`)	Frontends use Sentry too; integration is solid
Uptime monitoring	`docs/operations/uptime-monitoring-setup.md`	External monitor pinging `/health`
Stripe webhook delivery	Stripe dashboard	No proactive alert on delivery failures
Cron job execution	`@Cron` logs in stdout	No “this cron didn’t run today” alert
Catalog health (sync staleness, unresolvable CuratedStyles)	`PlatformInvariantService` boot-time check	Runtime drift goes undetected until next boot
Margin drift (Gelato cost increases)	Not implemented	Post-launch — see POST_DEPLOY_BACKLOG.md
Performance / latency	Vercel Analytics on frontends; nothing on API	No APM

Pre-launch we’re accepting these gaps. Post-launch the POST_DEPLOY_BACKLOG.md “Catalog health observability dashboard” + “Margin-erosion alert layer” items address the most operationally critical ones.

Runbooks

The docs/operations/ directory is the on-call reference:

Runbook	When you read this
`db-connection.md`	Touching pgbouncer config or debugging transaction-mode behavior
`db-recovery.md`	PITR restore; data-loss event
`fly-deploy-strategy.md`	Deploy hangs / health-check polling timeout
`integration-tests.md`	Adding a `*.int.spec.ts` suite or debugging an integration test
`migration-lock-recovery.md`	Stuck migration lock (`pg_locks objid=72707369`)
`r2-cors-config.md`	Cloudflare R2 bucket CORS issues
`sentry-setup.md`	Sentry configuration / DSN rotation
`tier-swap-compensation.md`	Tier-swap rolled back; need to manually compensate Stripe
`uptime-monitoring-setup.md`	Uptime monitor not paging on outage

Plus FIRST_ORDER_RUNBOOK.md at repo root for the first-real-customer end-to-end walk + the 168-line Printful supplement (added 2026-05-19).

What’s next

Conventions — CLAUDE.md rules + the ADR pattern.
Reference index — links to every ADR, service note, runbook, phase.

Canonical sources

CLAUDE.md §1.4.1, §1.4.2 — deploy + CI cadence rules.
apps/api/fly.toml — Fly config (min_machines_running, memory, release_command).
apps/api/entrypoint.sh — pre-flight migration-lock check + entrypoint.
apps/api/jest.integration.config.js — integration-test runner config.
apps/api/src/__test-helpers__/ — setupIntegrationDb, fixture builders.
scripts/migration-dry-run.sh — destructive-migration gate.
scripts/check-migration-lock.cjs — pre-flight lock check called from entrypoint.sh.
docs/operations/ — every runbook listed above.
FIRST_ORDER_RUNBOOK.md — first-real-customer walk.
LAUNCH_CHECKLIST.md — pre-launch operator checklist.

Triggers for update

Update this chapter if you:

Add or remove a CLAUDE.md operational rule (today: §1.4.1 deploy-per-commit, §1.4.2 CI verification).
Change Fly fly.toml topology (min_machines_running, region, machine size).
Change pgbouncer mode (session → transaction or back) — this would be a regression on the spend-limit invariant; flag it.
Add or change the deploy strategy (immediate ↔ rolling).
Add a new operational runbook to docs/operations/.
Land a new observability layer (APM, log aggregation, cron-execution alerting).
Change the CTE-shape migration rule.
Add or remove a CI gate.
Change the integration-test pattern (today: *.int.spec.ts + prisma db push + maxWorkers=1).