Operations + reliability

The deploy cadence

Two cadences, depending on where the change lives:

| Surface | Where it deploys | Trigger |
| --- | --- | --- |
| apps/storefront/, apps/company-admin/, apps/internal-admin/, apps/docs/ | Vercel | Auto-deploy every push to main. Vercel’s turbo-ignore (per app’s vercel.json) skips rebuilds when nothing in the app workspace changed. |
| apps/api/ (the NestJS API) + migrations | Fly.io (merchos-api app, yyz region) | Manual via flyctl deploy --strategy=immediate --config apps/api/fly.toml --dockerfile apps/api/Dockerfile. |

The deploy-per-commit rule (CLAUDE.md §1.4.1)

Never stack multiple unshipped commits when any of them touch prisma/migrations/, apps/api/, or any service the API runs.

This rule was forced by two production outages:

  • Phase 0124 (2026-05-04) — 10.5h outage. An auto-resolve migration tripped on 41 phantom colors + 1 NULL canonicalStyleCode that hadn’t been caught pre-deploy. Recovery required flyctl mpg proxy + manual data scrub + prisma migrate resolve --rolled-back.
  • Phase 0125 (2026-05-06) — 14 stacked commits, deploy attempted only at the end. Three SQL bugs surfaced sequentially: apparel-only filter, gen_random_uuid inside SELECT DISTINCT, UNIQUE constraint on a many-to-one mapping. Each was a 5-min fix individually but compounded into multiple failed deploy cycles.

The lesson: the deploy is the verification. prisma validate + pnpm typecheck locally don’t catch SQL semantics, pgbouncer transaction boundaries, or prod-shape edge cases. Stacking commits compounds risk — a single broken migration blocks every subsequent deploy via _prisma_migrations failure tracking.

Cadence per CLAUDE.md §1.4.1:

  1. Commit.
  2. Push.
  3. If the commit touched migrations / API / backend services: flyctl deploy and verify release_command ... completed successfully before the next commit.
  4. Pure-frontend / pure-docs commits can stack — Vercel auto-deploys handle them.
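
The cadence above depends on knowing whether a commit touched backend paths. Here is a minimal sketch of that check, assuming the two path prefixes named in the rule are the relevant ones. The function name needs_fly_deploy is hypothetical, and the pattern list would need extending to cover "any service the API runs".

```shell
# Hypothetical helper: reads changed file paths on stdin, one per line.
# Exit status 0 = at least one backend path changed (deploy before the next commit).
# Exit status 1 = only frontend/docs paths changed (commits may stack).
needs_fly_deploy() {
  local path hit=1
  while IFS= read -r path; do
    case "$path" in
      # Assumed prefixes, taken from the deploy-per-commit rule.
      prisma/migrations/*|apps/api/*) hit=0 ;;
    esac
  done
  return "$hit"
}

# Usage (not run here):
#   git diff --name-only HEAD~1 HEAD | needs_fly_deploy \
#     && echo "backend change: flyctl deploy before the next commit"
```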

Exception: when a long-running operator job (e.g. a 21h backfill) is in flight, deploys can be batched at the end to avoid killing the worker — per the 2026-05-19 batch-deploy strategy that kept Phase 0141 (split-tender) un-deployed for ~18h.

The deploy-strategy quirk

--strategy=immediate was forced by the Fly orchestration timeout lesson (docs/operations/fly-deploy-strategy.md):

auto_stop_machines = "stop" + min_machines_running = 1 races the rolling-restart health-check polling on a 2-machine fleet with low pre-launch traffic. release_command always succeeded; the timeout was on flyctl’s polling, not the actual deploy.

Switch back to --strategy=rolling when real customer traffic arrives post-launch.

The min-machines-running quirk

apps/api/fly.toml now has min_machines_running = 2 (bumped 2026-05-19). Was 1; that killed detached operator jobs (Phase 0117b refetch, Phase 0155 Gelato CA backfill) when Fly’s autoscaler saw no HTTP traffic on the worker machine. The Phase 0155 incident lost ~1.5h of backfill before we bumped this. Costs ~$10/mo more; eliminates operator-babysit overhead.
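
Because lowering this value silently reintroduces the killed-job failure mode, a pre-deploy guard could check it. The following is a rough sketch only: the function name min_machines_ok is hypothetical, and it assumes the setting appears once in the file as a plain "min_machines_running = N" line.

```shell
# Hypothetical guard: succeeds only if the fly.toml at $1 sets
# min_machines_running to at least $2 (default 2).
min_machines_ok() {
  local file="$1" want="${2:-2}" have
  # Extract the first integer assigned to min_machines_running, if any.
  have="$(sed -n 's/^[[:space:]]*min_machines_running[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$file" | head -n 1)"
  [ -n "$have" ] && [ "$have" -ge "$want" ]
}

# Usage (not run here):
#   min_machines_ok apps/api/fly.toml 2 || echo "min_machines_running is below 2" >&2
```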

Post-deploy verification

Per docs/operations/migration-lock-recovery.md:

Post-deploy verification = curl /health AFTER rolling restart, not release_command success.

Forced by the 2026-05-18 Phase 0154 two-bug incident — release_command succeeded but the API failed to boot on a Nest DI lazy-type-import bug. Health check is the canonical “yes the system is alive.”

LAUNCH_CHECKLIST.md item: every Fly deploy ends with curl https://api.yourcustommerch.ca/api/v1/health returning 200.
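
Since machines may take a moment to boot after a deploy, a single curl can give a false negative. Here is a minimal sketch of a retrying check against the health URL above. The function name wait_for_health, the retry count, and the delay are assumptions, not values from the runbook.

```shell
# Hypothetical helper: poll a health URL until it returns HTTP 200.
# Arguments: URL, max attempts (default 10), seconds between attempts (default 3).
wait_for_health() {
  local url="$1" attempts="${2:-10}" delay="${3:-3}" i=1 code
  while [ "$i" -le "$attempts" ]; do
    # -w prints only the status code; connection failures yield 000.
    code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url" || true)"
    if [ "$code" = "200" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempts (last status: $code)" >&2
  return 1
}

# Usage (not run here):
#   wait_for_health https://api.yourcustommerch.ca/api/v1/health
```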

The CI gate (CLAUDE.md §1.4.2)

Every push to main runs .github/workflows/ci.yml:

  • Typecheck across all packages (pnpm typecheck)
  • Lint (currency-literal, i18n parity, rules-of-hooks, typedRoutes)
  • Unit tests
  • Build
  • Integration tests (Postgres-backed, since 2026-05-19) — Postgres 15 service container + Prisma db push + pnpm --filter @merchos/api test:integration

The rule: after every push, check the CI run. A red CI badge that gets ignored is no signal at all. Forced by the 2026-05-17 incident where CI was red for 25 commits over ~17 hours because a currency-literal violation went unnoticed.

gh run list --workflow=ci.yml --limit 1 is the one-liner check. CI typically finishes in ~5 min.
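
To use that one-liner as a gate in a script rather than reading it by eye, the output can be reduced to a pass/fail status. This is a sketch under the assumption that "green" means the latest run is completed with conclusion success; the function names ci_latest and ci_is_green are hypothetical.

```shell
# Print "<status> <conclusion>" for the most recent ci.yml run.
ci_latest() {
  gh run list --workflow=ci.yml --limit 1 \
    --json status,conclusion \
    --jq '.[0] | "\(.status) \(.conclusion)"'
}

# Succeeds only when the latest run finished and passed.
# An in-progress run has an empty conclusion and therefore counts as not green.
ci_is_green() {
  [ "$(ci_latest)" = "completed success" ]
}

# Usage (not run here):
#   ci_is_green || echo "CI is red or still running: check before the next push" >&2
```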

Migrations + the CTE-shape rule

Prisma migrations under prisma/migrations/. Naming convention: YYYYMMDD[a-z]_phase_name.

Migrations run via release_command on every Fly deploy — a separate ephemeral machine spins up, runs prisma migrate deploy, then terminates. The API machines only start serving traffic after migrations complete.

The CTE-shape rule (2026-05-18 incident lesson):

Migrations that backfill via multi-table JOIN must use CTE shape (resolve target rows in WITH SELECT, then UPDATE FROM cte). Direct UPDATE T t SET … FROM X x INNER JOIN J j ON j.col = t.col is rejected by Postgres (SQLSTATE 42P01: a JOIN ... ON clause inside FROM cannot reference the UPDATE target t) but passes SELECT-preview gates, because the equivalent SELECT is legal.

Forced by Phase 0154 Sub-phase 0116b: an UPDATE-FROM-INNER-JOIN backfill failed at deploy time, costing ~22 min of downtime. Recovery: rewrote the SQL as a CTE; redeployed; clean.

scripts/migration-dry-run.sh (LAUNCH_CHECKLIST 🔴 gate) flags this pattern + recommends the CTE rewrite. Run against prod via flyctl mpg proxy for any destructive auto-resolve migration.
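
For reference, the CTE shape can be sketched as follows. All table and column names (target_table, join_table, source_table, col, value, new_value) are hypothetical placeholders, not names from the MerchOS schema, and the wrapper function cte_backfill_sql exists only so the SQL can be piped into psql.

```shell
# Hypothetical example: print a backfill written in CTE shape.
cte_backfill_sql() {
  cat <<'SQL'
-- Step 1: resolve target rows with a plain SELECT (joins are legal here).
WITH resolved AS (
  SELECT t.id, x.new_value
  FROM target_table AS t
  INNER JOIN join_table AS j ON j.col = t.col
  INNER JOIN source_table AS x ON x.id = j.source_id
)
-- Step 2: update from the CTE; the target is referenced only in WHERE.
UPDATE target_table AS t
SET value = resolved.new_value
FROM resolved
WHERE t.id = resolved.id;
SQL
}

# Usage (not run here), against a scratch database:
#   cte_backfill_sql | psql "$DATABASE_URL"
```

The design point is that the UPDATE target appears in the FROM list only through the CTE, so no JOIN condition has to reach back to it.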

The migration replay-from-scratch caveat (POST_DEPLOY_BACKLOG.md “Migration history can’t be replayed from scratch”):

A fresh database applying all migrations in lexicographic order hits ERROR: type "PricingModel" does not exist because 20260421_billing_discretion_and_payment_flow sorts BEFORE the enum-creating 20260421_vertical_gated_pricing_model (same date prefix, so the names tie-break alphabetically and billing precedes vertical). Production is fine because migrations were applied in author-time order. Both the CI e2e workflow and integration tests use prisma db push (applies current schema directly) instead of migrate deploy (replays history). The fix is low-priority: disaster recovery restores from pg_dump, not from migration replay.
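
The ordering problem is easy to reproduce with the two directory names alone. This small sketch also shows one way to spot other same-day collisions; the helper names replay_order and shared_date_prefixes are hypothetical, and the second one assumes the prefix is everything before the first underscore.

```shell
# Byte-wise sort, which is the order a from-scratch replay would follow.
replay_order() {
  LC_ALL=C sort
}

# Print each date prefix that is used by more than one migration name.
shared_date_prefixes() {
  cut -d_ -f1 | LC_ALL=C sort | uniq -d
}

printf '%s\n' \
  20260421_vertical_gated_pricing_model \
  20260421_billing_discretion_and_payment_flow | replay_order
# prints:
#   20260421_billing_discretion_and_payment_flow
#   20260421_vertical_gated_pricing_model

# Usage (not run here):
#   ls prisma/migrations | shared_date_prefixes
```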

Migration lock recovery

docs/operations/migration-lock-recovery.md. Prisma’s migration runner takes Postgres advisory lock 72707369 for the duration of migrate deploy. If the runner is killed mid-execution (Fly machine destroyed, network blip, OOM), the lock can stay held by an orphan backend → next deploy hangs forever.

apps/api/entrypoint.sh pre-flight check queries pg_locks WHERE objid=72707369; on stuck-lock, attempts pg_terminate_backend and prints actionable recovery SQL on fail.

Operator recovery:

  1. flyctl mpg proxy 9g6y30w99wlov5ml opens a local tunnel to prod Postgres.
  2. PGPASSWORD=<...> psql -h localhost -p 16380 -U fly-user -d fly-db
  3. SELECT pid, granted FROM pg_locks WHERE objid = 72707369;
  4. SELECT pg_terminate_backend(<pid>); for the granted holder.
  5. Re-deploy.
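
Before terminating anything in step 4, it can help to see what the lock holder is actually doing. The sketch below prints a query that joins pg_locks to pg_stat_activity. The function name lock_holder_sql is hypothetical, and the extra columns are a suggestion rather than part of the documented runbook.

```shell
# Hypothetical helper: print SQL that lists sessions holding or waiting on
# the advisory lock (default: the migration lock id 72707369).
lock_holder_sql() {
  local lock_id="${1:-72707369}"
  cat <<SQL
SELECT l.pid, l.granted, a.state, a.query_start, a.application_name
FROM pg_locks AS l
LEFT JOIN pg_stat_activity AS a ON a.pid = l.pid
WHERE l.locktype = 'advisory'
  AND l.objid = ${lock_id};
SQL
}

# Usage (not run here): pipe into the psql invocation from step 2, e.g.
#   lock_holder_sql | psql -h localhost -p 16380 -U fly-user -d fly-db
```

A holder whose state is idle with an old query_start is the orphan-backend case described above; an active holder may be a migration that is still legitimately running.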

Hit this during Phase 0155 (2026-05-19) when the release_command machine got killed by the flyctl orchestration timeout while still holding the lock; documented above.

Postgres + pgbouncer

Fly Managed Postgres (flympg) cluster 9g6y30w99wlov5ml in yyz. App connects via pgbouncer (pgbouncer.9g6y30w99wlov5ml.flympg.net:5432) — session mode, not transaction mode (per docs/operations/db-connection.md).

Session mode is required because Prisma’s $transaction({ isolationLevel }) + FOR UPDATE locks need a single backend across the whole transaction. Transaction-mode pgbouncer would hand the next statement to a different backend mid-transaction — silently breaking the FOR UPDATE lock on SpendRecord.

This is the most subtle operational fact in MerchOS. Switching pgbouncer modes silently breaks the spend-limit invariant; tests would still pass (they go direct to Postgres in CI); production would have intermittent over-cap orders that no log would explain.

Backups + PITR

docs/operations/db-recovery.md. Fly Managed Postgres has built-in point-in-time recovery with daily snapshots + 7-day retention. PITR window: 5 min granularity within the last 7 days.

Operator drill: flyctl postgres restore against a target timestamp. The restore creates a NEW Postgres cluster; the operator swaps the DATABASE_URL secret + restarts the API.

Pre-launch we ran one DR drill (Phase 9). Post-launch the POST_DEPLOY_BACKLOG.md “Pre-revenue reliability hardening” item plans a quarterly drill.

Integration tests

docs/operations/integration-tests.md. Phase 0155 (2026-05-19) added Postgres-backed integration tests.

When to write one vs a unit test:

  • Unit test (matches *.spec.ts, mocks Prisma) — pure transformation, validators, formatters, predicates.
  • Integration test (matches *.int.spec.ts, real Prisma) — anything where correctness depends on database semantics: FK CASCADE / NoAction, SQL JOINs through Prisma include, multi-row state diffs, materialized views, multi-table semantics.

Running locally:

```shell
# Start a Postgres container (one-time)
docker run -d --name merchos-test-pg \
  -e POSTGRES_USER=merchos -e POSTGRES_PASSWORD=dev \
  -e POSTGRES_DB=merchos_test -p 5432:5432 postgres:15

# Run the suite
TEST_DATABASE_URL=postgresql://merchos:dev@localhost:5432/merchos_test \
  pnpm --filter @merchos/api test:integration
```

TEST_DATABASE_URL is required — globalSetup refuses to run without it (prevents accidentally pointing at prod). setupIntegrationDb() registers a beforeEach that TRUNCATEs every non-_prisma_migrations table; maxWorkers: 1 serializes runs.
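
A freshly started container may not accept connections for a few seconds, so a local wrapper could wait for readiness and apply the same missing-variable guard before invoking the suite. This is a sketch only: the function names are hypothetical, the container, user, and database names are the ones from the local setup commands in this section, and the shell guard merely mirrors what globalSetup is described as doing.

```shell
# Wait until Postgres inside the local test container accepts connections.
# Argument: max tries, one second apart (default 30).
wait_for_test_pg() {
  local tries="${1:-30}" i=1
  while [ "$i" -le "$tries" ]; do
    if docker exec merchos-test-pg pg_isready -U merchos -d merchos_test >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  return 1
}

# Refuse to run when TEST_DATABASE_URL is unset or empty.
run_integration_tests() {
  if [ -z "${TEST_DATABASE_URL:-}" ]; then
    echo "TEST_DATABASE_URL is not set; refusing to run integration tests" >&2
    return 1
  fi
  pnpm --filter @merchos/api test:integration
}

# Usage (not run here):
#   wait_for_test_pg && run_integration_tests
```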

Today’s coverage: 32 tests across 5 suites (master.service.int.spec.ts, sku-variant-cell-state.service.int.spec.ts, spend-settings.int.spec.ts, split-tender-webhook.int.spec.ts, split-tender-refund.int.spec.ts). More to add — see POST_DEPLOY_BACKLOG.md “Test framework rollout.”

Observability — current state

Honest answer: partial.

| Layer | Current | Gap |
| --- | --- | --- |
| Application logs | Fly’s stdout (flyctl logs), JSON-line | No log aggregation; grep-on-Fly only |
| Application error tracking | Sentry wired into the API (docs/operations/sentry-setup.md) | Frontends use Sentry too; integration is solid |
| Uptime monitoring | docs/operations/uptime-monitoring-setup.md | External monitor pinging /health |
| Stripe webhook delivery | Stripe dashboard | No proactive alert on delivery failures |
| Cron job execution | @Cron logs in stdout | No “this cron didn’t run today” alert |
| Catalog health (sync staleness, unresolvable CuratedStyles) | PlatformInvariantService boot-time check | Runtime drift goes undetected until next boot |
| Margin drift (Gelato cost increases) | Not implemented | Post-launch; see POST_DEPLOY_BACKLOG.md |
| Performance / latency | Vercel Analytics on frontends; nothing on API | No APM |

Pre-launch we’re accepting these gaps. Post-launch the POST_DEPLOY_BACKLOG.md “Catalog health observability dashboard” + “Margin-erosion alert layer” items address the most operationally critical ones.

Runbooks

The docs/operations/ directory is the on-call reference:

| Runbook | When you read this |
| --- | --- |
| db-connection.md | Touching pgbouncer config or debugging transaction-mode behavior |
| db-recovery.md | PITR restore; data-loss event |
| fly-deploy-strategy.md | Deploy hangs / health-check polling timeout |
| integration-tests.md | Adding a *.int.spec.ts suite or debugging an integration test |
| migration-lock-recovery.md | Stuck migration lock (pg_locks objid=72707369) |
| r2-cors-config.md | Cloudflare R2 bucket CORS issues |
| sentry-setup.md | Sentry configuration / DSN rotation |
| tier-swap-compensation.md | Tier-swap rolled back; need to manually compensate Stripe |
| uptime-monitoring-setup.md | Uptime monitor not paging on outage |

Plus FIRST_ORDER_RUNBOOK.md at repo root for the first-real-customer end-to-end walk + the 168-line Printful supplement (added 2026-05-19).

Triggers for update

Update this chapter if you:

  • Add or remove a CLAUDE.md operational rule (today: §1.4.1 deploy-per-commit, §1.4.2 CI verification).
  • Change Fly fly.toml topology (min_machines_running, region, machine size).
  • Change pgbouncer mode (session → transaction or back) — this would be a regression on the spend-limit invariant; flag it.
  • Add or change the deploy strategy (immediate ↔ rolling).
  • Add a new operational runbook to docs/operations/.
  • Land a new observability layer (APM, log aggregation, cron-execution alerting).
  • Change the CTE-shape migration rule.
  • Add or remove a CI gate.
  • Change the integration-test pattern (today: *.int.spec.ts + prisma db push + maxWorkers=1).