Operations
Runbooks, health checks, dashboard routing, worker automation, and admin bootstrap
Core runbooks
- Release and rollback: docs/release-rollback-runbook.md
- Incident response: docs/incident-response-playbook.md
- Go-live checklist: docs/go-live-checklist.md
- Secret rotation drill: docs/secret-rotation-drill.md
- Backup/restore drill: docs/backup-restore-drill.md
- Pilot provider onboarding: docs/pilot-provider-onboarding-runbook.md
- Environment and secrets matrix: docs/environment-secrets-matrix.md
- Supabase bootstrap checklist: docs/supabase-bootstrap.md
Health checks
- Basic: /api/health
- Optional DB check: /api/health/db (requires HEALTHCHECK_TOKEN and request header x-healthcheck-token)
- Optional auth check: /api/health/auth (requires HEALTHCHECK_TOKEN and request header x-healthcheck-token)
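A minimal sketch of how a monitoring script might build these requests. The endpoint paths and the x-healthcheck-token header come from the list above; the helper name buildHealthRequest and its return shape are hypothetical and not taken from the codebase.

```typescript
// Hypothetical helper: describes a health-check request without sending it.
type HealthPath = "/api/health" | "/api/health/db" | "/api/health/auth";
type HealthRequest = { url: string; headers: Record<string, string> };

function buildHealthRequest(baseUrl: string, path: HealthPath, token?: string): HealthRequest {
  // Strip trailing slashes so base + path never produces "//".
  const url = baseUrl.replace(/\/+$/, "") + path;
  const headers: Record<string, string> = {};
  // Per the list above, only the DB and auth checks need the token header.
  if (path !== "/api/health") {
    if (!token) {
      throw new Error(`${path} requires HEALTHCHECK_TOKEN`);
    }
    headers["x-healthcheck-token"] = token;
  }
  return { url, headers };
}
```

The result can be passed to any HTTP client, for example fetch(req.url, { headers: req.headers }).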
RLS session context
Baseline RLS policies expect per-request session variables (set via set_config(...)). The web app sets these inside DB transactions using src/db/rls.ts (e.g., on /dashboard/admin).
If you want RLS to be a hard enforcement boundary, ensure DATABASE_URL uses a non-owner database role so it cannot bypass RLS.
For Supabase pooler runtime URLs with custom roles, use port 6543 (transaction mode).
Notes:
- user_invitations, users, and user_identities are RLS-protected.
- worker_heartbeats is RLS-protected; worker writes and worker-health reads run with internal system RLS context.
- Signup-time invite checks and domain user bootstrap run with an internal system RLS role set server-side in auth flows.
- user_invitations writes are role-shape constrained by RLS (provider roles require provider_id; attorneys require firm_id; client/platform admin invites are unscoped).
- users writes are role-shape constrained by RLS with the same scope rules (provider_admin/counselor -> provider-scoped, attorney -> firm-scoped, client/platform_admin -> unscoped).
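To illustrate the per-request session variables described above, here is a sketch that builds the set_config statements a request might run at the start of its transaction. The setting names (app.user_id, app.role) and the RlsContext shape are assumptions for illustration; the real ones are defined in src/db/rls.ts.

```typescript
// Hypothetical shape of the per-request context.
type RlsContext = { userId: string; role: string };
type Statement = { text: string; values: string[] };

function rlsContextStatements(ctx: RlsContext): Statement[] {
  // Setting names are illustrative assumptions, not the project's actual keys.
  const settings: Array<[string, string]> = [
    ["app.user_id", ctx.userId],
    ["app.role", ctx.role],
  ];
  // is_local = true scopes each value to the current transaction, so it is
  // discarded on COMMIT/ROLLBACK and cannot leak across pooled connections.
  return settings.map(([name, value]) => ({
    text: "select set_config($1, $2, true)",
    values: [name, value],
  }));
}
```

Because the values are transaction-local, the statements have to run after BEGIN on the same connection as the queries they protect, which is also why a transaction-mode pooler is compatible with this pattern.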
Runtime security guardrail
Run this in staging/production runtime context to verify non-owner DB role safety, required RLS enablement, and worker_heartbeats system read/write behavior:
npm run ops:verify-runtime-security
Notes:
- The command expects DATABASE_URL to match the deployed runtime role (non-owner).
- Use --allow-owner only for local/admin checks where owner credentials are intentional.
Deployment health guardrail
Run this after deploy to validate web health plus worker health behavior:
npm run ops:verify-deployment-health -- --base-url https://staging.certaos.com
Notes:
- Prefer passing --base-url; otherwise the command resolves from HEALTHCHECK_BASE_URL, BETTER_AUTH_BASE_URL, BETTER_AUTH_URL, NEXT_PUBLIC_BETTER_AUTH_URL, or VERCEL_URL.
- Worker health mode defaults to auto:
  - token mode when HEALTHCHECK_TOKEN is available
  - anonymous mode otherwise
- Use --worker-health-mode anonymous to always verify /api/health/worker is protected (HTTP 404), regardless of local token values.
- Use --worker-health-mode token to require token-based worker health verification (HTTP 200 + { worker: true }).
- Optional override: --healthcheck-token <token> to pass a one-off token without changing local env files.
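The mode selection above can be sketched as a small pure function. This is an illustration of the documented behavior only; the function name and result shape are hypothetical, and the real command may differ in details such as how a blank token is treated.

```typescript
type WorkerHealthMode = "auto" | "token" | "anonymous";
type WorkerHealthExpectation =
  | { mode: "token"; token: string; expectStatus: 200 }
  | { mode: "anonymous"; expectStatus: 404 };

function resolveWorkerHealth(
  requested: WorkerHealthMode,
  token: string | undefined,
): WorkerHealthExpectation {
  // auto: token mode when a token is available, anonymous otherwise.
  const mode = requested === "auto" ? (token ? "token" : "anonymous") : requested;
  if (mode === "anonymous") {
    // Anonymous callers should see /api/health/worker as hidden (HTTP 404).
    return { mode, expectStatus: 404 };
  }
  if (!token) {
    throw new Error("token mode requires HEALTHCHECK_TOKEN or --healthcheck-token");
  }
  // Token callers should get HTTP 200 with a body containing { worker: true }.
  return { mode, token, expectStatus: 200 };
}
```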
Performance baseline guardrail
Run this after deploy (or on a schedule) to verify endpoint availability and latency baseline:
npm run ops:verify-performance -- --base-url https://staging.certaos.com
Notes:
- Default endpoints: /, /api/health, /providers
- Default sample count: 3 requests per endpoint
- Default thresholds:
  - --max-error-rate 0
  - --max-avg-ms 2500
  - --max-p95-ms 5000
- Override targets and thresholds as needed:
  - --endpoints /,/api/health,/docs
  - --samples 5
  - --max-avg-ms 3500 --max-p95-ms 7000
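As a rough illustration of how the three thresholds combine, the sketch below evaluates one endpoint's samples against the defaults listed above. The nearest-rank p95 method, the inclusion of failed requests in the latency figures, and all names are assumptions; the actual command may compute these differently.

```typescript
type Sample = { ok: boolean; ms: number };
type Thresholds = { maxErrorRate: number; maxAvgMs: number; maxP95Ms: number };

// Defaults taken from the notes above.
const DEFAULT_THRESHOLDS: Thresholds = { maxErrorRate: 0, maxAvgMs: 2500, maxP95Ms: 5000 };

function evaluateEndpoint(samples: Sample[], t: Thresholds = DEFAULT_THRESHOLDS) {
  if (samples.length === 0) throw new Error("at least one sample is required");
  const errorRate = samples.filter((s) => !s.ok).length / samples.length;
  const sorted = samples.map((s) => s.ms).sort((a, b) => a - b);
  const avgMs = sorted.reduce((sum, ms) => sum + ms, 0) / sorted.length;
  // Nearest-rank p95 (assumed): smallest value with at least 95% of samples at or below it.
  const p95Ms = sorted[Math.ceil(0.95 * sorted.length) - 1];
  const pass = errorRate <= t.maxErrorRate && avgMs <= t.maxAvgMs && p95Ms <= t.maxP95Ms;
  return { errorRate, avgMs, p95Ms, pass };
}
```

With the default of 3 samples, nearest-rank p95 is simply the slowest request, so a single slow response can fail the check.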
GitHub Actions deployment checks
Use workflow .github/workflows/deployment-health.yml for recurring/manual deploy smoke checks across staging and production domains.
If a check fails, the workflow automatically opens/updates a GitHub incident issue and auto-closes it after both environments recover.
Use workflow .github/workflows/performance-baseline.yml for recurring/manual staging+production performance baseline checks (with the same incident open/update/close pattern).
Optional GitHub repository secrets (for token-auth worker health validation):
- STAGING_HEALTHCHECK_TOKEN
- PRODUCTION_HEALTHCHECK_TOKEN
Migration/admin DB URL
For schema migrations and RLS SQL apply, keep admin credentials separate from runtime:
- DATABASE_URL: runtime app/worker role (non-owner)
- MIGRATION_DATABASE_URL: owner/admin role used by npm run db:migrate and npm run db:rls:apply
MIGRATION_DATABASE_SSL_INSECURE is optional and falls back to DATABASE_SSL_INSECURE when unset.
Provider marketplace
Public provider pages (like /providers and /:providerSlug/enroll) rely on a public-read RLS allowance for providers where approval_status='approved' and is_active=true.
/:providerSlug/enroll supports invite-based enrollment resume by accepting an existing enrollment ID and redirecting to /course/<enrollmentId>.
It also supports direct public intake capture via Start New Enrollment, which persists enrollment_requests rows for provider triage (individual or joint filing mode).
Provider applications
Prospective providers can submit an application at /become-a-provider. Platform admins review these in /dashboard/admin under Provider Applications:
- Update the application status (submitted, under_review, approved, denied)
- Use Create Provider to create and link a providers row from the application (defaults approval_status='under_review' so it is not public until explicitly approved/activated)
- Use Approve + Activate to set the linked provider to approval_status='approved' and is_active=true (makes it eligible to show on /providers)
- Use Invite Provider Admin to create a provider_admin invite for the application contact email (provider-scoped)
Phase 0 operational baseline
- Branch protection enabled on main
- CI checks required before merge
- Docs updates required for code changes
- Vercel deployments created for staging/prod (use per-deployment VERCEL_URL for auth origin on previews)
Dashboard routing
/dashboard is a role router: after sign-in/sign-up it routes to the correct role dashboard (e.g. /dashboard/provider, /dashboard/admin).
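A sketch of the role-to-dashboard mapping, using the role names and dashboard paths that appear elsewhere in this document. The constant and function names are hypothetical, and the real router may handle extra cases (for example, users without a role) that are not shown here.

```typescript
type Role = "platform_admin" | "provider_admin" | "counselor" | "attorney" | "client";

// Each role lands on the dashboard named in its UI section of this document.
const DASHBOARD_BY_ROLE: Record<Role, string> = {
  platform_admin: "/dashboard/admin",
  provider_admin: "/dashboard/provider",
  counselor: "/dashboard/counselor",
  attorney: "/dashboard/attorney",
  client: "/dashboard/client",
};

function dashboardPathFor(role: Role): string {
  return DASHBOARD_BY_ROLE[role];
}
```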
Course sessions
When a client clicks Begin Course on /course/<enrollmentId>, the app creates a course_sessions row (scaffold) and transitions the enrollment to in_progress.
Before Accept Invite transitions an enrollment from invited -> enrolled, the client must acknowledge disclosures and satisfy payment/waiver preconditions:
- disclosure acknowledgment is recorded to audit_logs
- payment is either waived (enrollments.has_fee_waiver=true) or confirmed by a successful payments row (payments.status='succeeded')
- if an enrollment is part of a joint household (enrollments.household_id), one successful payment in that household satisfies payment preconditions for both enrollments
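The payment precondition can be expressed as a single predicate. The sketch below works on in-memory records whose camelCase fields mirror the columns named above; the types and function name are illustrative, and the application presumably evaluates the same rule in SQL.

```typescript
type Enrollment = { id: string; hasFeeWaiver: boolean; householdId: string | null };
type Payment = { enrollmentId: string; status: string };

function paymentPreconditionMet(
  target: Enrollment,
  enrollments: Enrollment[],
  payments: Payment[],
): boolean {
  // A fee waiver satisfies the precondition on its own.
  if (target.hasFeeWaiver) return true;
  // Enrollments whose successful payment counts: the target itself, plus
  // every enrollment sharing its household when one is set.
  const covered = new Set<string>([target.id]);
  if (target.householdId !== null) {
    for (const e of enrollments) {
      if (e.householdId === target.householdId) covered.add(e.id);
    }
  }
  return payments.some((p) => p.status === "succeeded" && covered.has(p.enrollmentId));
}
```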
Stripe course-fee checkout now uses:
- checkout start from /course/<enrollmentId>
- webhook confirmation endpoint: /api/payments/stripe/webhook
- env vars: STRIPE_SECRET_KEY, STRIPE_WEBHOOK_SECRET, and optional COURSE_FEE_CENTS_DEFAULT (defaults to 5000, capped at 5000)
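One way the default-and-cap rule for COURSE_FEE_CENTS_DEFAULT could be applied is sketched below. The 5000 default and cap come from the list above; falling back to the default for missing, non-integer, or non-positive values is an assumption, as is the function name.

```typescript
const COURSE_FEE_DEFAULT_CENTS = 5000;
const COURSE_FEE_CAP_CENTS = 5000;

function resolveCourseFeeCents(raw: string | undefined): number {
  const parsed = raw === undefined ? NaN : Number(raw);
  // Assumption: anything that is not a positive integer falls back to the default.
  if (!Number.isInteger(parsed) || parsed <= 0) return COURSE_FEE_DEFAULT_CENTS;
  // Values above the cap are clamped rather than rejected.
  return Math.min(parsed, COURSE_FEE_CAP_CENTS);
}
```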
Temporary pre-Stripe fallback:
- if MANUAL_PAYMENT_FALLBACK_ENABLED=true and STRIPE_SECRET_KEY is unset, invited clients can record a manual payment on /course/<enrollmentId>
- this writes a real payments row with status='succeeded' (so compliance preconditions still rely on persisted payment state)
- keep this disabled in production once Stripe is live
While an in-progress course page is open, the client sends a best-effort heartbeat to /api/course/heartbeat to update course_sessions.last_seen_at and increment total_seconds (capped per ping).
If the gap between heartbeats is >120 seconds, the gap is treated as idle time and does not count toward total_seconds (recorded in idle_seconds).
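The heartbeat accounting can be sketched as a pure update function. The 120-second idle threshold is from the text above; the 60-second per-ping cap is a placeholder because the real cap is not stated here, and the type and function names are hypothetical.

```typescript
const IDLE_GAP_SECONDS = 120; // from the text above
const MAX_CREDIT_PER_PING_SECONDS = 60; // hypothetical cap; the real value is not documented here

// lastSeenAt is in epoch seconds.
type SessionClock = { lastSeenAt: number; totalSeconds: number; idleSeconds: number };

function applyHeartbeat(s: SessionClock, nowSeconds: number): SessionClock {
  // Guard against clock skew producing a negative gap.
  const gap = Math.max(0, nowSeconds - s.lastSeenAt);
  if (gap > IDLE_GAP_SECONDS) {
    // Long gap: record it as idle time, not as course time.
    return { lastSeenAt: nowSeconds, totalSeconds: s.totalSeconds, idleSeconds: s.idleSeconds + gap };
  }
  // Normal ping: credit the elapsed time, capped per ping.
  const credit = Math.min(gap, MAX_CREDIT_PER_PING_SECONDS);
  return { lastSeenAt: nowSeconds, totalSeconds: s.totalSeconds + credit, idleSeconds: s.idleSeconds };
}
```

Computing the gap from a server-side timestamp keeps the client from inflating total_seconds by sending pings faster or with forged durations.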
Minimum time gating is enforced using server-calculated time:
- CC: 60 minutes (3,600 seconds)
- DE: 120 minutes (7,200 seconds)
If a course action is blocked by a precondition, the course page shows a reason (e.g. cc_minimum_time_required) and the verified-time vs required-time snapshot.
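A minimal sketch of the minimum-time gate and the snapshot it returns. The required durations and the cc_minimum_time_required code come from the text above; the DE reason code is an assumed analogue, and the function name and result shape are hypothetical.

```typescript
type CourseType = "CC" | "DE";

// Required verified time per course type, in seconds.
const REQUIRED_SECONDS: Record<CourseType, number> = { CC: 3600, DE: 7200 };

function minimumTimeCheck(course: CourseType, verifiedSeconds: number) {
  const requiredSeconds = REQUIRED_SECONDS[course];
  const met = verifiedSeconds >= requiredSeconds;
  return {
    met,
    // "de_minimum_time_required" is assumed by analogy with the CC code.
    reason: met ? null : `${course.toLowerCase()}_minimum_time_required`,
    verifiedSeconds,
    requiredSeconds,
  };
}
```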
Audit logs
Enrollment status transitions are recorded as audit_logs rows. Platform admins can review recent events in /dashboard/admin under Recent Audit Logs.
Email notifications
Invite and certificate notification attempts are recorded in notification_deliveries and shown in /dashboard/admin under Recent Notification Deliveries.
- Provider: Resend (RESEND_API_KEY)
- Sender identity: EMAIL_FROM (falls back to CertaOS <noreply@send.certaos.com>)
- Current events:
  - user.invite.* (platform admin/provider counselor/admin invite flows)
  - enrollment.client_invite (attorney Invite Client action)
  - certificate.issued (worker certificate issuer and manual Issue Certificate action)
Email smoke test
Use this command to validate Resend credentials and verify a delivery attempt is persisted:
npm.cmd run ops:test-email -- --env staging --to you@example.com
npm.cmd run ops:test-email -- --env production --to you@example.com
Notes:
- --env supports local, staging, and production and auto-selects .env.<env>.local.
- Optional override: --dotenv <path> for custom env file locations.
Admin Bootstrap (Invites and Roles)
You can manage invites either via the platform admin page or via CLI scripts.
Resetting test accounts (Staging/Dev)
If staging signup says "user already exists" and you don't know the password yet (no reset-email flow configured), you can delete the Better Auth identity so the email can sign up again:
$env:DOTENV_CONFIG_PATH='.env.staging.local'
node scripts/reset-auth-user.js --email you@example.com --yes true --delete-domain true
Then re-issue an invite (if needed) and sign up again at /sign-up.
Seeding demo data (Staging)
To populate staging with a demo provider + sample entities (firm/attorney/client/enrollment) and invites for test accounts:
$env:DOTENV_CONFIG_PATH='.env.staging.local'
npm run seed:staging
Platform Admin UI
- Sign in as a platform_admin, then open /dashboard/admin.
- Use Create Invite to allow a specific email to sign up in production (unless ALLOW_PUBLIC_SIGNUP=true).
- Use Billing Snapshot to monitor payment state (succeeded, pending, failures), manual fallback payments, and fee-waiver counts.
- Billing Snapshot also surfaces household rollups (Joint Enrollments, Joint Households, Paid Households).
- Use Recent Payments for record-level review (provider/client/enrollment, source, status, and classification).
- Recent Payments includes a household column to track joint-pair payment coverage.
- Use Recent Households to monitor grouped joint cases (members, statuses, payment coverage, and direct link to course records).
- Use Recent Public Enrollment Requests to monitor consumer intake submitted from provider public enrollment pages.
- Intake request status can be updated inline (submitted, reviewing, invited, rejected, cancelled) from both admin and provider dashboards.
- Platform admins and provider admins (within provider scope) can use Convert + Invite to turn a request into active enrollment record(s) and send enrollment invite emails.
- Payments and households support query filters (payments_q, households_q) to speed operational triage.
- For role-scoped onboarding:
  - provider_admin and counselor invites should include a Provider.
  - attorney invites should include a Firm.
- Use Providers to approve/activate providers for public display in /providers.
- Use DE Court Filing Queue to review DE enrollments awaiting filing (status cert_issued) and click Queue Now to force an immediate worker retry.
- Use DE Deadlines to review DE enrollments with deadlines inside 30 days (or overdue), including the last alert level emitted by the worker. Use Save to edit a deadline and reset alerts.
- Use Revoke to remove a pending invite.
Provider Admin UI
- Sign in as a provider_admin, then open /dashboard/provider.
- Use Invite Counselor to create counselor invites scoped to your provider.
- Use Public Enrollment Requests to triage intake requests submitted from your public /:providerSlug/enroll page.
- Update request status inline as intake progresses (submitted -> reviewing -> invited/rejected/cancelled).
- Use Convert + Invite to create direct-intake enrollment record(s) and send client invite email(s) from provider triage.
- View recent enrollments for your provider and open the counselor queue.
- Recent enrollments now include household-aware payment labels (waived, household paid, paid, client pays) and household IDs when present.
- Provider recent enrollments now include an enrollments_q search filter for household/client/payment triage.
- Use DE Deadlines to view DE items due within 30 days (or overdue) for your provider, including the last alert level sent by the worker.
- Use DE Court Filing Queue to view DE enrollments in cert_issued that are not yet filed with court (escalate to a platform admin to re-queue).
Attorney UI
- Sign in as an attorney, then open /dashboard/attorney.
- Use Create Enrollment (client email is required).
- For joint cases:
  - Set Filing Mode to joint.
  - Provide spouse/joint-filer identity fields (first/last name, DOB, SSN last 4, email).
  - The app links the two client records (clients.joint_filer_id) and creates a second enrollment for the spouse (separate certificate path).
  - The two enrollments are grouped by enrollments.household_id so a single payment unlocks invite acceptance for both filers.
- For DE enrollments:
  - Chapter 7 requires either 341 Meeting Date or DE Filing Deadline.
  - If Chapter 7 deadline is blank and 341 date is set, the app auto-calculates de_filing_deadline = meeting_341_date + 60 days.
  - Chapter 11/13 requires an explicit DE Filing Deadline.
- In Recent Enrollments, use Invite Client to create/refresh a user_invitations row for that email (role client).
- For joint household enrollments, use Invite Household to sync invites for both spouse enrollments in one action.
- Payment labels in Recent Enrollments reflect household scope:
  - waived
  - household paid (a spouse enrollment has a successful payment)
  - paid (this enrollment has a successful payment)
  - client pays
- The client can then sign up at /sign-up (production is invite-only by default), view enrollments at /dashboard/client, and open the course at /course/<enrollmentId>.
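The DE deadline rules in the list above can be sketched as follows. The 60-day offset and the per-chapter requirements are from this document; the function name, the "YYYY-MM-DD" string format, and the choice to do the date arithmetic in UTC are assumptions.

```typescript
type Chapter = 7 | 11 | 13;

function resolveDeFilingDeadline(
  chapter: Chapter,
  meeting341Date: string | null, // "YYYY-MM-DD"
  deFilingDeadline: string | null, // "YYYY-MM-DD"
): string {
  // An explicit deadline always wins.
  if (deFilingDeadline) return deFilingDeadline;
  if (chapter !== 7) {
    throw new Error("Chapter 11/13 requires an explicit DE Filing Deadline");
  }
  if (!meeting341Date) {
    throw new Error("Chapter 7 requires a 341 Meeting Date or a DE Filing Deadline");
  }
  // Parse as UTC midnight and add calendar days in UTC so DST shifts cannot move the date.
  const d = new Date(`${meeting341Date}T00:00:00Z`);
  d.setUTCDate(d.getUTCDate() + 60);
  return d.toISOString().slice(0, 10);
}
```

For example, a Chapter 7 enrollment with a 341 meeting on 2024-01-15 and no explicit deadline resolves to 2024-03-15.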
Counselor UI
- Sign in as a counselor, then open /dashboard/counselor.
- Work items appear when a client submits an in-progress enrollment for counselor/escalation review from /course/<enrollmentId>.
- Use Complete to transition the enrollment to completed.
- The worker automatically issues certificates by transitioning completed -> cert_issued (certificate generation is a scaffold until CGS integration lands).
- If automation is disabled, you can still use Issue Certificate on /course/<enrollmentId> to transition completed -> cert_issued.
- Issued certificates show up on /dashboard/client, /dashboard/attorney, and /dashboard/provider (issued-at timestamp).
CLI (Local/Staging)
From repo root:
- Create or refresh an invite:
npm run user:invite -- --email you@example.com --role platform_admin --expires-in-days 14
- Optional scoping: --provider-id <uuid|null> / --firm-id <uuid|null>
- Promote an existing domain user (after signing in once and visiting /dashboard):
npm run user:set-role -- --email you@example.com --role platform_admin
- Optional scoping: --provider-id <uuid|null> / --firm-id <uuid|null>
Role scope rules (enforced by scripts + RLS):
- provider_admin / counselor: require provider_id and no firm_id
- attorney: require firm_id and no provider_id
- client / platform_admin: must be unscoped (provider_id and firm_id null)
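The scope rules above translate directly into a validation function. This sketch shows the rule table only; the function name and error messages are hypothetical and are not the scripts' actual output.

```typescript
type Role = "platform_admin" | "provider_admin" | "counselor" | "attorney" | "client";

// Returns null when the scope is valid, otherwise a short error message.
function roleScopeError(
  role: Role,
  providerId: string | null,
  firmId: string | null,
): string | null {
  switch (role) {
    case "provider_admin":
    case "counselor":
      if (providerId === null) return `${role} requires provider_id`;
      if (firmId !== null) return `${role} must not have firm_id`;
      return null;
    case "attorney":
      if (firmId === null) return "attorney requires firm_id";
      if (providerId !== null) return "attorney must not have provider_id";
      return null;
    case "client":
    case "platform_admin":
      return providerId === null && firmId === null ? null : `${role} must be unscoped`;
  }
}
```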
These scripts read DATABASE_URL from .env.local by default (override with DOTENV_CONFIG_PATH).
Vercel Env Notes (Windows)
When using vercel env add from PowerShell, piped values may include a trailing newline. The app trims DATABASE_URL and DATABASE_SSL_INSECURE at runtime, but it is still best to paste values directly in Vercel's UI or ensure your CLI input has no extra whitespace.
Worker automation (Certificates)
The worker runs a certificate issuer cron that scans for completed enrollments and issues certificates (cert_issued) in an audited, idempotent way.
Requirements:
- Railway deploys this repo with railway.json so the process starts as npm run worker.
- Create a dedicated domain user to act as the worker (role platform_admin):
$env:DOTENV_CONFIG_PATH='.env.staging.local'
npm.cmd run worker:create-actor
- Set WORKER_ACTOR_USER_ID and WORKER_ACTOR_ROLE=platform_admin in the worker environment (staging and production).
Runtime behavior and tuning
- The issuer retries transient transition failures with exponential backoff.
- If a stale backlog exists (completed older than SLA threshold), the worker logs a warning and emits a periodic alert.
- Optional webhook alert sink: WORKER_ALERT_WEBHOOK_URL (JSON POST payload).
- Optional Telegram alert sink: WORKER_ALERT_TELEGRAM_BOT_TOKEN + WORKER_ALERT_TELEGRAM_CHAT_ID.
- Optional quiet-hours suppression for backlog alerts (UTC-based).
- Optional once-daily digest alert (UTC hour/minute).
Config knobs (worker env):
- CERT_ISSUER_POLL_INTERVAL_MS (default 60000)
- CERT_ISSUER_BATCH_SIZE (default 25)
- CERT_ISSUER_MAX_BATCHES_PER_TICK (default 5)
- CERT_ISSUER_RETRY_ATTEMPTS (default 3)
- CERT_ISSUER_RETRY_BASE_DELAY_MS (default 500)
- CERT_ISSUER_STALE_COMPLETED_SLA_HOURS (default 6)
- CERT_ISSUER_ALERT_COOLDOWN_MS (default 900000)
- CERT_ISSUER_ALERT_QUIET_HOURS_ENABLED (default false)
- CERT_ISSUER_ALERT_QUIET_HOURS_START_UTC (default 0)
- CERT_ISSUER_ALERT_QUIET_HOURS_END_UTC (default 0)
- CERT_ISSUER_DAILY_DIGEST_ENABLED (default false)
- CERT_ISSUER_DAILY_DIGEST_HOUR_UTC (default 14)
- CERT_ISSUER_DAILY_DIGEST_MINUTE_UTC (default 0)
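To give a feel for how CERT_ISSUER_RETRY_ATTEMPTS and CERT_ISSUER_RETRY_BASE_DELAY_MS interact, the sketch below computes a backoff schedule. The doubling factor, the absence of jitter, and counting the first try as one of the attempts are all assumptions; the worker's actual schedule may differ.

```typescript
// Delays (in ms) to wait between consecutive attempts.
function retryDelaysMs(attempts = 3, baseDelayMs = 500): number[] {
  const delays: number[] = [];
  // One delay between each pair of consecutive attempts, doubling each time.
  for (let i = 0; i < attempts - 1; i++) {
    delays.push(baseDelayMs * 2 ** i);
  }
  return delays;
}
```

Under these assumptions the defaults wait 500 ms and then 1000 ms before giving up, so a persistently failing transition costs about 1.5 seconds of waiting per enrollment.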
Quiet-hours notes:
- Quiet hours only suppress immediate backlog alerts (certificate_backlog_alert).
- Daily digest (certificate_backlog_daily_digest) still sends on schedule when enabled.
- START_UTC == END_UTC is treated as no quiet window.
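The quiet-hours rules can be sketched as two small predicates. The equal-start-and-end rule and the digest exemption are from the notes above; treating the end hour as exclusive and letting a window wrap past midnight when start is later than end are assumptions, as are the function names.

```typescript
type AlertKind = "certificate_backlog_alert" | "certificate_backlog_daily_digest";

function inQuietHours(hourUtc: number, startUtc: number, endUtc: number): boolean {
  // Equal start and end means there is no quiet window at all.
  if (startUtc === endUtc) return false;
  // Same-day window, e.g. 1 -> 6 (end hour assumed exclusive).
  if (startUtc < endUtc) return hourUtc >= startUtc && hourUtc < endUtc;
  // Assumption: a start later than the end wraps past midnight, e.g. 22 -> 6.
  return hourUtc >= startUtc || hourUtc < endUtc;
}

function shouldSuppress(
  kind: AlertKind,
  quietEnabled: boolean,
  hourUtc: number,
  startUtc: number,
  endUtc: number,
): boolean {
  // Only immediate backlog alerts are ever suppressed; the digest always goes out.
  if (kind !== "certificate_backlog_alert" || !quietEnabled) return false;
  return inQuietHours(hourUtc, startUtc, endUtc);
}
```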
Telegram wiring
- Create a bot with @BotFather and copy the bot token.
- Add the bot to your target channel/group (or DM it first).
- Capture the chat ID and set:
  - WORKER_ALERT_TELEGRAM_BOT_TOKEN
  - WORKER_ALERT_TELEGRAM_CHAT_ID
The worker sends Telegram messages for certificate backlog alerts when those two env vars are present.
Manual alert smoke test
Use this to verify webhook/Telegram channels without waiting for a real backlog event:
npm.cmd run worker:test-alert