Operations

Runbooks, health checks, dashboard routing, worker automation, and admin bootstrap

Core runbooks

  • Release and rollback: docs/release-rollback-runbook.md
  • Incident response: docs/incident-response-playbook.md
  • Go-live checklist: docs/go-live-checklist.md
  • Secret rotation drill: docs/secret-rotation-drill.md
  • Backup/restore drill: docs/backup-restore-drill.md
  • Pilot provider onboarding: docs/pilot-provider-onboarding-runbook.md
  • Environment and secrets matrix: docs/environment-secrets-matrix.md
  • Supabase bootstrap checklist: docs/supabase-bootstrap.md

Health checks

  • Basic: /api/health
  • Optional DB check: /api/health/db (requires HEALTHCHECK_TOKEN and request header x-healthcheck-token)
  • Optional auth check: /api/health/auth (requires HEALTHCHECK_TOKEN and request header x-healthcheck-token)
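
As a rough illustration of the token requirement above, the sketch below builds the URL and headers for each check. The helper name and return shape are hypothetical; only the paths and the x-healthcheck-token header come from this page.

```typescript
// Hypothetical helper: builds the URL and headers for a health check request.
// Only the paths and the header name are taken from the docs above.
type HealthCheckPath = "/api/health" | "/api/health/db" | "/api/health/auth";
type HealthCheckRequest = { url: string; headers: Record<string, string> };

function buildHealthCheckRequest(
  baseUrl: string,
  path: HealthCheckPath,
  token?: string,
): HealthCheckRequest {
  // Drop trailing slashes so the joined URL has exactly one separator.
  const url = baseUrl.replace(/\/+$/, "") + path;
  // The basic check is unauthenticated.
  if (path === "/api/health") return { url, headers: {} };
  // The optional DB/auth checks expect the HEALTHCHECK_TOKEN value in a header.
  if (!token) throw new Error(`${path} requires HEALTHCHECK_TOKEN`);
  return { url, headers: { "x-healthcheck-token": token } };
}
```

The result can be passed to any HTTP client, for example fetch(request.url, { headers: request.headers }).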

RLS session context

Baseline RLS policies expect per-request session variables (set via set_config(...)). The web app sets these inside DB transactions using src/db/rls.ts (e.g., on /dashboard/admin).
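
A minimal sketch of what setting per-request session variables can look like, assuming hypothetical setting names (app.user_id, app.role, and so on). The real names and helpers live in src/db/rls.ts and the RLS SQL, and may differ. The third argument to PostgreSQL's set_config (is_local = true) scopes each value to the current transaction.

```typescript
// Hypothetical setting names; check src/db/rls.ts for the real ones.
type RlsContext = {
  userId: string;
  role: string;
  providerId?: string | null;
  firmId?: string | null;
};
type SqlStatement = { text: string; values: string[] };

// Returns parameterized statements to run at the start of a transaction.
function rlsContextStatements(ctx: RlsContext): SqlStatement[] {
  const settings: Array<[string, string]> = [
    ["app.user_id", ctx.userId],
    ["app.role", ctx.role],
    // Assumption: a missing scope is sent as an empty string.
    ["app.provider_id", ctx.providerId ?? ""],
    ["app.firm_id", ctx.firmId ?? ""],
  ];
  // is_local = true: the value is discarded when the transaction ends,
  // so it cannot leak to another request on a pooled connection.
  return settings.map(([name, value]) => ({
    text: "select set_config($1, $2, true)",
    values: [name, value],
  }));
}
```

Because the values are transaction-local, the statements and the protected queries need to run inside the same transaction.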

If you want RLS to be a hard enforcement boundary, ensure DATABASE_URL uses a non-owner database role so it cannot bypass RLS.

For Supabase pooler runtime URLs with custom roles, use port 6543 (transaction mode).

Notes:

  • user_invitations, users, and user_identities are RLS-protected.
  • worker_heartbeats is RLS-protected; worker writes and worker-health reads run with internal system RLS context.
  • Signup-time invite checks and domain user bootstrap run with an internal system RLS role set server-side in auth flows.
  • user_invitations writes are role-shape constrained by RLS (provider roles require provider_id; attorneys require firm_id; client/platform admin invites are unscoped).
  • users writes are role-shape constrained by RLS with the same scope rules (provider_admin/counselor -> provider-scoped, attorney -> firm-scoped, client/platform_admin -> unscoped).

Runtime security guardrail

Run this in staging/production runtime context to verify non-owner DB role safety, required RLS enablement, and worker_heartbeats system read/write behavior:

npm run ops:verify-runtime-security

Notes:

  • The command expects DATABASE_URL to match the deployed runtime role (non-owner).
  • Use --allow-owner only for local/admin checks where owner credentials are intentional.

Deployment health guardrail

Run this after deploy to validate web health plus worker health behavior:

npm run ops:verify-deployment-health -- --base-url https://staging.certaos.com

Notes:

  • Prefer passing --base-url; otherwise the command resolves from HEALTHCHECK_BASE_URL, BETTER_AUTH_BASE_URL, BETTER_AUTH_URL, NEXT_PUBLIC_BETTER_AUTH_URL, or VERCEL_URL.
  • Worker health mode defaults to auto:
    • token mode when HEALTHCHECK_TOKEN is available
    • anonymous mode otherwise
  • Use --worker-health-mode anonymous to always verify /api/health/worker is protected (HTTP 404), regardless of local token values.
  • Use --worker-health-mode token to require token-based worker health verification (HTTP 200 + { worker: true }).
  • Optional override: --healthcheck-token <token> to pass a one-off token without changing local env files.
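
The resolution rules above can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: the function names are hypothetical, and prefixing https:// to a scheme-less value is an assumption (Vercel documents VERCEL_URL as a host name without a protocol).

```typescript
type Env = Record<string, string | undefined>;
type WorkerHealthMode = "token" | "anonymous";

// Fallback order as listed in the notes above.
const BASE_URL_ENV_ORDER: string[] = [
  "HEALTHCHECK_BASE_URL",
  "BETTER_AUTH_BASE_URL",
  "BETTER_AUTH_URL",
  "NEXT_PUBLIC_BETTER_AUTH_URL",
  "VERCEL_URL",
];

// --base-url wins; otherwise the first non-empty env var in order.
function resolveBaseUrl(flag: string | undefined, env: Env): string | undefined {
  const candidates = [flag, ...BASE_URL_ENV_ORDER.map((key) => env[key])];
  const raw = candidates.map((v) => v?.trim()).find((v) => Boolean(v));
  if (!raw) return undefined;
  // Assumption: add a scheme when the value is host-only (e.g. VERCEL_URL).
  return /^https?:\/\//.test(raw) ? raw : `https://${raw}`;
}

// "auto" picks token mode only when a token is available.
function resolveWorkerHealthMode(
  requested: WorkerHealthMode | "auto",
  token: string | undefined,
): WorkerHealthMode {
  if (requested === "auto") return token ? "token" : "anonymous";
  if (requested === "token" && !token) {
    throw new Error("token mode requires a healthcheck token");
  }
  return requested;
}

// Expected /api/health/worker status per mode, as described above.
function expectedWorkerStatus(mode: WorkerHealthMode): number {
  return mode === "token" ? 200 : 404;
}
```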

Performance baseline guardrail

Run this after deploy (or on a schedule) to verify endpoint availability and latency baseline:

npm run ops:verify-performance -- --base-url https://staging.certaos.com

Notes:

  • Default endpoints: /, /api/health, /providers
  • Default sample count: 3 requests per endpoint
  • Default thresholds:
    • --max-error-rate 0
    • --max-avg-ms 2500
    • --max-p95-ms 5000
  • Override targets and thresholds as needed:
    • --endpoints /,/api/health,/docs
    • --samples 5
    • --max-avg-ms 3500 --max-p95-ms 7000
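
To make the threshold semantics concrete, here is a small sketch of how one endpoint's samples could be evaluated against the defaults above. The type names and the nearest-rank percentile method are assumptions; the actual script may compute p95 differently.

```typescript
type Sample = { ms: number; ok: boolean };
type Thresholds = { maxErrorRate: number; maxAvgMs: number; maxP95Ms: number };

// Defaults taken from the notes above.
const DEFAULT_THRESHOLDS: Thresholds = { maxErrorRate: 0, maxAvgMs: 2500, maxP95Ms: 5000 };

// Nearest-rank percentile (assumed method): the value at rank ceil(p/100 * n).
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1] as number;
}

function evaluateEndpoint(samples: Sample[], t: Thresholds = DEFAULT_THRESHOLDS) {
  const errorRate = samples.filter((s) => !s.ok).length / samples.length;
  const durations = samples.map((s) => s.ms);
  const avgMs = durations.reduce((sum, ms) => sum + ms, 0) / durations.length;
  const p95Ms = percentile(durations, 95);
  const pass =
    errorRate <= t.maxErrorRate && avgMs <= t.maxAvgMs && p95Ms <= t.maxP95Ms;
  return { errorRate, avgMs, p95Ms, pass };
}
```

Under the nearest-rank method, the default of 3 samples makes p95 equal to the slowest request, so a single slow response can fail the check. Raising --samples smooths this out.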

GitHub Actions deployment checks

Use workflow .github/workflows/deployment-health.yml for recurring/manual deploy smoke checks across staging and production domains. If a check fails, the workflow automatically opens/updates a GitHub incident issue and auto-closes it after both environments recover.

Use workflow .github/workflows/performance-baseline.yml for recurring/manual staging+production performance baseline checks (with the same incident open/update/close pattern).

Optional GitHub repository secrets (for token-auth worker health validation):

  • STAGING_HEALTHCHECK_TOKEN
  • PRODUCTION_HEALTHCHECK_TOKEN

Migration/admin DB URL

For schema migrations and RLS SQL apply, keep admin credentials separate from runtime:

  • DATABASE_URL: runtime app/worker role (non-owner)
  • MIGRATION_DATABASE_URL: owner/admin role used by npm run db:migrate and npm run db:rls:apply

MIGRATION_DATABASE_SSL_INSECURE is optional and falls back to DATABASE_SSL_INSECURE when unset.
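
The fallback can be pictured with the sketch below. The function name is hypothetical, and two behaviors are assumptions rather than documented facts: that a missing MIGRATION_DATABASE_URL is an error, and that the SSL flag is enabled by the literal string "true".

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical resolver for the migration connection settings.
function resolveMigrationDb(env: Env): { url: string; sslInsecure: boolean } {
  const url = env["MIGRATION_DATABASE_URL"]?.trim();
  // Assumption: migrations refuse to run without the admin URL.
  if (!url) throw new Error("MIGRATION_DATABASE_URL is required for migrations");
  // Documented fallback: use DATABASE_SSL_INSECURE when the migration flag is unset.
  const flag = env["MIGRATION_DATABASE_SSL_INSECURE"] ?? env["DATABASE_SSL_INSECURE"];
  // Assumption: only the string "true" enables the flag.
  return { url, sslInsecure: flag?.trim().toLowerCase() === "true" };
}
```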

Provider marketplace

Public provider pages (like /providers and /:providerSlug/enroll) rely on a public-read RLS allowance for providers where approval_status='approved' and is_active=true.

/:providerSlug/enroll supports invite-based enrollment resume by accepting an existing enrollment ID and redirecting to /course/<enrollmentId>. It also supports direct public intake capture via Start New Enrollment, which persists enrollment_requests rows for provider triage (individual or joint filing mode).

Provider applications

Prospective providers can submit an application at /become-a-provider. Platform admins review these in /dashboard/admin under Provider Applications:

  • Update the application status (submitted, under_review, approved, denied)
  • Use Create Provider to create and link a providers row from the application (defaults approval_status='under_review' so it is not public until explicitly approved/activated)
  • Use Approve + Activate to set the linked provider to approval_status='approved' and is_active=true (makes it eligible to show on /providers)
  • Use Invite Provider Admin to create a provider_admin invite for the application contact email (provider-scoped)

Phase 0 operational baseline

  • Branch protection enabled on main
  • CI checks required before merge
  • Docs updates required for code changes
  • Vercel deployments created for staging/prod (use per-deployment VERCEL_URL for auth origin on previews)

Dashboard routing

/dashboard is a role router: after sign-in/sign-up it routes to the correct role dashboard (e.g. /dashboard/provider, /dashboard/admin).

Course sessions

When a client clicks Begin Course on /course/<enrollmentId>, the app creates a course_sessions row (scaffold) and transitions the enrollment to in_progress.

Before Accept Invite transitions an enrollment from invited -> enrolled, the client must acknowledge disclosures and satisfy payment/waiver preconditions:

  • disclosure acknowledgment is recorded to audit_logs
  • payment is either waived (enrollments.has_fee_waiver=true) or confirmed by a successful payments row (payments.status='succeeded')
  • if an enrollment is part of a joint household (enrollments.household_id), one successful payment in that household satisfies payment preconditions for both enrollments
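
The payment precondition above can be sketched as a pure function. The record shapes are simplified and hypothetical; in particular, linking a payment to an enrollment through an enrollmentId field is an assumption about the schema.

```typescript
// Simplified, hypothetical shapes for illustration only.
type Enrollment = { id: string; householdId: string | null; hasFeeWaiver: boolean };
type Payment = { enrollmentId: string; status: string };

function paymentPreconditionMet(
  target: Enrollment,
  allEnrollments: Enrollment[],
  payments: Payment[],
): boolean {
  // A fee waiver satisfies the precondition on its own.
  if (target.hasFeeWaiver) return true;
  // Collect the target plus any enrollment in the same joint household.
  const coveredIds = new Set<string>([target.id]);
  if (target.householdId !== null) {
    for (const e of allEnrollments) {
      if (e.householdId === target.householdId) coveredIds.add(e.id);
    }
  }
  // One succeeded payment anywhere in that set is enough.
  return payments.some(
    (p) => p.status === "succeeded" && coveredIds.has(p.enrollmentId),
  );
}
```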

Stripe course-fee checkout now uses:

  • checkout start from /course/<enrollmentId>
  • webhook confirmation endpoint: /api/payments/stripe/webhook
  • env vars: STRIPE_SECRET_KEY, STRIPE_WEBHOOK_SECRET, and optional COURSE_FEE_CENTS_DEFAULT (defaults to 5000 cents; values above 5000 are capped at 5000)
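
A small sketch of the default-and-cap rule for COURSE_FEE_CENTS_DEFAULT. The function name is hypothetical, and falling back to the default for missing or invalid values is an assumption; the app may reject such values instead.

```typescript
const COURSE_FEE_DEFAULT_CENTS = 5000;
const COURSE_FEE_CAP_CENTS = 5000;

// Resolves the course fee in cents from the raw env value.
function resolveCourseFeeCents(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return COURSE_FEE_DEFAULT_CENTS;
  const parsed = Number(raw.trim());
  // Assumption: non-integer or negative input falls back to the default.
  if (!Number.isInteger(parsed) || parsed < 0) return COURSE_FEE_DEFAULT_CENTS;
  // Values above the cap are clamped to it.
  return Math.min(parsed, COURSE_FEE_CAP_CENTS);
}
```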

Temporary pre-Stripe fallback:

  • if MANUAL_PAYMENT_FALLBACK_ENABLED=true and STRIPE_SECRET_KEY is unset, invited clients can record a manual payment on /course/<enrollmentId>
  • this writes a real payments row with status='succeeded' (so compliance preconditions still rely on persisted payment state)
  • keep this disabled in production once Stripe is live

While an in-progress course page is open, the client sends a best-effort heartbeat to /api/course/heartbeat to update course_sessions.last_seen_at and increment total_seconds (capped per ping).

If the gap between two heartbeats exceeds 120 seconds, the whole gap is treated as idle time: it is added to idle_seconds and does not count toward total_seconds.
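
The heartbeat accounting can be sketched like this. The 120-second idle threshold comes from the text above; the per-ping cap value (60 seconds here) and the field names are assumptions, since this page only says credit is "capped per ping".

```typescript
const IDLE_GAP_SECONDS = 120;
// Assumed value: the docs state a per-ping cap exists but not its size.
const MAX_CREDIT_PER_PING_SECONDS = 60;

// Timestamps are epoch milliseconds; counters are whole seconds.
type SessionClock = { lastSeenAtMs: number; totalSeconds: number; idleSeconds: number };

function applyHeartbeat(session: SessionClock, nowMs: number): SessionClock {
  const gap = Math.max(0, Math.floor((nowMs - session.lastSeenAtMs) / 1000));
  if (gap > IDLE_GAP_SECONDS) {
    // Long gap: record it as idle time and credit nothing.
    return {
      lastSeenAtMs: nowMs,
      totalSeconds: session.totalSeconds,
      idleSeconds: session.idleSeconds + gap,
    };
  }
  // Normal gap: credit it, up to the per-ping cap.
  return {
    lastSeenAtMs: nowMs,
    totalSeconds: session.totalSeconds + Math.min(gap, MAX_CREDIT_PER_PING_SECONDS),
    idleSeconds: session.idleSeconds,
  };
}
```

Computing the gap from server timestamps, rather than trusting a client-supplied duration, is what keeps the verified time server-calculated.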

Minimum time gating is enforced using server-calculated time:

  • CC: 60 minutes (3,600 seconds)
  • DE: 120 minutes (7,200 seconds)

If a course action is blocked by a precondition, the course page shows a reason (e.g. cc_minimum_time_required) and the verified-time vs required-time snapshot.
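
A minimal sketch of the minimum-time gate and the snapshot it returns. The thresholds and the cc_minimum_time_required reason come from this page; the result shape is hypothetical, and the DE reason code is an assumption made by analogy.

```typescript
type CourseType = "CC" | "DE";
type GateResult = {
  allowed: boolean;
  reason?: string;
  verifiedSeconds: number;
  requiredSeconds: number;
};

// Minimums from the list above: CC 60 minutes, DE 120 minutes.
const MINIMUM_SECONDS: Record<CourseType, number> = { CC: 3600, DE: 7200 };

function checkMinimumTime(course: CourseType, verifiedSeconds: number): GateResult {
  const requiredSeconds = MINIMUM_SECONDS[course];
  if (verifiedSeconds >= requiredSeconds) {
    return { allowed: true, verifiedSeconds, requiredSeconds };
  }
  // "cc_minimum_time_required" is documented; the DE variant is assumed.
  const reason = `${course.toLowerCase()}_minimum_time_required`;
  return { allowed: false, reason, verifiedSeconds, requiredSeconds };
}
```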

Audit logs

Enrollment status transitions are recorded as audit_logs rows. Platform admins can review recent events in /dashboard/admin under Recent Audit Logs.

Email notifications

Invite and certificate notification attempts are recorded in notification_deliveries and shown in /dashboard/admin under Recent Notification Deliveries.

  • Provider: Resend (RESEND_API_KEY)
  • Sender identity: EMAIL_FROM (falls back to CertaOS <noreply@send.certaos.com>)
  • Current events:
    • user.invite.* (platform admin/provider counselor/admin invite flows)
    • enrollment.client_invite (attorney Invite Client action)
    • certificate.issued (worker certificate issuer and manual Issue Certificate action)

Email smoke test

Use this command to validate Resend credentials and verify a delivery attempt is persisted:

npm.cmd run ops:test-email -- --env staging --to you@example.com
npm.cmd run ops:test-email -- --env production --to you@example.com

Notes:

  • --env supports local, staging, and production and auto-selects .env.<env>.local.
  • Optional override: --dotenv <path> for custom env file locations.

Admin Bootstrap (Invites and Roles)

You can manage invites either via the platform admin page or via CLI scripts.

Resetting test accounts (Staging/Dev)

If staging signup says "user already exists" and you don't know the password (no password-reset email flow is configured yet), you can delete the Better Auth identity so the email can sign up again:

$env:DOTENV_CONFIG_PATH='.env.staging.local'
node scripts/reset-auth-user.js --email you@example.com --yes true --delete-domain true

Then re-issue an invite (if needed) and sign up again at /sign-up.

Seeding demo data (Staging)

To populate staging with a demo provider + sample entities (firm/attorney/client/enrollment) and invites for test accounts:

$env:DOTENV_CONFIG_PATH='.env.staging.local'
npm run seed:staging

Platform Admin UI

  • Sign in as a platform_admin, then open /dashboard/admin.
  • Use Create Invite to allow a specific email to sign up. Production sign-up is invite-only unless ALLOW_PUBLIC_SIGNUP=true.
  • Use Billing Snapshot to monitor payment state (succeeded, pending, failures), manual fallback payments, and fee-waiver counts.
  • Billing Snapshot also surfaces household rollups (Joint Enrollments, Joint Households, Paid Households).
  • Use Recent Payments for record-level review (provider/client/enrollment, source, status, and classification).
  • Recent Payments includes a household column to track joint-pair payment coverage.
  • Use Recent Households to monitor grouped joint cases (members, statuses, payment coverage, and direct link to course records).
  • Use Recent Public Enrollment Requests to monitor consumer intake submitted from provider public enrollment pages.
  • Intake request status can be updated inline (submitted, reviewing, invited, rejected, cancelled) from both admin and provider dashboards.
  • Platform admins and provider admins (within provider scope) can use Convert + Invite to turn a request into active enrollment record(s) and send enrollment invite emails.
  • Payments and households support query filters (payments_q, households_q) to speed operational triage.
  • For role-scoped onboarding:
    • provider_admin and counselor invites should include a Provider.
    • attorney invites should include a Firm.
  • Use Providers to approve/activate providers for public display in /providers.
  • Use DE Court Filing Queue to review DE enrollments awaiting filing (status cert_issued) and click Queue Now to force an immediate worker retry.
  • Use DE Deadlines to review DE enrollments with deadlines inside 30 days (or overdue), including the last alert level emitted by the worker. Use Save to edit a deadline and reset alerts.
  • Use Revoke to remove a pending invite.

Provider Admin UI

  • Sign in as a provider_admin, then open /dashboard/provider.
  • Use Invite Counselor to create counselor invites scoped to your provider.
  • Use Public Enrollment Requests to triage intake requests submitted from your public /:providerSlug/enroll page.
  • Update request status inline as intake progresses (submitted -> reviewing -> invited/rejected/cancelled).
  • Use Convert + Invite to create direct-intake enrollment record(s) and send client invite email(s) from provider triage.
  • View recent enrollments for your provider and open the counselor queue.
  • Recent enrollments now include household-aware payment labels (waived, household paid, paid, client pays) and household IDs when present.
  • Provider recent enrollments now include an enrollments_q search filter for household/client/payment triage.
  • Use DE Deadlines to view DE items due within 30 days (or overdue) for your provider, including the last alert level sent by the worker.
  • Use DE Court Filing Queue to view DE enrollments in cert_issued that are not yet filed with court (escalate to a platform admin to re-queue).

Attorney UI

  • Sign in as an attorney, then open /dashboard/attorney.
  • Use Create Enrollment (client email is required).
  • For joint cases:
    • Set Filing Mode to joint.
    • Provide spouse/joint-filer identity fields (first/last name, DOB, SSN last 4, email).
    • The app links the two client records (clients.joint_filer_id) and creates a second enrollment for the spouse (separate certificate path).
    • The two enrollments are grouped by enrollments.household_id so a single payment unlocks invite acceptance for both filers.
  • For DE enrollments:
    • Chapter 7 requires either 341 Meeting Date or DE Filing Deadline.
    • If Chapter 7 deadline is blank and 341 date is set, the app auto-calculates de_filing_deadline = meeting_341_date + 60 days.
    • Chapter 11/13 requires an explicit DE Filing Deadline.
  • In Recent Enrollments, use Invite Client to create/refresh a user_invitations row for that email (role client).
  • For joint household enrollments, use Invite Household to sync invites for both spouse enrollments in one action.
  • Payment labels in Recent Enrollments reflect household scope:
    • waived
    • household paid (a spouse enrollment has a successful payment)
    • paid (this enrollment has a successful payment)
    • client pays
  • The client can then sign up at /sign-up (production is invite-only by default), view enrollments at /dashboard/client, and open the course at /course/<enrollmentId>.
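
The DE deadline rules in the list above can be sketched as follows. Field and function names are hypothetical, dates are assumed to be YYYY-MM-DD strings, and the day arithmetic is done in UTC to avoid daylight-saving drift.

```typescript
type Chapter = 7 | 11 | 13;
type DeDeadlineInput = {
  chapter: Chapter;
  meeting341Date?: string; // YYYY-MM-DD
  deFilingDeadline?: string; // YYYY-MM-DD
};

// Adds whole days to a calendar date, working in UTC.
function addDays(isoDate: string, days: number): string {
  const d = new Date(`${isoDate}T00:00:00Z`);
  if (Number.isNaN(d.getTime())) throw new Error(`invalid date: ${isoDate}`);
  d.setUTCDate(d.getUTCDate() + days);
  return d.toISOString().slice(0, 10);
}

function resolveDeFilingDeadline(input: DeDeadlineInput): string {
  // An explicit deadline always wins.
  if (input.deFilingDeadline) return input.deFilingDeadline;
  if (input.chapter === 7) {
    // Chapter 7: derive the deadline from the 341 meeting date.
    if (input.meeting341Date) return addDays(input.meeting341Date, 60);
    throw new Error("Chapter 7 requires either a 341 Meeting Date or a DE Filing Deadline");
  }
  // Chapter 11/13: no auto-calculation.
  throw new Error(`Chapter ${input.chapter} requires an explicit DE Filing Deadline`);
}
```

For example, a Chapter 7 enrollment with a 341 meeting on 2025-01-15 and no explicit deadline resolves to 2025-03-16.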

Counselor UI

  • Sign in as a counselor, then open /dashboard/counselor.
  • Work items appear when a client submits an in-progress enrollment for counselor/escalation review from /course/<enrollmentId>.
  • Use Complete to transition the enrollment to completed.
  • The worker automatically issues certificates by transitioning completed -> cert_issued (certificate generation is a scaffold until CGS integration lands).
  • If automation is disabled, you can still use Issue Certificate on /course/<enrollmentId> to transition completed -> cert_issued.
  • Issued certificates show up on /dashboard/client, /dashboard/attorney, and /dashboard/provider (issued-at timestamp).

CLI (Local/Staging)

From repo root:

  • Create or refresh an invite:
    • npm run user:invite -- --email you@example.com --role platform_admin --expires-in-days 14
    • Optional scoping: --provider-id <uuid|null> / --firm-id <uuid|null>
  • Promote an existing domain user (after signing in once and visiting /dashboard):
    • npm run user:set-role -- --email you@example.com --role platform_admin
    • Optional scoping: --provider-id <uuid|null> / --firm-id <uuid|null>

Role scope rules (enforced by scripts + RLS):

  • provider_admin / counselor: require provider_id and no firm_id
  • attorney: require firm_id and no provider_id
  • client / platform_admin: must be unscoped (provider_id and firm_id null)
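
The scope rules above can be expressed as a small validator. This is an illustrative sketch, not the scripts' actual code; the function name and the error strings are hypothetical.

```typescript
type Role = "platform_admin" | "provider_admin" | "counselor" | "attorney" | "client";
type Scope = { providerId: string | null; firmId: string | null };

// Returns an error message, or null when the role/scope pair is valid.
function validateRoleScope(role: Role, scope: Scope): string | null {
  switch (role) {
    case "provider_admin":
    case "counselor":
      // Provider-scoped roles: provider_id required, firm_id forbidden.
      if (!scope.providerId) return `${role} requires provider_id`;
      if (scope.firmId) return `${role} must not have firm_id`;
      return null;
    case "attorney":
      // Firm-scoped role: firm_id required, provider_id forbidden.
      if (!scope.firmId) return "attorney requires firm_id";
      if (scope.providerId) return "attorney must not have provider_id";
      return null;
    case "client":
    case "platform_admin":
      // Unscoped roles: both IDs must be null.
      if (scope.providerId || scope.firmId) return `${role} must be unscoped`;
      return null;
  }
}
```

A check like this in the scripts gives a readable error before the write is attempted, while the RLS policies remain the enforcing layer.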

These scripts read DATABASE_URL from .env.local by default (override with DOTENV_CONFIG_PATH).

Vercel Env Notes (Windows)

When using vercel env add from PowerShell, piped values may include a trailing newline. The app trims DATABASE_URL and DATABASE_SSL_INSECURE at runtime, but it is still best to paste values directly in Vercel's UI or ensure your CLI input has no extra whitespace.

Worker automation (Certificates)

The worker runs a certificate issuer cron that scans for completed enrollments and issues certificates (cert_issued) in an audited, idempotent way.

Requirements:

  • Railway deploys this repo with railway.json so the process starts as npm run worker.
  • Create a dedicated domain user to act as the worker (role platform_admin):
    $env:DOTENV_CONFIG_PATH='.env.staging.local'
    npm.cmd run worker:create-actor
  • Set WORKER_ACTOR_USER_ID and WORKER_ACTOR_ROLE=platform_admin in the worker environment (staging and production).

Runtime behavior and tuning

  • The issuer retries transient transition failures with exponential backoff.
  • If a stale backlog exists (completed older than SLA threshold), the worker logs a warning and emits a periodic alert.
  • Optional webhook alert sink: WORKER_ALERT_WEBHOOK_URL (JSON POST payload).
  • Optional Telegram alert sink: WORKER_ALERT_TELEGRAM_BOT_TOKEN + WORKER_ALERT_TELEGRAM_CHAT_ID.
  • Optional quiet-hours suppression for backlog alerts (UTC-based).
  • Optional once-daily digest alert (UTC hour/minute).

Config knobs (worker env):

  • CERT_ISSUER_POLL_INTERVAL_MS (default 60000)
  • CERT_ISSUER_BATCH_SIZE (default 25)
  • CERT_ISSUER_MAX_BATCHES_PER_TICK (default 5)
  • CERT_ISSUER_RETRY_ATTEMPTS (default 3)
  • CERT_ISSUER_RETRY_BASE_DELAY_MS (default 500)
  • CERT_ISSUER_STALE_COMPLETED_SLA_HOURS (default 6)
  • CERT_ISSUER_ALERT_COOLDOWN_MS (default 900000)
  • CERT_ISSUER_ALERT_QUIET_HOURS_ENABLED (default false)
  • CERT_ISSUER_ALERT_QUIET_HOURS_START_UTC (default 0)
  • CERT_ISSUER_ALERT_QUIET_HOURS_END_UTC (default 0)
  • CERT_ISSUER_DAILY_DIGEST_ENABLED (default false)
  • CERT_ISSUER_DAILY_DIGEST_HOUR_UTC (default 14)
  • CERT_ISSUER_DAILY_DIGEST_MINUTE_UTC (default 0)
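
To show how the retry knobs interact, here is a sketch of exponential backoff using the defaults above (3 attempts, 500 ms base delay). Two points are assumptions: that CERT_ISSUER_RETRY_ATTEMPTS counts total tries rather than retries, and that the delay doubles on each retry. The sleep function is injected so the sketch has no runtime dependencies.

```typescript
// Delays between tries: base, 2*base, 4*base, ... (assumed doubling schedule).
function backoffDelaysMs(attempts = 3, baseDelayMs = 500): number[] {
  const retries = Math.max(attempts - 1, 0);
  return Array.from({ length: retries }, (_, i) => baseDelayMs * 2 ** i);
}

// Runs fn up to `attempts` times, sleeping between failed tries.
async function withRetry<T>(
  fn: () => Promise<T>,
  sleep: (ms: number) => Promise<void>,
  attempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  const delays = backoffDelaysMs(attempts, baseDelayMs);
  let lastError: unknown = new Error("withRetry called with attempts < 1");
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      const delay = delays[i];
      // No delay entry means this was the final try.
      if (delay === undefined) break;
      await sleep(delay);
    }
  }
  throw lastError;
}
```

With the defaults, a transition that keeps failing is tried three times with waits of 500 ms and 1000 ms in between, then the error is surfaced.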

Quiet-hours notes:

  • Quiet hours only suppress immediate backlog alerts (certificate_backlog_alert).
  • Daily digest (certificate_backlog_daily_digest) still sends on schedule when enabled.
  • START_UTC == END_UTC is treated as no quiet window.
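
The quiet-hours rules can be sketched as below. The start == end rule and the digest exemption come from the notes above; treating the end hour as exclusive and letting a window wrap past midnight (for example 22 to 6) are assumptions about the worker's behavior.

```typescript
type AlertKind = "certificate_backlog_alert" | "certificate_backlog_daily_digest";

// Hours are integers 0-23 in UTC.
function isQuietHour(hourUtc: number, startUtc: number, endUtc: number): boolean {
  // Documented: equal start and end means no quiet window.
  if (startUtc === endUtc) return false;
  // Same-day window, end exclusive (assumed).
  if (startUtc < endUtc) return hourUtc >= startUtc && hourUtc < endUtc;
  // Window that wraps past midnight (assumed), e.g. 22 -> 6.
  return hourUtc >= startUtc || hourUtc < endUtc;
}

function shouldSendAlert(
  kind: AlertKind,
  quietHoursEnabled: boolean,
  hourUtc: number,
  startUtc: number,
  endUtc: number,
): boolean {
  // The daily digest is never suppressed by quiet hours.
  if (kind === "certificate_backlog_daily_digest") return true;
  return !(quietHoursEnabled && isQuietHour(hourUtc, startUtc, endUtc));
}
```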

Telegram wiring

  1. Create a bot with @BotFather and copy the bot token.
  2. Add the bot to your target channel/group (or DM it first).
  3. Capture the chat ID and set:
    • WORKER_ALERT_TELEGRAM_BOT_TOKEN
    • WORKER_ALERT_TELEGRAM_CHAT_ID

The worker sends Telegram messages for certificate backlog alerts when those two env vars are present.
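
For reference, a Telegram alert is a POST to the Bot API's sendMessage method. The sketch below builds such a request from the two env values; the helper name and message text are hypothetical, and the worker's actual payload format is not documented here.

```typescript
type TelegramRequest = {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
};

// Builds a Bot API sendMessage request from the bot token and chat ID.
function buildTelegramAlert(botToken: string, chatId: string, text: string): TelegramRequest {
  return {
    // Bot API methods are called at /bot<token>/<method>.
    url: `https://api.telegram.org/bot${botToken}/sendMessage`,
    init: {
      method: "POST",
      headers: { "content-type": "application/json" },
      // chat_id and text are the required sendMessage parameters.
      body: JSON.stringify({ chat_id: chatId, text }),
    },
  };
}
```

The result can be sent with fetch(request.url, request.init). A successful Bot API response is JSON with ok set to true, which is a quick way to confirm the token and chat ID are wired correctly.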

Manual alert smoke test

Use this to verify webhook/Telegram channels without waiting for a real backlog event:

npm.cmd run worker:test-alert