Secret Hygiene — Architecture Overview

The ops-infra runbook is the source of truth; this page is the orienting index. Full details: ops-infra/runbooks/secret-hygiene-playbook.md

Why it matters

Every env var that reaches a running service must trace back to a line in a git-committed manifest or workflow file. When engineers set configuration out-of-band — via kubectl set env, a one-off SSH session, or an ad-hoc Compose file edit — the cluster drifts silently away from the repo. BL-223 surfaced this concretely: founderyos-api's running Deployment had 20+ env vars that existed nowhere in git. A naive kubectl apply -f deployment.yaml would have silently dropped every one of them, replacing the pod with whatever the repo actually declared and losing live configuration with no warning. Many services "tolerate" missing env at boot — they start fine, then fail at request time. That silent regression is the deploy-strip risk.

The second problem is exposure. Plaintext secrets on a live pod are visible to anyone with kubectl describe deploy or kubectl get deploy -o yaml. There is no git history explaining when a value was set, by whom, or why. Platform-wide, PLATFORM-009 standardized three patterns that close both gaps: every var is declared in the manifest (via valueFrom), rendered from GitHub Secrets/Variables at deploy time, and never typed directly onto a running container.

The three patterns

Pattern	When to use	Storage	Manifest reference
A — k8s `secretKeyRef` / `configMapKeyRef`	Service runs as a k3s Deployment in the `platform` or `founderyos` namespace on AX42-U	Sensitive → GitHub repo Secret; config → GitHub repo Variable	`valueFrom.secretKeyRef` / `valueFrom.configMapKeyRef`
B — Static frontend bundle	Compiled SPA (Vite/Next.js) served by nginx with no Node runtime at request time	GitHub repo Variable (never Secret — `VITE_*` values ship in the JS bundle and are public)	No runtime manifest env block; values are baked in at build time via `--build-arg`
C — Docker Compose on a VPS	Service runs under `docker compose` on a standalone Hetzner VPS (today: oracle-bridge on `65.21.149.226`)	GitHub repo Secret / Variable	`.env.staging` rendered by the deploy workflow and scp'd to the VPS; `.env.*` is gitignored, `.env.*.example` is committed

Reference implementation: payment-gateway (PLATFORM-007.1 + BL-244) is the canonical Pattern A service. Its k8s/deployment.yaml and .github/workflows/deploy-staging.yml show the full render-then-reference cycle: the deploy workflow creates/updates the k8s Secret and ConfigMap with --dry-run=client -o yaml | kubectl apply, then the manifest references every value via valueFrom — no inline secret values, no kubectl set env, no drift.
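To make the render-then-reference cycle concrete, here is a minimal Python sketch of its two halves: the Secret object a deploy workflow would create, and the env entries that point at it. It is an illustration only, not payment-gateway's actual code. The function names, the Secret name demo-secrets, and the key EXAMPLE_API_KEY are hypothetical.

```python
import base64
import json


def render_secret(name: str, namespace: str, values: dict) -> dict:
    """Build a k8s Secret object; values under `data` must be base64-encoded."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {
            key: base64.b64encode(value.encode("utf-8")).decode("ascii")
            for key, value in sorted(values.items())
        },
    }


def env_from_secret(secret_name: str, keys: list) -> list:
    """Build container `env` entries that reference the Secret by key.

    No entry carries a literal `value`, so nothing sensitive lands in git.
    """
    return [
        {"name": key, "valueFrom": {"secretKeyRef": {"name": secret_name, "key": key}}}
        for key in keys
    ]


if __name__ == "__main__":
    # Hypothetical input; a real workflow would read this from GitHub Secrets.
    secret = render_secret("demo-secrets", "platform", {"EXAMPLE_API_KEY": "abc"})
    print(json.dumps(secret, indent=2))
    print(json.dumps(env_from_secret("demo-secrets", ["EXAMPLE_API_KEY"]), indent=2))
```

The first function models what the workflow renders at deploy time; the second models what the committed manifest declares. Only the second belongs in git.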

Pattern B in practice: founderyos-dashboard (PLATFORM-009.3) completed its audit with manifest 0 / live 0 / drift 0 — the expected outcome for a static frontend. If you see a VITE_* variable configured as a k8s Secret, that is a smell: either it is not a real secret and should be a Variable, or it does not belong in the bundle at all.
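For Pattern C, the render step amounts to writing KEY=VALUE lines from values the workflow already holds. The sketch below assumes the values arrive as a plain dict (in a real workflow they would come from GitHub Secrets and Variables). The name render_env_file and the sample keys are hypothetical. The sketch rejects values that would need quoting rather than guessing at Compose's quoting rules.

```python
import re

# Conventional shell-style variable names.
_KEY_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
# Characters that would need quoting or escaping in an env file (assumption:
# refusing them is safer than emitting a line that parses differently).
_UNSAFE_VALUE_RE = re.compile(r"[\s#'\"$\\]")


def render_env_file(values: dict) -> str:
    """Render a deterministic KEY=VALUE file body, one entry per line."""
    lines = []
    for key in sorted(values):
        if not _KEY_RE.match(key):
            raise ValueError(f"invalid env var name: {key!r}")
        value = values[key]
        if _UNSAFE_VALUE_RE.search(value):
            # The message names the key only, so the value never reaches logs.
            raise ValueError(f"value for {key} needs quoting; not handled here")
        lines.append(f"{key}={value}")
    return "".join(f"{line}\n" for line in lines)


if __name__ == "__main__":
    # Hypothetical values for illustration.
    print(render_env_file({"EXAMPLE_URL": "https://example.test", "EXAMPLE_TOKEN": "abc123"}), end="")
```

Sorting the keys keeps the rendered file stable between deploys, which makes an unexpected change easier to spot.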

How to apply

New service: Pick the pattern that matches your runtime (see the decision tree in the playbook). Wire GitHub storage and the deploy-time render step before the first deploy. Do not defer secrets hygiene to a follow-up story.
Existing service with drift: Run ops-infra/scripts/audit-env-drift.sh <service> to enumerate every live-only var. Classify each one (sensitive → rotate before migration; config → ConfigMap or .env entry), then migrate per the playbook's migration checklist. Zero live-only vars is the acceptance criterion — not "the pod booted."
Rotating a secret: Follow the rotation procedure in the playbook. The short version: generate the new value → gh secret set → redeploy → verify the live pod carries the new value → deactivate the old value at the provider → append one line to bmad-artifacts/runbooks/key-rotation-log.md. For cross-service tokens (e.g. TOKEN_PAYMENT_GATEWAY shared between the k8s service-tokens Secret and oracle-bridge's .env.staging), rotate consumer side first, then caller, then drop the old token from the consumer — the playbook has the exact sequence.
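The cross-service ordering in the rotation step can be modelled in a few lines. This is a toy sketch, not the playbook's procedure: it assumes the consumer can accept two tokens at once during the overlap window, and the Consumer and Caller classes are hypothetical stand-ins for the real services.

```python
from dataclasses import dataclass, field


@dataclass
class Consumer:
    """The service that validates the shared token on incoming calls."""
    accepted: set = field(default_factory=set)

    def accepts(self, token: str) -> bool:
        return token in self.accepted


@dataclass
class Caller:
    """The service that presents the shared token on outgoing calls."""
    token: str


def rotate_shared_token(consumer: Consumer, caller: Caller, old: str, new: str) -> list:
    """Rotate in three steps; return whether the caller was accepted after each."""
    checks = []

    # Step 1: consumer side first. It now accepts both old and new.
    consumer.accepted.add(new)
    checks.append(consumer.accepts(caller.token))

    # Step 2: the caller switches to the new token.
    caller.token = new
    checks.append(consumer.accepts(caller.token))

    # Step 3: drop the old token from the consumer.
    consumer.accepted.discard(old)
    checks.append(consumer.accepts(caller.token))

    return checks


if __name__ == "__main__":
    consumer = Consumer(accepted={"old-token"})
    caller = Caller(token="old-token")
    print(rotate_shared_token(consumer, caller, "old-token", "new-token"))
```

In this model the caller is accepted after every step. Reversing steps 1 and 2 would leave a window in which the caller presents a token the consumer does not yet know.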

Drift-detection cron

BL-252 added a nightly audit to catch new drift before it causes an incident.

Workflow: ops-infra/.github/workflows/env-drift-audit.yml
Schedule: 06:00 UTC daily + workflow_dispatch

The workflow clones all audit-target repos, SSHes to AX42-U and the oracle-bridge VPS, runs audit-env-drift.sh --all --env=staging, and compares the result against the newest dated baseline in bmad-artifacts/runbooks/env-drift-audit-YYYY-MM-DD.md. Any var that appears in the live set but not in the baseline opens (or comments on) a GitHub issue in ops-infra and emits a GlitchTip warning event. When new drift is found, the scheduled run also exits non-zero, so the failure surfaces immediately in the Actions UI.

When a story intentionally adds new env vars, regenerate the baseline as documented in that baseline report. The workflow picks up the newest dated file automatically — no workflow edit required.
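The comparison logic described above can be sketched as follows. This is not the contents of audit-env-drift.sh or the workflow. It assumes the live set and the baseline have already been reduced to sets of variable names, and the function names are hypothetical.

```python
import re
from datetime import date

_BASELINE_RE = re.compile(r"^env-drift-audit-(\d{4})-(\d{2})-(\d{2})\.md$")


def newest_baseline(filenames):
    """Return the baseline filename with the latest valid date, or None."""
    dated = []
    for name in filenames:
        match = _BASELINE_RE.match(name)
        if not match:
            continue
        try:
            stamp = date(int(match[1]), int(match[2]), int(match[3]))
        except ValueError:
            continue  # Looks like a date but is not one (e.g. month 13).
        dated.append((stamp, name))
    return max(dated)[1] if dated else None


def new_drift(live, baseline):
    """Vars present on the live system but absent from the baseline."""
    return sorted(set(live) - set(baseline))


def exit_code(drift):
    """Non-zero when new drift exists, so a scheduled run shows as failed."""
    return 1 if drift else 0


if __name__ == "__main__":
    files = ["env-drift-audit-2026-04-19.md", "README.md"]
    print(newest_baseline(files))
    # Hypothetical variable names for illustration.
    drift = new_drift({"EXAMPLE_A", "EXAMPLE_B"}, {"EXAMPLE_A"})
    print(drift, exit_code(drift))
```

Because the newest file is chosen by the date in its name, adding a later-dated baseline is enough to change what the comparison runs against.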

What NOT to do

Never run kubectl set env against a live Deployment. The only sanctioned use is inside audit-env-drift.sh, which is read-only. Every out-of-band set creates exactly the drift PLATFORM-009 exists to eliminate.
Never inline a secret value in a committed manifest. Even briefly — git history is permanent. Once a value lands in git it must be rotated, no exceptions.
Never reuse staging secrets in production. Staging and production tokens, signing PEMs, and API keys must be distinct. Staging key separation is tracked as BL-206 (pre-prod cutover).
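The second rule lends itself to a mechanical check. The sketch below flags container env entries that carry a literal value under a sensitive-looking name. It is a heuristic illustration, not an existing tool in ops-infra: the name pattern and the function name are assumptions, and a name-based match will miss secrets stored under innocuous names.

```python
import re

# Assumed heuristic for "looks sensitive"; tune to your own naming conventions.
_SENSITIVE_NAME_RE = re.compile(r"(SECRET|TOKEN|PASSWORD|PRIVATE|_KEY|_PEM)", re.IGNORECASE)


def inline_secret_names(env):
    """Return names of env entries that inline a value under a sensitive name.

    `env` is a container's env list as parsed from a manifest: a list of
    dicts with `name` plus either `value` (literal) or `valueFrom` (reference).
    """
    flagged = []
    for entry in env:
        name = entry.get("name", "")
        if "value" in entry and _SENSITIVE_NAME_RE.search(name):
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    # Hypothetical entries for illustration.
    env = [
        {"name": "LOG_LEVEL", "value": "info"},
        {"name": "EXAMPLE_API_TOKEN", "value": "not-a-real-token"},
        {"name": "EXAMPLE_DB_PASSWORD",
         "valueFrom": {"secretKeyRef": {"name": "demo-secrets", "key": "EXAMPLE_DB_PASSWORD"}}},
    ]
    print(inline_secret_names(env))
```

A check like this is most useful before commit. Once a value has reached git history, the rule above still applies: rotate it.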

Cross-references

Reference	Purpose
`ops-infra/runbooks/secret-hygiene-playbook.md`	Full playbook — patterns, migration checklist, rotation cadence table, emergency rotation procedure
`oracle-bridge VPS`	Pattern C worked example — `.env.{staging,production}` rendered from GH Secrets + scp'd to the VPS, loaded by `docker compose`
`ops-infra/scripts/audit-env-drift.sh`	Read-only drift detection script
`ops-infra/.github/workflows/env-drift-audit.yml`	Nightly drift-detection cron (BL-252)
`bmad-artifacts/runbooks/env-drift-audit-2026-04-19.md`	PLATFORM-009.1 baseline — catalogued 23 drifted vars across the fleet
`bmad-artifacts/runbooks/platform-009-3-founderyos-dashboard-disposition.md`	Pattern B disposition — founderyos-dashboard
`bmad-artifacts/runbooks/platform-009-5-payment-gateway-notification-service-disposition.md`	Pattern A disposition — payment-gateway + notification-service
`ops-infra/runbooks/api-gateway-add-service.md`	PLATFORM-006.4 service-tokens pattern (platform namespace)
`system-topology.md`	Cross-machine architecture overview — where each secret lives (VPS `.env`, k8s Secret namespaces, external provider dashboards) in the broader service map
BL-223	Discovery story — 20+ drifted vars on founderyos-api; triggered PLATFORM-009
BL-244	payment-gateway ICP env reconciliation; Pattern A reference extension
BL-206	Pre-prod key separation — staging vs production signing PEM rotation
