Operations Documentation

This section contains operational documentation for the Hello World DAO LLC platform, including monitoring, runbooks, and incident response procedures.

<repo-root> placeholder: Runbooks in this section use <repo-root> as a placeholder for wherever you've cloned the Hello World DAO repos locally (for example, ~/git or ~/code/hello-world). Substitute your own path. See Workspace Setup for the canonical clone script and recommended layout.

Quick Links

Resource	Description
Canister Monitoring	Cycle balance monitoring and top-up procedures
Monitoring & Alerting	GitHub Actions cron + GlitchTip setup
Incident Response	General incident handling

Runbooks

Detailed procedures for common operational tasks:

Runbook	Purpose
Cycles Top-Up	Top up canister cycles when low
High Error Rate	Triage elevated error rates
Canister Unresponsive	Recover unresponsive canisters
Deployment Failure	Handle failed deployments and rollback
Database Connectivity	Troubleshoot database and service connectivity
Canister Production Activation	Bring a backend canister from staging-only to production-ready
Canister Wiring After Reinstall	Restore inter-canister config after any `--mode reinstall`

Monitoring Stack

The stack is intentionally lightweight: a GitHub Actions cron job handles canister metrics and cycle top-ups, and GlitchTip handles application error tracking. There is no Prometheus, Alertmanager, or PagerDuty deployment.

┌──────────────────────────────────────────────────────────┐
│                Monitoring Infrastructure                  │
├──────────────────────────────────────────────────────────┤
│  Canister metrics + cycle top-ups                        │
│  └── ops-infra/.github/workflows/monitor-metrics.yml     │
│      (cron every 6h — runs check-cycles.sh, top-up)      │
│  └── IC Dashboard (https://dashboard.internetcomputer.org)│
├──────────────────────────────────────────────────────────┤
│  Application errors + source maps                        │
│  └── GlitchTip — https://glitchtip.founderyos.dev/4      │
│      (single project, per-suite tags)                    │
│  └── Sentry SDK in every suite (auth-token-uploaded maps)│
├──────────────────────────────────────────────────────────┤
│  Notifications                                           │
│  └── GHA workflow failures → email (devops@...)          │
│  └── GlitchTip issues → email (devops@...)               │
└──────────────────────────────────────────────────────────┘

Prometheus, Alertmanager, PagerDuty, and Grafana are not part of this stack. Earlier iterations of this doc described those tools, but they were never deployed. The stack above is what actually runs as of 2026.

See Monitoring & Alerting for full setup, GlitchTip configuration, and the monitor-metrics.yml cron schedule.

Alert Severity Levels

Severity	Response Time	Examples
Critical	< 15 minutes	Canister unresponsive, cycles depleted
Warning	< 1 hour	Low cycles, elevated error rate
Info	Next business day	No new members, high proposal volume
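
The response-time targets above can be turned into concrete deadlines. The following is a minimal sketch, not part of the platform tooling: the names RESPONSE_TARGETS and response_deadline are hypothetical, and "Next business day" is represented as None because the table gives it no fixed duration.

```python
from datetime import datetime, timedelta

# Response-time targets copied from the severity table.
# "Info" means "next business day", which has no fixed duration,
# so this sketch represents it as None (an assumption).
RESPONSE_TARGETS = {
    "critical": timedelta(minutes=15),
    "warning": timedelta(hours=1),
    "info": None,
}


def response_deadline(severity, raised_at):
    """Return the latest time a responder should act, or None for 'info'."""
    target = RESPONSE_TARGETS[severity.lower()]
    return None if target is None else raised_at + target


raised = datetime(2026, 1, 5, 9, 0)
print(response_deadline("Critical", raised))  # 2026-01-05 09:15:00
print(response_deadline("Warning", raised))   # 2026-01-05 10:00:00
print(response_deadline("Info", raised))      # None
```

An unknown severity raises KeyError, which is deliberate in this sketch: an alert with an unrecognized severity should be noticed rather than silently given a default deadline.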

On-Call Rotation

On-call engineers are the first responders for production incidents.

Responsibilities:

Acknowledge alerts within 15 minutes
Follow runbook procedures
Escalate P1/P2 incidents to team lead
Document incident resolution

Contact: See the Incident Response runbook for the rotation schedule.

Maintenance Windows

Regular maintenance activities:

Activity	Schedule	Duration
Cycles monitoring	Every 6 hours	Automated
Dashboard review	Weekly	30 minutes
Runbook review	Monthly	1 hour
DR drill	Quarterly	2 hours

Key Metrics

Metric	Target	Alert Threshold
Cycles balance	> 1T	< 1T warning, < 500B critical
Error rate	< 1%	> 5% warning, > 10% critical
Response time	< 2s	> 5s warning
Uptime	99.9%	< 99% critical
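
The thresholds in the table can be read as simple classification rules. The sketch below is illustrative only and is not the logic in check-cycles.sh; all function names are hypothetical. It assumes "T" means 10^12 cycles and "B" means 10^9 cycles, and that a value between the target and the warning threshold raises no alert.

```python
TRILLION = 10**12  # "T" in the table (assumed to mean 10^12 cycles)
BILLION = 10**9    # "B" in the table (assumed to mean 10^9 cycles)


def classify_cycles(balance):
    """Cycles balance: < 500B is critical, < 1T is warning."""
    if balance < 500 * BILLION:
        return "critical"
    if balance < 1 * TRILLION:
        return "warning"
    return "ok"


def classify_error_rate(rate_percent):
    """Error rate: > 10% is critical, > 5% is warning."""
    if rate_percent > 10:
        return "critical"
    if rate_percent > 5:
        return "warning"
    return "ok"


def classify_response_time(seconds):
    """Response time: > 5s is warning (the table defines no critical level)."""
    return "warning" if seconds > 5 else "ok"


def classify_uptime(percent):
    """Uptime: < 99% is critical."""
    return "critical" if percent < 99 else "ok"


def downtime_budget_minutes(target_percent, days=30):
    """Minutes of downtime allowed over `days` at the given uptime target."""
    return round((100 - target_percent) / 100 * days * 24 * 60, 1)


print(classify_cycles(750 * BILLION))  # warning
print(classify_error_rate(12))         # critical
print(downtime_budget_minutes(99.9))   # 43.2
```

Under these rules an error rate of 3% is above the 1% target but below the 5% warning threshold, so it returns "ok". The last line shows what the 99.9% uptime target implies: roughly 43 minutes of downtime in a 30-day period.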

External Services

Service	Status Page	Purpose
Internet Computer	status.internetcomputer.org	Blockchain platform
PostHog	status.posthog.com	Analytics
SendGrid	status.sendgrid.com	Email delivery

Related Documentation

CI/CD Pipeline - Deployment and testing workflows
Architecture - System design overview
