Operations Documentation
This section contains operational documentation for the Hello World DAO LLC platform, including monitoring, runbooks, and incident response procedures.
<repo-root> placeholder: Runbooks in this section use <repo-root> as a placeholder for wherever you've cloned the Hello World DAO repos locally (e.g. ~/git or ~/code/hello-world). Substitute your own path. See Workspace Setup for the canonical clone script and recommended layout.
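As an illustration of the substitution, here is a minimal sketch that expands the <repo-root> placeholder in a runbook command. The helper name `resolve_repo_root` and the example command are hypothetical; only the placeholder itself and the ops-infra repo name come from this documentation.

```python
from pathlib import Path

PLACEHOLDER = "<repo-root>"


def resolve_repo_root(command: str, repo_root: str) -> str:
    """Replace every <repo-root> occurrence with the local clone directory.

    `repo_root` may start with "~", which is expanded to the user's home.
    """
    expanded = Path(repo_root).expanduser()
    return command.replace(PLACEHOLDER, str(expanded))


if __name__ == "__main__":
    # Hypothetical runbook command; substitute your own clone location.
    print(resolve_repo_root("cd <repo-root>/ops-infra", "~/git"))
```

Exporting the path once as a shell variable and using it in place of the placeholder works equally well; the point is only that each <repo-root> in a runbook maps to a single local directory.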
Quick Links
| Resource | Description |
|---|---|
| Canister Monitoring | Cycle balance monitoring and top-up procedures |
| Monitoring & Alerting | GitHub Actions cron + GlitchTip setup |
| Incident Response | General incident handling |
Runbooks
Detailed procedures for common operational tasks:
| Runbook | Purpose |
|---|---|
| Cycles Top-Up | Top up canister cycles when low |
| High Error Rate | Triage elevated error rates |
| Canister Unresponsive | Recover unresponsive canisters |
| Deployment Failure | Handle failed deployments and rollback |
| Database Connectivity | Troubleshoot database and service connectivity |
| Canister Production Activation | Bring a backend canister from staging-only to production-ready |
| Canister Wiring After Reinstall | Restore inter-canister config after any --mode reinstall |
Monitoring Stack
The stack is intentionally lightweight: a GitHub Actions cron job handles canister metrics and cycle top-ups, and GlitchTip handles application error tracking. There is no Prometheus, Alertmanager, or PagerDuty deployment.
┌──────────────────────────────────────────────────────────┐
│ Monitoring Infrastructure │
├──────────────────────────────────────────────────────────┤
│ Canister metrics + cycle top-ups │
│ └── ops-infra/.github/workflows/monitor-metrics.yml │
│ (cron every 6h — runs check-cycles.sh, top-up) │
│ └── IC Dashboard (https://dashboard.internetcomputer.org)│
├──────────────────────────────────────────────────────────┤
│ Application errors + source maps │
│ └── GlitchTip — https://glitchtip.founderyos.dev/4 │
│ (single project, per-suite tags) │
│ └── Sentry SDK in every suite (auth-token-uploaded maps)│
├──────────────────────────────────────────────────────────┤
│ Notifications │
│ └── GHA workflow failures → email (devops@...) │
│ └── GlitchTip issues → email (devops@...) │
└──────────────────────────────────────────────────────────┘
No Prometheus / Alertmanager / PagerDuty / Grafana. Earlier iterations of this doc described those, but they were never deployed. The stack above is what actually runs in 2026.
See Monitoring & Alerting for full setup, GlitchTip configuration, and the monitor-metrics.yml cron schedule.
Alert Severity Levels
| Severity | Response Time | Examples |
|---|---|---|
| Critical | < 15 minutes | Canister unresponsive, cycles depleted |
| Warning | < 1 hour | Low cycles, elevated error rate |
| Info | Next business day | No new members, high proposal volume |
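The response-time targets in the table can be expressed as a deadline calculation. The following is a minimal sketch, not part of the platform's tooling: the function names are hypothetical, timestamps are assumed to be timezone-aware, and "next business day" is assumed to mean the end of the next Monday-to-Friday day with no holiday calendar.

```python
from datetime import datetime, timedelta, timezone

# Response-time targets from the severity table. "Info" has no fixed
# duration, so it is handled separately below.
RESPONSE_WINDOWS = {
    "critical": timedelta(minutes=15),
    "warning": timedelta(hours=1),
}


def next_business_day(raised_at: datetime) -> datetime:
    """End of the next Mon-Fri day after raised_at (assumption: no holidays)."""
    day = raised_at + timedelta(days=1)
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day += timedelta(days=1)
    return day.replace(hour=23, minute=59, second=59, microsecond=0)


def response_deadline(severity: str, raised_at: datetime) -> datetime:
    """Latest time by which an alert of this severity should be picked up."""
    key = severity.lower()
    if key in RESPONSE_WINDOWS:
        return raised_at + RESPONSE_WINDOWS[key]
    if key == "info":
        return next_business_day(raised_at)
    raise ValueError(f"unknown severity: {severity!r}")


if __name__ == "__main__":
    raised = datetime(2026, 1, 2, 22, 30, tzinfo=timezone.utc)  # a Friday
    for level in ("Critical", "Warning", "Info"):
        print(level, response_deadline(level, raised).isoformat())
```

For an alert raised on a Friday evening, the Info deadline under these assumptions falls on the following Monday, while Critical and Warning deadlines are simple offsets from the alert time.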
On-Call Rotation
On-call engineers are the first responders for production incidents.
Responsibilities:
- Acknowledge alerts within 15 minutes
- Follow runbook procedures
- Escalate P1/P2 incidents to team lead
- Document incident resolution
Contact: See Incident Response Runbook for rotation schedule.
Maintenance Windows
Regular maintenance activities:
| Activity | Schedule | Duration |
|---|---|---|
| Cycles monitoring | Every 6 hours | Automated |
| Dashboard review | Weekly | 30 minutes |
| Runbook review | Monthly | 1 hour |
| DR drill | Quarterly | 2 hours |
Key Metrics
| Metric | Target | Alert Threshold |
|---|---|---|
| Cycles balance | > 1T | < 1T warning, < 500B critical |
| Error rate | < 1% | > 5% warning, > 10% critical |
| Response time | < 2s | > 5s warning |
| Uptime | 99.9% | < 99% critical |
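The alert thresholds above can be read as simple comparisons. The sketch below shows one way to classify a metric reading; the function names are hypothetical and it assumes "T" means 10^12 cycles and "B" means 10^9 cycles, error rates are given as fractions, and uptime is given as a percentage. It is not the logic of check-cycles.sh, which this page does not describe.

```python
from typing import Optional

T = 10**12  # assumed: 1T = one trillion cycles
B = 10**9   # assumed: 1B = one billion cycles


def classify_cycles(balance: int) -> Optional[str]:
    """< 500B is critical, < 1T is warning, otherwise no alert."""
    if balance < 500 * B:
        return "critical"
    if balance < 1 * T:
        return "warning"
    return None


def classify_error_rate(rate: float) -> Optional[str]:
    """rate is a fraction (0.05 == 5%). > 10% critical, > 5% warning."""
    if rate > 0.10:
        return "critical"
    if rate > 0.05:
        return "warning"
    return None


def classify_response_time(seconds: float) -> Optional[str]:
    """> 5s is a warning; the table defines no critical level."""
    return "warning" if seconds > 5 else None


def classify_uptime(percent: float) -> Optional[str]:
    """< 99% is critical; the table defines no warning level."""
    return "critical" if percent < 99.0 else None


if __name__ == "__main__":
    print(classify_cycles(750 * B))
    print(classify_error_rate(0.03))
    print(classify_response_time(6.2))
    print(classify_uptime(98.5))
```

Note that the targets and the alert thresholds differ for some metrics: under this reading, an error rate of 3% or an uptime of 99.5% misses the target without triggering an alert.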
External Services
| Service | Status Page | Purpose |
|---|---|---|
| Internet Computer | status.internetcomputer.org | Blockchain platform |
| PostHog | status.posthog.com | Analytics |
| SendGrid | status.sendgrid.com | Email delivery |
Related Documentation
- CI/CD Pipeline - Deployment and testing workflows
- Architecture - System design overview