Monitoring & Alerting

This section covers the monitoring infrastructure for the Hello World DAO LLC platform: canister cycle monitoring with automatic top-ups via GitHub Actions, and application error tracking via GlitchTip.

Overview

The monitoring stack is intentionally lightweight — there is no Prometheus, Alertmanager, Grafana, or PagerDuty. Two systems carry the load:

  • Canister metrics + cycle top-ups: A GitHub Actions cron (monitor-metrics.yml in ops-infra) runs every 6 hours, calls check-cycles.sh against the 12 backend canisters + 6 frontend asset canisters, and auto-tops-up any canister below threshold.
  • Application error tracking: GlitchTip (self-hosted Sentry-compatible service at glitchtip.founderyos.dev) captures runtime errors from every suite. Source maps are uploaded by CI on every release.

Earlier drafts of this page mentioned Prometheus rules, Alertmanager, Grafana dashboards, and Slack/PagerDuty routing. None of those are deployed. If you see references to them in older docs, they are aspirational, not current.

Resource                   | Description
GlitchTip                  | Application errors (single project, per-suite tags)
IC Dashboard               | Internet Computer canister + subnet status
GitHub Actions (ops-infra) | monitor-metrics.yml cron + manual triggers
Canister Cycle Monitoring  | Canonical canister inventory + cycle budgets

Monitoring Architecture

┌──────────────┐    ┌──────────────────────┐    ┌──────────────────────┐
│  Canisters   │───▶│  GitHub Actions cron │───▶│  Workflow summary    │
│  (IC mainnet)│    │  monitor-metrics.yml │    │  + auto top-up       │
└──────────────┘    │  (every 6h)          │    └──────────┬───────────┘
                    └──────────────────────┘               │
                                                           ▼
                                                   Email on failure
                                              (devops@helloworlddao.com)

┌──────────────┐    ┌──────────────────────┐    ┌──────────────────────┐
│  6 Suites    │───▶│  Sentry SDK          │───▶│  GlitchTip           │
│  (browser)   │    │  (per-suite DSN tag) │    │  glitchtip.          │
└──────────────┘    └──────────────────────┘    │  founderyos.dev/4    │
                                                └──────────┬───────────┘
                                                           │
                                                           ▼
                                                  Email on new issue

Canister Metrics — GitHub Actions Cron

Workflow

ops-infra/.github/workflows/monitor-metrics.yml

  • Schedule: every 6 hours (0 */6 * * *)
  • Trigger (manual): gh workflow run monitor-metrics.yml --repo Hello-World-Co-Op/ops-infra
  • Identity: github-ci dfx identity (PEM in DFX_IDENTITY_PEM secret)
  • Action: runs check-cycles.sh against the canister fleet and auto-tops-up any canister below the threshold (defaults: 100B-cycle threshold, 0.05 ICP per top-up)

Thresholds

Threshold | Cycles          | Action
Critical  | < 100B (0.1 TC) | Auto top-up + workflow exits non-zero
Warning   | < 500B (0.5 TC) | Logged in workflow summary; manual review
Healthy   | > 1 TC          | No action
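As a rough illustration of how the thresholds above partition a balance, here is a minimal TypeScript sketch. The names classifyCycles, CRITICAL_BELOW, and WARNING_BELOW are hypothetical; the real check lives in check-cycles.sh, which this does not reproduce. The table does not name balances between 500B and 1 TC, so the sketch assumes they count as healthy.

```typescript
// Thresholds from the table, in raw cycles (1 TC = 1e12 cycles).
// Plain numbers are safe here: these values are far below Number.MAX_SAFE_INTEGER.
const CRITICAL_BELOW = 100e9; // 100B = 0.1 TC
const WARNING_BELOW = 500e9;  // 500B = 0.5 TC

type CycleStatus = 'critical' | 'warning' | 'healthy';

// Assumption: anything at or above 500B is treated as healthy, because the
// table lists no action for the 500B to 1 TC range.
function classifyCycles(balance: number): CycleStatus {
  if (balance < CRITICAL_BELOW) return 'critical'; // auto top-up, non-zero exit
  if (balance < WARNING_BELOW) return 'warning';   // logged for manual review
  return 'healthy';                                // no action
}

// Only the critical level triggers an automatic top-up.
function needsTopUp(balance: number): boolean {
  return classifyCycles(balance) === 'critical';
}
```

Note that the comparisons are strict, so a balance of exactly 100B falls into the warning level rather than critical.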

Reading workflow output

Each run produces a markdown summary (visible on the workflow run page) listing per-canister balances and any top-ups performed. If the workflow fails, an email goes to devops@helloworlddao.com.
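To make the shape of such a summary concrete, the sketch below renders a per-canister markdown table. It is an assumption about what the output might look like, not a copy of what monitor-metrics.yml emits; the CanisterReport type, the renderSummary function, and the column names are all hypothetical.

```typescript
// Hypothetical record for one canister in a run.
interface CanisterReport {
  name: string;      // canister label, e.g. as listed in the inventory
  cycles: number;    // balance in raw cycles
  toppedUp: boolean; // whether this run performed a top-up
}

// Format raw cycles as trillions of cycles (TC) with three decimals.
function formatTC(cycles: number): string {
  return (cycles / 1e12).toFixed(3) + ' TC';
}

// Build a markdown table with one row per canister.
function renderSummary(reports: CanisterReport[]): string {
  const lines = [
    '| Canister | Balance | Topped up |',
    '| --- | --- | --- |',
  ];
  for (const r of reports) {
    lines.push(`| ${r.name} | ${formatTC(r.cycles)} | ${r.toppedUp ? 'yes' : 'no'} |`);
  }
  return lines.join('\n');
}
```

In GitHub Actions, markdown appended to the file named by the GITHUB_STEP_SUMMARY environment variable is rendered on the workflow run page, which is one way a string like this could be surfaced.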

For deep-dive procedures see:

  • Canister Cycle Monitoring — canonical inventory + standalone check script
  • Automated Cycles Top-Up System — full GHA cron walkthrough at operations/automated-cycles-topup.md in the repo (excluded from rendered site)
  • Cycles Top-Up runbook — what to do when an alert fires

Application Errors — GlitchTip

Project

  • Endpoint: https://glitchtip.founderyos.dev
  • Project ID: 4 (single project, all suites tagged)
  • DSN format: https://<key>@glitchtip.founderyos.dev/4 (UUID dashes stripped — Sentry SDK rejects dashes)
  • Auth token (CI source-map upload): SENTRY_AUTH_TOKEN GH secret in each suite repo

What gets sent

Suite            | Tag              | Source-map upload
dao-suite        | suite=dao        | On release in CI
dao-admin-suite  | suite=dao-admin  | On release in CI
governance-suite | suite=governance | On release in CI
marketing-suite  | suite=marketing  | On release in CI
otter-camp-suite | suite=otter-camp | On release in CI
think-tank-suite | suite=think-tank | On release in CI

Notification

GlitchTip emails devops@helloworlddao.com on the first occurrence of a new issue and again on regression. There is no Slack/PagerDuty hook.

Common pitfalls

  • DSN with dashes: normalizeDsn() MUST strip dashes from the UUID key. The Sentry SDK silently rejects dashed UUIDs.
  • Celery worker outages: GlitchTip uses Celery; if events stop arriving, check the worker on the FOS cluster (Graydon owns this, but symptoms surface here).
  • Source maps missing: verify that the suite's release CI uploaded them, and check that the release-please PR was actually released.
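The page names normalizeDsn() but does not show it. The following is a minimal sketch of what such a helper could look like, assuming the DSN has the shape https://<key>@<host>/<project> given under Project; the suites' actual implementation may differ. Only the key portion is touched, so a hostname that happens to contain dashes is left alone.

```typescript
// Sketch of a DSN normalizer (hypothetical implementation, not the suites' own code).
// Strips dashes from the key portion of a DSN and leaves everything else as-is.
function normalizeDsn(dsn: string | undefined): string | undefined {
  if (!dsn) return dsn; // unset env var: pass through unchanged

  // scheme://key@rest  ->  capture the three parts separately
  const match = /^(https?:\/\/)([^@\/]+)@(.+)$/.exec(dsn);
  if (!match) return dsn; // not a shape we recognize; do not guess

  const scheme = match[1];
  const key = match[2].replace(/-/g, ''); // dashed UUID -> 32 hex characters
  const rest = match[3];
  return `${scheme}${key}@${rest}`;
}
```

For example, with a made-up key, https://123e4567-e89b-12d3-a456-426614174000@glitchtip.founderyos.dev/4 becomes https://123e4567e89b12d3a456426614174000@glitchtip.founderyos.dev/4. An already-normalized DSN passes through unchanged, so applying the helper twice is harmless.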

Alert Thresholds (canister + UX)

Alert           | Source              | Threshold        | Severity | Response Time
Low cycles      | monitor-metrics.yml | < 1T cycles      | Warning  | < 1 hour
Critical cycles | monitor-metrics.yml | < 500B cycles    | Critical | < 15 minutes
New JS error    | GlitchTip           | first occurrence | Info     | Triage same day
Error spike     | GlitchTip           | > 10x baseline   | Warning  | Triage same day

Runbooks

For operational procedures, see:

Topic           | Runbook
Low cycles      | Cycles Top-Up Procedure
High errors     | High Error Rate Triage
Canister down   | Canister Unresponsive Recovery
Failed deploy   | Deployment Failure Recovery
Database issues | Database Connectivity

Setup Guide

1. GitHub Secrets

Add these to the ops-infra repository (Settings → Secrets and variables → Actions):

Secret           | Purpose
DFX_IDENTITY_PEM | dfx identity for canister status checks + top-ups
IC_PRINCIPAL     | Identity principal (info only)

Each suite repo also needs:

Secret                           | Purpose
SENTRY_AUTH_TOKEN                | GlitchTip auth token (scopes: project:releases, org:read) for CI source-map upload
SENTRY_DSN (env var, not secret) | Per-suite DSN; public, exposed in .env.staging / .env.production

2. Wire suite Sentry SDK

ts
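// In each suite's main.ts
import * as Sentry from '@sentry/react';
// normalizeDsn is a project helper, not part of the Sentry SDK.
// The import path below is a placeholder; point it at wherever the suite defines the helper.
import { normalizeDsn } from './normalizeDsn';

Sentry.init({
  dsn: normalizeDsn(import.meta.env.VITE_SENTRY_DSN),  // strips dashes from the UUID key
  environment: import.meta.env.MODE,                   // Vite build mode
  release: import.meta.env.VITE_RELEASE_VERSION,       // release the CI source-map upload is tied to
  initialScope: { tags: { suite: 'dao-admin' } },      // per-suite tag; use this suite's value from "What gets sent"
});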

3. Confirm cron + alerts

  • After enabling monitor-metrics.yml, manually trigger one run and confirm that the summary lists all 18 canisters (12 backend + 6 frontend asset).
  • Trigger a synthetic Sentry error from a deployed suite and confirm it appears in GlitchTip within ~30 seconds.
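One possible way to raise the synthetic error is sketched below. The helper name sendSyntheticError and the message format are hypothetical. The ErrorReporter interface is a minimal structural stand-in for the one Sentry call used, captureException, which returns the event ID as a string; defining it locally keeps the sketch free of any SDK import.

```typescript
// Minimal structural type covering the single SDK call this sketch needs.
// The Sentry namespace from '@sentry/react' satisfies it.
interface ErrorReporter {
  captureException(error: unknown): string; // returns the event ID
}

// Send one clearly labelled test error so it is easy to find and resolve in GlitchTip.
function sendSyntheticError(reporter: ErrorReporter, suite: string): string {
  const stamp = new Date().toISOString();
  const error = new Error(`synthetic monitoring check: ${suite} at ${stamp}`);
  return reporter.captureException(error);
}
```

In a deployed suite this could be invoked once as sendSyntheticError(Sentry, 'dao-admin'), for example from a temporary debug hook, and the returned event ID used to look the event up in GlitchTip. Remove the hook afterwards so the test error does not recur.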

Troubleshooting

Cycle workflow fails

  1. Check the workflow run page for the dfx command that failed.
  2. Common causes: DFX_IDENTITY_PEM rotated, ICP wallet empty, IC subnet outage (check IC dashboard).
  3. Re-run from the Actions tab once the underlying issue is fixed.

GlitchTip stops receiving events

  1. Open https://glitchtip.founderyos.dev — verify the UI loads.
  2. Check the FOS k8s cluster's GlitchTip Celery worker (kubectl -n hello-world get pods -l app=glitchtip-worker).
  3. Verify the suite is sending — open browser devtools, trigger an error, check Network tab for a /api/<id>/store/ POST.
  4. If POSTs are rejected with 400, the DSN UUID likely has dashes — verify normalizeDsn() is applied.
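Steps 3 and 4 can be partly checked offline by inspecting the DSN a suite was built with. The sketch below assumes the DSN shape documented under Project and derives the /api/<id>/store/ path mentioned in step 3; the inspectDsn function and DsnCheck type are hypothetical names introduced for illustration.

```typescript
// Result of inspecting a DSN (hypothetical shape, for illustration only).
interface DsnCheck {
  projectId: string;     // trailing numeric path segment of the DSN
  storePath: string;     // path to look for in the browser Network tab
  keyHasDashes: boolean; // true means normalizeDsn() was probably not applied
}

// Parse https://<key>@<host>/<projectId>; return null if the input does not match.
function inspectDsn(dsn: string): DsnCheck | null {
  const match = /^https?:\/\/([^@\/]+)@[^\/]+\/(\d+)$/.exec(dsn);
  if (!match) return null;

  const key = match[1];
  const projectId = match[2];
  return {
    projectId: projectId,
    storePath: `/api/${projectId}/store/`,
    keyHasDashes: key.indexOf('-') !== -1,
  };
}
```

If keyHasDashes comes back true for the value in a suite's .env file, that points at step 4 before any network debugging is needed.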

"Canister not in monitoring"

  1. Open ops-infra/scripts/check-cycles.sh.
  2. Confirm the canister ID is in the CANISTERS=() array.
  3. If missing, add it (one entry per active canister) and PR the change.
