Watchdog API
The watchdog monitors agent gateway processes for health and automatically recovers from failures. It detects crash loops, performs auto-repair, and sends notifications via Telegram and Discord.How It Works
Configuration
| Variable | Default | Description |
|---|---|---|
WATCHDOG_CHECK_INTERVAL | 120 | Health check interval in seconds |
WATCHDOG_DEGRADED_CHECK_INTERVAL | 5 | Retry interval when degraded (seconds) |
WATCHDOG_MAX_REPAIR_ATTEMPTS | 2 | Max auto-repair attempts before crash-loop |
WATCHDOG_CRASH_LOOP_WINDOW | 300 | Window for crash-loop detection (seconds) |
WATCHDOG_CRASH_LOOP_THRESHOLD | 3 | Crashes in window to trigger crash-loop |
WATCHDOG_AUTO_REPAIR | true | Enable/disable auto-repair |
TELEGRAM_BOT_TOKEN | — | Telegram bot token for notifications |
TELEGRAM_ADMIN_CHAT_ID | — | Chat ID for Telegram notifications |
DISCORD_WEBHOOK_URL | — | Discord webhook URL for notifications |
Lifecycle States
| State | Description |
|---|---|
stopped | Gateway is not running |
starting | Gateway process started, waiting for health check |
running | Gateway is healthy |
degraded | Health check failed, retrying |
crash_loop | Too many crashes, auto-repair exhausted |
repairing | Auto-repair in progress |
Notifications
The watchdog sends notifications for:- Crash detected — Gateway process exited unexpectedly
- Crash loop — 3+ crashes in 5-minute window
- Auto-repair started — Attempting to restart the gateway
- Auto-repair succeeded — Gateway is healthy again
- Auto-repair failed — Could not recover, needs manual intervention
Telegram Format
Discord Format
Embedded message with color coding:- 🔴 Red: crashes, failures
- 🟡 Yellow: repairing, degraded
- 🟢 Green: recovered, healthy