Health API
Monitor system health and configure heartbeat schedules.
Health check (web)
No authentication required. Returns system health metrics including CPU and memory usage.
The backend service exposes its own health check at GET /health (without the /api prefix). The web and backend health endpoints are independent: the web endpoint reports on the web application process, while the backend endpoint reports on the API service. See the backend health check below for details.
Response
```json
{
  "status": "ok",
  "health": "healthy",
  "timestamp": "2026-03-19T00:00:00Z",
  "cpu": {
    "usage": 15.3,
    "cores": 4
  },
  "memory": {
    "usage": 42.1,
    "total": 8589934592,
    "used": 3617054720,
    "free": 4972879872
  },
  "uptime": 86400
}
```
The health field reflects overall system status:
| Value | Condition |
|---|---|
| healthy | CPU and memory usage both at or below 70% |
| degraded | CPU or memory usage above 70%, but neither above 85% |
| unhealthy | CPU or memory usage above 85% |
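The thresholds above can be sketched as a small classifier. This is a hypothetical re-implementation for illustration; `classifyHealth` is not part of the API surface:

```typescript
type Health = "healthy" | "degraded" | "unhealthy";

// Classify overall health from CPU and memory usage percentages,
// following the documented 70% / 85% cutoffs.
function classifyHealth(cpuUsage: number, memoryUsage: number): Health {
  const peak = Math.max(cpuUsage, memoryUsage);
  if (peak > 85) return "unhealthy";
  if (peak > 70) return "degraded";
  return "healthy";
}
```

Note that 70% exactly still counts as healthy ("at or below 70%"), which is why the comparisons are strict.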
Degraded and unhealthy responses
When the system is degraded or unhealthy, the endpoint still returns HTTP 200 with the health field set to degraded or unhealthy. The status field remains ok.
```json
{
  "status": "ok",
  "health": "unhealthy",
  "timestamp": "2026-03-19T00:00:00Z",
  "cpu": { "usage": 92.5, "cores": 4 },
  "memory": { "usage": 88.0, "total": 8589934592, "used": 7558529024, "free": 1031405568 },
  "uptime": 86400
}
```
Error response
An HTTP 500 is returned only when an unexpected error occurs while collecting health metrics, not for degraded or unhealthy status:
```json
{
  "status": "error",
  "health": "unhealthy",
  "timestamp": "2026-03-19T00:00:00Z"
}
```
| Code | Description |
|---|---|
| 200 | Health check succeeded. Check the health field for healthy, degraded, or unhealthy. |
| 500 | Unexpected error collecting health metrics. |
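A monitoring client therefore has to distinguish "metric collection failed" (HTTP 500 or status: error) from "metrics collected, system degraded" (HTTP 200 with the health field). A hypothetical sketch of that branching, with an assumed response shape based on the examples above:

```typescript
interface HealthBody {
  status: string;   // "ok" or "error"
  health?: string;  // "healthy" | "degraded" | "unhealthy"
}

// Interpret a web health response: a 500 (or status "error") means the
// check itself failed; otherwise trust the reported health field.
function interpretHealth(httpStatus: number, body: HealthBody): string {
  if (httpStatus === 500 || body.status === "error") return "collection-error";
  return body.health ?? "unknown";
}
```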
Backend health check
No authentication required. Returns backend service status including Docker availability. This endpoint is served by the backend API service (without the /api prefix).
The backend API continues to serve non-Docker endpoints (health, metrics, auth, AI, registration) even when Docker is not available. Container provisioning and lifecycle operations are disabled until Docker becomes available.
Response
```json
{
  "status": "ok",
  "timestamp": "2026-03-19T00:00:00Z",
  "docker": "available",
  "provisioning": "enabled"
}
```
| Field | Type | Description |
|---|---|---|
| status | string | Always ok when the backend is running |
| timestamp | string | ISO 8601 timestamp of the health check |
| docker | string | Docker daemon availability: available when Docker is reachable, unavailable otherwise |
| provisioning | string | Container provisioning capability: enabled when Docker is available, disabled otherwise |
Response when Docker is unavailable
When Docker is not available on the host, the health endpoint still returns HTTP 200 but reports degraded capabilities:
```json
{
  "status": "ok",
  "timestamp": "2026-03-19T00:00:00Z",
  "docker": "unavailable",
  "provisioning": "disabled"
}
```
When provisioning is disabled, any request to a container-dependent endpoint (such as deploying, starting, stopping, or restarting an agent) returns a 500 error. Non-container endpoints continue to operate normally.
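The gating described above can be sketched as a route guard. This is an illustrative sketch, not the backend's actual middleware; the error payload shape is an assumption:

```typescript
interface BackendHealth {
  docker: "available" | "unavailable";
  provisioning: "enabled" | "disabled";
}

// Reject container-dependent requests while provisioning is disabled,
// mirroring the documented 500 behavior; non-container routes bypass this.
function guardContainerRoute(health: BackendHealth): { status: number; error?: string } {
  if (health.provisioning === "disabled") {
    return { status: 500, error: "Docker unavailable; container operations are disabled" };
  }
  return { status: 200 };
}
```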
Get heartbeat settings
Requires session authentication.
Response
```json
{
  "heartbeat": {
    "frequency": "3h",
    "enabled": true,
    "lastHeartbeat": "2026-03-19T00:00:00Z",
    "nextHeartbeat": "2026-03-19T03:00:00Z"
  },
  "message": "Heartbeat scheduling database integration pending"
}
```
Update heartbeat settings
Requires session authentication.
Request body
| Field | Type | Required | Description |
|---|---|---|---|
| frequency | string | No | Heartbeat interval (for example, 3h) |
| enabled | boolean | No | Enable or disable heartbeats |
Response
```json
{
  "frequency": "3h",
  "enabled": true,
  "lastUpdated": "2026-03-19T00:00:00Z",
  "lastHeartbeat": "2026-03-19T00:00:00Z",
  "nextHeartbeat": "2026-03-19T03:00:00Z",
  "message": "Heartbeat settings will persist to database once integration is complete"
}
```
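The relationship between frequency, lastHeartbeat, and nextHeartbeat can be sketched as below. The docs only show the 3h example, so the m/h/d unit grammar is an assumption; parseFrequencyMs and nextHeartbeat are hypothetical helpers:

```typescript
// Convert a frequency string such as "3h" into milliseconds.
// Assumed units: m = minutes, h = hours, d = days.
function parseFrequencyMs(freq: string): number {
  const match = /^(\d+)([mhd])$/.exec(freq);
  if (!match) throw new Error(`unsupported frequency: ${freq}`);
  const unitMs = { m: 60_000, h: 3_600_000, d: 86_400_000 }[match[2] as "m" | "h" | "d"];
  return Number(match[1]) * unitMs;
}

// nextHeartbeat = lastHeartbeat + frequency, as in the example response.
function nextHeartbeat(last: string, freq: string): string {
  return new Date(Date.parse(last) + parseFrequencyMs(freq)).toISOString();
}
```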
Delete heartbeat settings
Requires session authentication. Resets heartbeat configuration to defaults.
Response
```json
{
  "success": true,
  "message": "Heartbeat settings reset - database cleanup will occur once integration is complete"
}
```
Container health checks
Agent containers run the official OpenClaw image, which exposes built-in health endpoints on port 18789. The backend uses these to determine container readiness during provisioning and ongoing monitoring.
Built-in health endpoints
The OpenClaw image (ghcr.io/openclaw/openclaw:2026.3.13) provides two health endpoints on each agent container:
| Endpoint | Purpose | Description |
|---|---|---|
| GET /healthz | Liveness | Returns 200 when the gateway process is running. Used by Docker’s HEALTHCHECK to detect crashed or hung containers. |
| GET /readyz | Readiness | Returns 200 when the gateway is ready to accept requests. Use this to verify the container has completed startup before routing traffic. |
Both endpoints are unauthenticated and bind to the container’s internal port (18789).
/healthz response
```json
{
  "ok": true,
  "status": "live"
}
```
| Field | Type | Description |
|---|---|---|
| ok | boolean | true when the gateway process is running |
| status | string | Always live when the endpoint responds |
/readyz response
```json
{
  "ready": true,
  "failing": [],
  "uptimeMs": 68163
}
```
| Field | Type | Description |
|---|---|---|
| ready | boolean | true when the gateway is ready to accept requests |
| failing | array | List of failing readiness checks. Empty when all checks pass. |
| uptimeMs | number | Gateway uptime in milliseconds since startup |
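A caller gating traffic on readiness should check both fields: ready must be true and failing must be empty. A minimal sketch (isRoutable is a hypothetical helper, not part of the image):

```typescript
interface ReadyzBody {
  ready: boolean;
  failing: string[];
  uptimeMs: number;
}

// Route traffic to the container only when the gateway reports ready
// and no readiness checks are failing.
function isRoutable(body: ReadyzBody): boolean {
  return body.ready && body.failing.length === 0;
}
```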
The backend also probes /health on port 18789 for application-level health checks. The /healthz and /readyz endpoints are provided by the OpenClaw image itself and are available on all agent containers.
Container health statuses
| Status | Condition |
|---|---|
| healthy | Container is running and the internal health endpoint responds successfully |
| starting | Container is running but the health endpoint is not yet responding after all retries |
| stopped | Container has exited |
| unhealthy | Container is in an unexpected state or cannot be inspected |
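One plausible way to derive these statuses from Docker's inspect output is sketched below. The field names follow Docker's State.Status / State.Health shape, but the exact mapping the backend uses is an assumption and is simplified here:

```typescript
interface InspectState {
  Status: string;                    // "running", "exited", ...
  Health?: { Status: string };       // "healthy", "starting", "unhealthy"
}

// Map a container's inspect state onto the documented health statuses.
// A null state models "cannot be inspected".
function containerHealth(state: InspectState | null): string {
  if (!state) return "unhealthy";                       // inspection failed
  if (state.Status === "exited") return "stopped";
  if (state.Status !== "running") return "unhealthy";   // unexpected state
  if (state.Health?.Status === "healthy") return "healthy";
  return "starting";                                    // running, probe not passing yet
}
```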
Health check behavior
- Docker runs a HEALTHCHECK against http://127.0.0.1:18789/healthz every 30 seconds with a 5-second timeout, a 20-second start period, and 5 retries before marking the container as unhealthy. The health probe uses node -e "fetch(...)" instead of curl because the official OpenClaw image does not include curl.
- The waitForHealthy function polls container health every 2 seconds, with a default overall timeout of 60 seconds.
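The polling behavior can be sketched as follows. This is an illustrative re-implementation, not the backend's actual waitForHealthy; the probe is injected so the loop stays self-contained:

```typescript
// Poll an injected health probe until it succeeds or the timeout elapses.
// Defaults mirror the documented behavior: 2 s interval, 60 s overall timeout.
async function waitForHealthy(
  probe: () => Promise<boolean>,
  { intervalMs = 2_000, timeoutMs = 60_000 } = {},
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await probe()) return true;                       // container is healthy
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false;                                           // timed out
}
```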
Watchdog monitoring
The backend runs a per-agent watchdog service that continuously monitors agent health, detects crash loops, and performs automatic recovery. The watchdog operates internally and does not expose dedicated API endpoints. Status information is surfaced through the existing agent status and lifecycle endpoints.
Health check cycle
The watchdog probes each agent’s gateway at GET /healthz on the agent’s internal port. Health checks run on a configurable interval (default: every 2 minutes). When the gateway reports unhealthy, the watchdog transitions the agent to a degraded state and increases the check frequency to every 5 seconds.
| Parameter | Default | Environment variable |
|---|---|---|
| Health check interval | 120 seconds | WATCHDOG_CHECK_INTERVAL |
| Degraded check interval | 5 seconds | WATCHDOG_DEGRADED_CHECK_INTERVAL |
| Startup failure threshold | 3 consecutive failures | WATCHDOG_STARTUP_FAILURE_THRESHOLD |
| Max repair attempts | 2 | WATCHDOG_MAX_REPAIR_ATTEMPTS |
| Crash loop window | 5 minutes | WATCHDOG_CRASH_LOOP_WINDOW |
| Crash loop threshold | 3 crashes in window | WATCHDOG_CRASH_LOOP_THRESHOLD |
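Resolving these defaults from the environment might look like the sketch below. The variable names come from the table; the assumption that the interval and window variables are expressed in seconds is mine:

```typescript
// Build the watchdog configuration from environment variables,
// falling back to the documented defaults.
function watchdogConfig(env: Record<string, string | undefined>) {
  const num = (key: string, fallback: number) =>
    env[key] !== undefined ? Number(env[key]) : fallback;
  return {
    checkIntervalMs: num("WATCHDOG_CHECK_INTERVAL", 120) * 1000,
    degradedCheckIntervalMs: num("WATCHDOG_DEGRADED_CHECK_INTERVAL", 5) * 1000,
    startupFailureThreshold: num("WATCHDOG_STARTUP_FAILURE_THRESHOLD", 3),
    maxRepairAttempts: num("WATCHDOG_MAX_REPAIR_ATTEMPTS", 2),
    crashLoopWindowMs: num("WATCHDOG_CRASH_LOOP_WINDOW", 300) * 1000,
    crashLoopThreshold: num("WATCHDOG_CRASH_LOOP_THRESHOLD", 3),
  };
}
```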
Lifecycle states
The watchdog tracks the following lifecycle states for each agent:
| State | Description |
|---|
stopped | Agent is not running |
starting | Agent container has started; waiting for the first successful health check |
running | Agent is healthy and serving requests |
degraded | Health checks are failing after a previous healthy state |
crash_loop | Multiple crashes detected within the crash loop window |
repairing | Auto-repair is in progress |
Auto-repair
When the watchdog detects an unhealthy agent, it can automatically attempt recovery. Auto-repair is enabled by default and can be disabled by setting the WATCHDOG_AUTO_REPAIR environment variable to false.
The repair sequence is:
- Kill the agent gateway process
- Wait 5 seconds
- Restart the gateway
- Wait 30 seconds (startup grace period)
- Verify health
If the repair fails, the watchdog retries up to the configured maximum (default: 2 attempts). After exhausting all repair attempts, the agent transitions to the crash_loop state.
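The sequence and retry logic above can be sketched with the individual steps injected, so the control flow is visible without real containers. Step names here are illustrative, not the watchdog's actual internals:

```typescript
interface RepairSteps {
  killGateway: () => Promise<void>;
  restartGateway: () => Promise<void>;
  isHealthy: () => Promise<boolean>;
  sleep: (ms: number) => Promise<void>;
}

// Run the documented repair sequence up to maxAttempts times; after
// exhausting all attempts the agent transitions to crash_loop.
async function attemptRepair(
  steps: RepairSteps,
  maxAttempts = 2,
): Promise<"running" | "crash_loop"> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    await steps.killGateway();
    await steps.sleep(5_000);        // settle before restarting
    await steps.restartGateway();
    await steps.sleep(30_000);       // startup grace period
    if (await steps.isHealthy()) return "running";
  }
  return "crash_loop";
}
```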
Crash loop detection
The watchdog tracks crash timestamps within a sliding window (default: 5 minutes). When the number of crashes in the window reaches the threshold (default: 3), the agent enters the crash_loop state. This prevents infinite restart loops for agents with persistent failures.
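A minimal sketch of that sliding-window bookkeeping, assuming the documented defaults (CrashTracker is a hypothetical name):

```typescript
// Track crash timestamps in a sliding window; recordCrash returns true
// when the count within the window reaches the crash-loop threshold.
class CrashTracker {
  private crashes: number[] = [];
  constructor(
    private windowMs = 300_000,   // 5-minute window
    private threshold = 3,        // crashes needed to enter crash_loop
  ) {}

  recordCrash(now: number): boolean {
    this.crashes.push(now);
    this.crashes = this.crashes.filter((t) => now - t < this.windowMs);
    return this.crashes.length >= this.threshold;
  }
}
```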
Notifications
The watchdog sends notifications for critical events (degraded, crash loop, repair attempts) through configured channels:
- Telegram — when TELEGRAM_BOT_TOKEN and TELEGRAM_ADMIN_CHAT_ID are set
- Discord — when DISCORD_WEBHOOK_URL is set
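The channel selection rule can be sketched as a pure function of the environment (activeChannels is a hypothetical helper):

```typescript
// A notification channel is active only when all of its required
// environment variables are present and non-empty.
function activeChannels(env: Record<string, string | undefined>): string[] {
  const channels: string[] = [];
  if (env.TELEGRAM_BOT_TOKEN && env.TELEGRAM_ADMIN_CHAT_ID) channels.push("telegram");
  if (env.DISCORD_WEBHOOK_URL) channels.push("discord");
  return channels;
}
```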