Architecture¶

How OpenClaw observability works — both the official plugin and custom hook-based approach.

Overview: Two Approaches¶

┌─────────────────────────────────────────────────────────────────────┐
│                        OpenClaw Gateway                             │
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                      Agent Execution                         │   │
│  │  message_received → before_agent_start → tool_calls →       │   │
│  │                     tool_result_persist → agent_end          │   │
│  └──────────────────────────┬──────────────────────────────────┘   │
│                              │                                      │
│         ┌────────────────────┼────────────────────┐                │
│         │                    │                    │                │
│         ▼                    ▼                    ▼                │
│  ┌─────────────┐    ┌───────────────┐    ┌─────────────────┐      │
│  │  Diagnostic │    │  Typed Hooks  │    │   Log Output    │      │
│  │   Events    │    │  (api.on())   │    │                 │      │
│  │ (model.usage│    │               │    │                 │      │
│  │  message.*) │    │               │    │                 │      │
│  └──────┬──────┘    └───────┬───────┘    └────────┬────────┘      │
│         │                   │                     │                │
│         ▼                   ▼                     ▼                │
│  ┌─────────────┐    ┌─────────────────┐   ┌──────────────┐        │
│  │  OFFICIAL   │    │     CUSTOM      │   │ Log Forward  │        │
│  │   PLUGIN    │    │     PLUGIN      │   │ (via official│        │
│  │ diagnostics │    │ otel-observ...  │   │   plugin)    │        │
│  │    -otel    │    │                 │   │              │        │
│  └──────┬──────┘    └───────┬─────────┘   └──────┬───────┘        │
│         │                   │                    │                 │
│         └───────────────────┼────────────────────┘                 │
│                             ▼                                      │
│                   ┌─────────────────┐                              │
│                   │ OTLP Exporters  │                              │
│                   │ (HTTP/protobuf) │                              │
│                   └────────┬────────┘                              │
└────────────────────────────┼────────────────────────────────────────┘
                             │
                             ▼
                   ┌─────────────────┐
                   │  OTLP Endpoint  │
                   │ (Collector or   │
                   │  Direct Ingest) │
                   └─────────────────┘

Approach 1: Official Plugin (diagnostics-otel)¶

How It Works¶

The official plugin uses the diagnostic event bus — a publish-subscribe system where the Gateway emits events and plugins consume them.

Gateway Core                    diagnostics-otel Plugin
     │                                   │
     │  emit("model.usage", {...})       │
     │ ─────────────────────────────────>│
     │                                   │  ──> create span
     │                                   │  ──> update counters
     │                                   │  ──> record histogram
     │                                   │
     │  emit("message.processed", {...}) │
     │ ─────────────────────────────────>│
     │                                   │  ──> create span
     │                                   │  ──> update counters

Diagnostic Events¶

Event	When Emitted	Data Included
`model.usage`	After LLM call	tokens, cost, model, duration
`webhook.received`	HTTP request arrives	channel, type
`webhook.processed`	Handler completes	duration, chatId
`webhook.error`	Handler fails	error message
`message.queued`	Added to queue	channel, source, depth
`message.processed`	Processing done	outcome, duration
`queue.lane.enqueue`	Lane add	lane, size
`queue.lane.dequeue`	Lane remove	lane, size, wait time
`session.state`	State change	state, reason
`session.stuck`	Stuck detected	age, queue depth

OTel Signals Created¶

Metrics:

openclaw.tokens{type="input|output|cache_read|cache_write"}
openclaw.cost.usd
openclaw.run.duration_ms
openclaw.context.tokens{type="limit|used"}
openclaw.webhook.received
openclaw.webhook.error
openclaw.webhook.duration_ms
openclaw.message.queued
openclaw.message.processed
openclaw.message.duration_ms
openclaw.queue.depth
openclaw.queue.wait_ms
openclaw.session.state
openclaw.session.stuck
openclaw.session.stuck_age_ms

Traces: - openclaw.model.usage — Per LLM call span - openclaw.webhook.processed — Per webhook span - openclaw.webhook.error — Error span (with status=ERROR) - openclaw.message.processed — Per message span - openclaw.session.stuck — Stuck detection span

Logs: - All Gateway logs as OTel LogRecords - Includes severity, subsystem, code location

Approach 2: Custom Hook-Based Plugin¶

How It Works¶

The custom plugin uses typed plugin hooks — direct callbacks into the agent lifecycle.

Gateway Agent Loop              Custom Plugin
     │                               │
     │  on("message_received")       │
     │ ─────────────────────────────>│  ──> create ROOT span
     │                               │      store in sessionContextMap
     │                               │
     │  on("before_agent_start")     │
     │ ─────────────────────────────>│  ──> create AGENT TURN span
     │                               │      (child of root)
     │                               │
     │  on("tool_result_persist")    │
     │ ─────────────────────────────>│  ──> create TOOL span
     │  (called for each tool)       │      (child of agent turn)
     │                               │
     │  on("agent_end")              │
     │ ─────────────────────────────>│  ──> end agent turn span
     │                               │      end root span
     │                               │      extract tokens from messages

Trace Context Propagation¶

The key difference is trace context propagation. The custom plugin maintains a session-to-context map:

interface SessionTraceContext {
  rootSpan: Span;           // openclaw.request
  rootContext: Context;     // OTel context with root span
  agentSpan?: Span;         // openclaw.agent.turn
  agentContext?: Context;   // OTel context with agent span
  startTime: number;
}

const sessionContextMap = new Map<string, SessionTraceContext>();

When creating child spans, it uses the stored context:

// Tool span becomes child of agent turn
const span = tracer.startSpan(
  `tool.${toolName}`,
  { kind: SpanKind.INTERNAL },
  sessionCtx.agentContext  // <-- parent context
);

Resulting Trace Structure¶

openclaw.request (root)
│   openclaw.session.key: "main@whatsapp:+123..."
│   openclaw.message.channel: "whatsapp"
│   openclaw.request.duration_ms: 4523
│
└── openclaw.agent.turn (child)
    │   gen_ai.usage.input_tokens: 1234
    │   gen_ai.usage.output_tokens: 567
    │   gen_ai.response.model: "claude-opus-4-5-..."
    │   openclaw.agent.duration_ms: 4100
    │
    ├── tool.Read (child)
    │       openclaw.tool.name: "Read"
    │       openclaw.tool.result_chars: 2048
    │
    ├── tool.exec (child)
    │       openclaw.tool.name: "exec"
    │       openclaw.tool.result_chars: 156
    │
    └── tool.Write (child)
            openclaw.tool.name: "Write"
            openclaw.tool.result_chars: 0

Data Flow Comparison¶

Official Plugin: Token Tracking¶

1. Agent calls LLM via pi-ai
2. pi-ai returns response with .usage
3. Gateway calculates cost
4. Gateway emits "model.usage" event with:
   - usage: {input, output, cacheRead, cacheWrite}
   - costUsd: 0.0234
   - model: "claude-..."
   - durationMs: 2341
5. diagnostics-otel receives event
6. Creates metrics + span
7. Batches and exports via OTLP

Custom Plugin: Token Tracking¶

1. Agent calls LLM via pi-ai
2. pi-ai returns response with .usage
3. Gateway fires agent_end hook with:
   - messages: [...including assistant messages with .usage]
4. Custom plugin:
   - Parses messages for usage data
   - Checks for pending diagnostic data (if available)
   - Adds attributes to existing agent turn span
   - Updates counters
5. Ends spans (agent turn, then root)
6. Batches and exports via OTLP

Resource and Attributes¶

Common Attributes¶

Attribute	Description
`service.name`	Service name from config
`openclaw.channel`	Channel (whatsapp, telegram, etc.)
`openclaw.session.key`	Session identifier

Official Plugin Specific¶

Attribute	Description
`openclaw.provider`	LLM provider
`openclaw.model`	Model name
`openclaw.token`	Token type (input/output/cache_*)
`openclaw.webhook`	Webhook update type
`openclaw.outcome`	Message outcome
`openclaw.state`	Session state

Custom Plugin Specific¶

Attribute	Description
`openclaw.agent.id`	Agent identifier
`openclaw.tool.name`	Tool name
`openclaw.tool.call_id`	Tool call UUID
`openclaw.tool.result_chars`	Result size
`gen_ai.usage.input_tokens`	Input token count
`gen_ai.usage.output_tokens`	Output token count
`gen_ai.response.model`	Model used

Performance Considerations¶

Batching¶

Both plugins use batched export: - Traces: BatchSpanProcessor (default 5s or 512 spans) - Metrics: PeriodicExportingMetricReader (default 60s) - Logs: BatchLogRecordProcessor (default 5s)

Overhead¶

Plugin	Overhead Source
Official	Event subscription, metric/span creation
Custom	Hook interception, context map management

Both are lightweight — the OTel SDK handles batching efficiently.

Sampling¶

Reduce trace volume with sampleRate:

{
  "diagnostics": {
    "otel": {
      "sampleRate": 0.1  // 10% of traces
    }
  }
}

When to Use Each¶

Use Case	Recommended
Production monitoring	Official
Cost/token dashboards	Official
Gateway health alerts	Official
Debugging specific requests	Custom
Understanding agent behavior	Custom
Tool execution analysis	Custom
Complete observability	Both