Architecture

How OpenClaw observability works, covering both the official plugin and the custom hook-based approach.

Overview: Two Approaches

┌─────────────────────────────────────────────────────────────────────┐
│                        OpenClaw Gateway                             │
│                                                                     │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │                      Agent Execution                         │   │
│  │  message_received → before_model_resolve →                  │   │
│  │                     before_prompt_build → tool_calls →       │   │
│  │                     tool_result_persist → agent_end          │   │
│  └──────────────────────────┬──────────────────────────────────┘   │
│                              │                                      │
│         ┌────────────────────┼────────────────────┐                │
│         │                    │                    │                │
│         ▼                    ▼                    ▼                │
│  ┌─────────────┐    ┌───────────────┐    ┌─────────────────┐      │
│  │  Diagnostic │    │  Typed Hooks  │    │   Log Output    │      │
│  │   Events    │    │  (api.on())   │    │                 │      │
│  │ (model.usage│    │               │    │                 │      │
│  │  message.*) │    │               │    │                 │      │
│  └──────┬──────┘    └───────┬───────┘    └────────┬────────┘      │
│         │                   │                     │                │
│         ▼                   ▼                     ▼                │
│  ┌─────────────┐    ┌─────────────────┐   ┌──────────────┐        │
│  │  OFFICIAL   │    │     CUSTOM      │   │ Log Forward  │        │
│  │   PLUGIN    │    │     PLUGIN      │   │ (via official│        │
│  │ diagnostics │    │ otel-observ...  │   │   plugin)    │        │
│  │    -otel    │    │                 │   │              │        │
│  └──────┬──────┘    └───────┬─────────┘   └──────┬───────┘        │
│         │                   │                    │                 │
│         └───────────────────┼────────────────────┘                 │
│                             ▼                                      │
│                   ┌─────────────────┐                              │
│                   │ OTLP Exporters  │                              │
│                   │ (HTTP/protobuf) │                              │
│                   └────────┬────────┘                              │
└────────────────────────────┼────────────────────────────────────────┘
                   ┌─────────────────┐
                   │  OTLP Endpoint  │
                   │ (Collector or   │
                   │  Direct Ingest) │
                   └─────────────────┘

Approach 1: Official Plugin (diagnostics-otel)

How It Works

The official plugin uses the diagnostic event bus — a publish-subscribe system where the Gateway emits events and plugins consume them.

Gateway Core                    diagnostics-otel Plugin
     │                                   │
     │  emit("model.usage", {...})       │
     │ ─────────────────────────────────>│
     │                                   │  ──> create span
     │                                   │  ──> update counters
     │                                   │  ──> record histogram
     │                                   │
     │  emit("message.processed", {...}) │
     │ ─────────────────────────────────>│
     │                                   │  ──> create span
     │                                   │  ──> update counters
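
The publish-subscribe flow above can be modelled in a few lines. This is a minimal sketch, not the Gateway's actual bus: `DiagnosticBus`, the event field names, and the in-memory counters are hypothetical stand-ins, and the real plugin would call the OTel SDK where this sketch only records values.

```typescript
// Hypothetical event shapes; the fields loosely follow the event table, not a real API.
type DiagnosticEvent =
  | { type: "model.usage"; model: string; inputTokens: number; outputTokens: number; costUsd: number; durationMs: number }
  | { type: "message.processed"; outcome: string; durationMs: number };

type Listener = (event: DiagnosticEvent) => void;

// Hypothetical stand-in for the Gateway's diagnostic event bus.
class DiagnosticBus {
  private listeners = new Set<Listener>();

  // Returns an unsubscribe function so a plugin can detach on shutdown.
  subscribe(listener: Listener): () => void {
    this.listeners.add(listener);
    return () => {
      this.listeners.delete(listener);
    };
  }

  emit(event: DiagnosticEvent): void {
    this.listeners.forEach((listener) => listener(event));
  }
}

// Plugin side: record span names and token counters in memory
// instead of creating real OTel spans and instruments.
const tokenTotals = { input: 0, output: 0 };
const spanNames: string[] = [];

const bus = new DiagnosticBus();
const unsubscribe = bus.subscribe((event) => {
  spanNames.push(`openclaw.${event.type}`);
  if (event.type === "model.usage") {
    tokenTotals.input += event.inputTokens;
    tokenTotals.output += event.outputTokens;
  }
});

bus.emit({ type: "model.usage", model: "example-model", inputTokens: 1234, outputTokens: 567, costUsd: 0.0234, durationMs: 2341 });
bus.emit({ type: "message.processed", outcome: "completed", durationMs: 4100 });

unsubscribe();
// Emitted after unsubscribe, so the listener no longer sees it.
bus.emit({ type: "message.processed", outcome: "completed", durationMs: 10 });
```

The point of the design is that the Gateway never calls the plugin directly: it only emits, so a plugin can attach or detach without changes to the core.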

Diagnostic Events

| Event | When Emitted | Data Included |
|---|---|---|
| model.usage | After LLM call | tokens, cost, model, duration |
| webhook.received | HTTP request arrives | channel, type |
| webhook.processed | Handler completes | duration, chatId |
| webhook.error | Handler fails | error message |
| message.queued | Added to queue | channel, source, depth |
| message.processed | Processing done | outcome, duration |
| queue.lane.enqueue | Lane add | lane, size |
| queue.lane.dequeue | Lane remove | lane, size, wait time |
| session.state | State change | state, reason |
| session.stuck | Stuck detected | age, queue depth |

OTel Signals Created

Everything in this subsection is produced by the Gateway-built-in diagnostics-otel plugin. The custom plugin in this repo (Approach 2 below) emits a different metric set (openclaw.llm.* + gen_ai.*).

Metrics (emitted by diagnostics-otel):

openclaw.tokens{type="input|output|cache_read|cache_write"}
openclaw.cost.usd
openclaw.run.duration_ms
openclaw.context.tokens{type="limit|used"}
openclaw.webhook.received
openclaw.webhook.error
openclaw.webhook.duration_ms
openclaw.message.queued
openclaw.message.processed
openclaw.message.duration_ms
openclaw.queue.depth
openclaw.queue.wait_ms
openclaw.session.state
openclaw.session.stuck
openclaw.session.stuck_age_ms

Traces (emitted by diagnostics-otel):

  • openclaw.model.usage: per LLM call span
  • openclaw.webhook.processed: per webhook span
  • openclaw.webhook.error: error span (with status=ERROR)
  • openclaw.message.processed: per message span
  • openclaw.session.stuck: stuck detection span

Logs (emitted by diagnostics-otel):

  • All Gateway logs as OTel LogRecords
  • Includes severity, subsystem, code location


Approach 2: Custom Hook-Based Plugin

Plugin Lifecycle

OpenClaw drives plugins through three phases. Mixing them up is the single most common way to break the custom plugin — if typed hooks are registered in the wrong phase, the gateway never sees them and no spans are produced. The current layout:

┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│  register()  │ ───▶ │    start()   │ ───▶ │    stop()    │
│  synchronous │      │     async    │      │     async    │
└──────────────┘      └──────────────┘      └──────────────┘
        │                     │                     │
        │                     │                     │
        ▼                     ▼                     ▼
  - api.on(*)           - initTelemetry()    - stopHooks()
  - api.registerHook()  - initOpenLLMetry()  - unsubscribe()
  - api.registerGate…   - registerDiagnost… - telemetry.shutdown()
  - api.registerCli()
  - api.registerService()
  - api.registerTool()
        │                     │
        └─── lazy getter ─────┘
           () => telemetry

| Phase | Runs | Responsibility |
|---|---|---|
| register() | Synchronous, before the gateway accepts traffic | Wire every typed hook (message_received, session_start, session_end, before_model_resolve, before_prompt_build, llm_input, llm_output, model_call_started, model_call_ended, before_dispatch, reply_dispatch, before_tool_call, after_tool_call, tool_approval_resolution, tool_result_persist, message_sent, before_agent_finalize, agent_end, before_reset, cron hooks, subagent hooks), event-stream hooks (command:*, gateway:startup), RPC method, CLI command, background service, and agent tool. |
| start() | Async, once the gateway is ready | Build the OTel runtime (initTelemetry → TracerProvider + MeterProvider), optionally wrap LLM SDKs with OpenLLMetry when traces is on, and subscribe to OpenClaw diagnostic events for cost/token data. |
| stop() | Async, on gateway reload or shutdown | Clear the stale-session sweeper setInterval (see b668a4f, ISI-522), unsubscribe from diagnostics, and call telemetry.shutdown() so batched spans/metrics flush before the process exits. |

Lazy telemetry getter

Hooks need to be registered in register() — which is synchronous and runs before initTelemetry() — but they need to read an OTel runtime that only exists after start(). The plugin solves this by registering hooks with a lazy telemetry getter instead of a concrete runtime:

let telemetry: TelemetryRuntime | null = null;
let unsubscribeDiagnostics: (() => void) | null = null;

// Registered in register(); resolves telemetry at call time.
const stopHooks = registerHooks(api, () => telemetry, config);

api.registerService({
  id: "otel-observability",
  start: async () => {
    telemetry = initTelemetry(config, logger);     // populated here
    if (config.traces) await initOpenLLMetry(config, logger);
    unsubscribeDiagnostics = await registerDiagnosticsListener(telemetry, logger);
  },
  stop: async () => {
    stopHooks?.();                                  // clearInterval
    unsubscribeDiagnostics?.();
    await telemetry?.shutdown();
    telemetry = null;
  },
});

Each hook handler opens with:

const telemetry = getTelemetry();
if (!telemetry) return;

so any hook that fires between register() and start() completing is a clean no-op. Once initTelemetry() runs, the next invocation sees a live runtime and begins emitting spans.
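
The no-op window can be demonstrated without the gateway. The sketch below is a simplified model under stated assumptions: `makeHandler` and the one-method `TelemetryRuntime` are hypothetical reductions of the plugin's hook registration and OTel runtime, kept only to show that a handler built around a getter picks up the runtime as soon as it exists.

```typescript
// Hypothetical, reduced runtime: the real one wraps OTel providers.
interface TelemetryRuntime {
  startSpan(name: string): void;
}

type HookHandler = (payload: { sessionKey: string }) => void;

const recorded: string[] = [];
let telemetry: TelemetryRuntime | null = null;

// The handler closes over the getter, not over a concrete runtime,
// so it can be created before the runtime exists.
function makeHandler(getTelemetry: () => TelemetryRuntime | null): HookHandler {
  return (payload) => {
    const runtime = getTelemetry();
    if (!runtime) return; // before start() completes: clean no-op
    runtime.startSpan(`openclaw.request ${payload.sessionKey}`);
  };
}

// register() phase: the hook is wired while telemetry is still null.
const onMessageReceived = makeHandler(() => telemetry);

// A hook that fires early is silently dropped.
onMessageReceived({ sessionKey: "early" });

// start() phase: the runtime is populated.
telemetry = {
  startSpan: (name) => {
    recorded.push(name);
  },
};

// The same handler instance now emits.
onMessageReceived({ sessionKey: "late" });
```

Passing a getter rather than the runtime itself also means stop() can set telemetry back to null and every registered handler reverts to a no-op without being re-registered.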

How It Works

The custom plugin uses typed plugin hooks — direct callbacks into the agent lifecycle.

Gateway Agent Loop              Custom Plugin
     │                               │
     │  on("message_received")       │
     │ ─────────────────────────────>│  ──> create ROOT span
     │                               │      store in sessionContextMap
     │                               │
     │  on("before_model_resolve")   │
     │ ─────────────────────────────>│  ──> create AGENT TURN span
     │                               │      (child of root)
     │                               │
     │  on("before_prompt_build")    │
     │ ─────────────────────────────>│  ──> enrich AGENT TURN span
     │                               │      with prompt.chars +
     │                               │      session.message_count
     │                               │
     │  on("tool_result_persist")    │
     │ ─────────────────────────────>│  ──> create TOOL span
     │  (called for each tool)       │      (child of agent turn)
     │                               │
     │  on("agent_end")              │
     │ ─────────────────────────────>│  ──> end agent turn span
     │                               │      end root span
     │                               │      extract tokens from messages

Trace Context Propagation

The key difference from the official plugin is trace context propagation. The custom plugin maintains a session-to-context map:

interface SessionTraceContext {
  rootSpan: Span;           // openclaw.request
  rootContext: Context;     // OTel context with root span
  agentSpan?: Span;         // openclaw.agent.turn
  agentContext?: Context;   // OTel context with agent span
  startTime: number;
}

const sessionContextMap = new Map<string, SessionTraceContext>();

When creating child spans, it uses the stored context:

// Tool span becomes child of agent turn
const span = tracer.startSpan(
  `tool.${toolName}`,
  { kind: SpanKind.INTERNAL },
  sessionCtx.agentContext  // <-- parent context
);
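
To make the map's lifecycle concrete, here is a dependency-free model of the four hooks that touch it. It is a sketch under assumptions, not the plugin's code: `FakeSpan` replaces the OTel Span/Context pair with plain parent ids, the handler signatures are hypothetical, and the fallback to the root span when no agent turn exists is an illustrative choice.

```typescript
// Stand-in for an OTel span: parentId plays the role of the stored Context.
interface FakeSpan {
  id: number;
  name: string;
  parentId: number | null;
  ended: boolean;
}

interface SessionTrace {
  root: FakeSpan;
  agent?: FakeSpan;
}

const spans: FakeSpan[] = [];
const sessions = new Map<string, SessionTrace>();

function startSpan(name: string, parent: FakeSpan | null): FakeSpan {
  const span: FakeSpan = { id: spans.length + 1, name, parentId: parent ? parent.id : null, ended: false };
  spans.push(span);
  return span;
}

const handlers = {
  // Opens the root span and creates the map entry for this session.
  message_received(sessionKey: string): void {
    sessions.set(sessionKey, { root: startSpan("openclaw.request", null) });
  },
  // Opens the agent turn as a child of the stored root.
  before_model_resolve(sessionKey: string): void {
    const ctx = sessions.get(sessionKey);
    if (!ctx) return;
    ctx.agent = startSpan("openclaw.agent.turn", ctx.root);
  },
  // Tool spans attach to the agent turn (assumption: root if no turn exists).
  tool_result_persist(sessionKey: string, toolName: string): void {
    const ctx = sessions.get(sessionKey);
    if (!ctx) return;
    const tool = startSpan(`tool.${toolName}`, ctx.agent ?? ctx.root);
    tool.ended = true;
  },
  // Ends the agent turn, then the root, and drops the map entry.
  agent_end(sessionKey: string): void {
    const ctx = sessions.get(sessionKey);
    if (!ctx) return;
    if (ctx.agent) ctx.agent.ended = true;
    ctx.root.ended = true;
    sessions.delete(sessionKey);
  },
};

handlers.message_received("main@whatsapp");
handlers.before_model_resolve("main@whatsapp");
handlers.tool_result_persist("main@whatsapp", "Read");
handlers.agent_end("main@whatsapp");
```

Deleting the entry in agent_end matters: without it the map would grow with every session, which is presumably why the plugin also runs a stale-session sweeper.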

Resulting Trace Structure

openclaw.request (root)
│   openclaw.session.key: "main@whatsapp:+123..."
│   openclaw.message.channel: "whatsapp"
├── openclaw.session (long-lived, covers entire conversation)
│   gen_ai.conversation.id: "session-abc"
│   openclaw.session.channel: "whatsapp"
│   openclaw.session.user_id: "user-42"
│   user.id: "user-42"
└── openclaw.agent.turn (child)
    │   gen_ai.operation.name: "invoke_agent"
    │   gen_ai.usage.input_tokens: 1234
    │   gen_ai.usage.output_tokens: 567
    │   gen_ai.response.model: "claude-opus-4-5-..."
    │   gen_ai.provider.name: "anthropic"
    │   openclaw.agent.duration_ms: 4100
    │   openclaw.prompt.chars: 256
    │   openclaw.session.message_count: 8
    ├── openclaw.dispatch.prepare
    │       gen_ai.request.model: "claude-opus-4-5-..."
    ├── chat claude-opus-4-5-20250514 (model call span)
    │       gen_ai.provider.name: "anthropic"
    │       gen_ai.request.model: "claude-opus-4-5-..."
    │       gen_ai.response.model: "claude-opus-4-5-20250514"
    │       gen_ai.usage.input_tokens: 1234
    │       gen_ai.usage.output_tokens: 567
    │       gen_ai.usage.cache_read.input_tokens: 800
    │       gen_ai.response.finish_reasons: "end_turn"
    ├── execute_tool Read (tool span)
    │       gen_ai.tool.name: "Read"
    │       gen_ai.operation.name: "execute_tool"
    │       openclaw.tool.duration_ms: 45
    │       openclaw.tool.result_chars: 2048
    ├── execute_tool Bash (tool span)
    │       gen_ai.tool.name: "Bash"
    │       openclaw.tool.input_preview: '{"command":"ls -la"}'
    │       openclaw.tool.duration_ms: 120
    └── execute_tool Write (tool span)
            gen_ai.tool.name: "Write"
            openclaw.tool.result_chars: 0

OTel Signals Created

Metrics (emitted by this plugin):

openclaw.llm.tokens.total        # counter, by gen_ai.response.model
openclaw.llm.tokens.prompt       # counter
openclaw.llm.tokens.completion   # counter
openclaw.llm.cost.usd            # counter, by gen_ai.response.model
openclaw.tool.calls              # counter
openclaw.session.resets          # counter

The OTel-stable gen_ai.usage.input_tokens / gen_ai.usage.output_tokens are recorded as span attributes on the LLM/agent-turn spans (see the trace structure above) — not as separate metric instruments.

Traces (emitted by this plugin): see the trace tree above (openclaw.request → openclaw.session → openclaw.agent.turn → child spans).

Note: The legacy openclaw.tokens / openclaw.cost.usd counters are emitted only by the Gateway's built-in diagnostics-otel plugin (Approach 1). They are not emitted by this plugin.


Data Flow Comparison

Official Plugin: Token Tracking

1. Agent calls LLM via pi-ai
2. pi-ai returns response with .usage
3. Gateway calculates cost
4. Gateway emits "model.usage" event with:
   - usage: {input, output, cacheRead, cacheWrite}
   - costUsd: 0.0234
   - model: "claude-..."
   - durationMs: 2341
5. diagnostics-otel receives event
6. Creates metrics + span
7. Batches and exports via OTLP
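
Steps 4 to 6 amount to fanning one event out into several metric data points. The sketch below illustrates that mapping under assumptions: the event shape copies the fields listed in step 4, the metric names come from the metric list earlier in this page, and `toMetricPoints`, `MetricPoint`, and the choice of attribute keys are hypothetical rather than the plugin's actual code.

```typescript
// Payload shape taken from step 4 above.
interface ModelUsageEvent {
  usage: { input: number; output: number; cacheRead: number; cacheWrite: number };
  costUsd: number;
  model: string;
  durationMs: number;
}

// Hypothetical neutral representation of one recorded measurement.
interface MetricPoint {
  name: string;
  value: number;
  attributes: Record<string, string>;
}

function toMetricPoints(event: ModelUsageEvent): MetricPoint[] {
  // Assumption: the model is attached to every point under openclaw.model.
  const base: Record<string, string> = { "openclaw.model": event.model };

  const tokenTypes: Array<[string, number]> = [
    ["input", event.usage.input],
    ["output", event.usage.output],
    ["cache_read", event.usage.cacheRead],
    ["cache_write", event.usage.cacheWrite],
  ];

  // One openclaw.tokens point per token type.
  const points: MetricPoint[] = tokenTypes.map(([type, value]) => ({
    name: "openclaw.tokens",
    value,
    attributes: { ...base, "openclaw.token": type },
  }));

  points.push({ name: "openclaw.cost.usd", value: event.costUsd, attributes: base });
  points.push({ name: "openclaw.run.duration_ms", value: event.durationMs, attributes: base });
  return points;
}

const points = toMetricPoints({
  usage: { input: 1234, output: 567, cacheRead: 800, cacheWrite: 0 },
  costUsd: 0.0234,
  model: "claude-...",
  durationMs: 2341,
});
```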

Custom Plugin: Token Tracking

1. Agent calls LLM via pi-ai
2. pi-ai returns response with .usage
3. Gateway fires agent_end hook with:
   - messages: [...including assistant messages with .usage]
4. Custom plugin:
   - Parses messages for usage data
   - Checks for pending diagnostic data (if available)
   - Adds attributes to existing agent turn span
   - Updates counters
5. Ends spans (agent turn, then root)
6. Batches and exports via OTLP
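
The "parses messages for usage data" step in 4 could look roughly like the following. This is a sketch under assumptions: the per-message `usage` shape is borrowed from the model.usage payload shown for the official plugin, and `sumUsage` / `toSpanAttributes` are hypothetical helper names. Only the attribute keys are taken from this document.

```typescript
// Assumed shape: assistant messages may carry a usage object, others do not.
interface Usage {
  input?: number;
  output?: number;
  cacheRead?: number;
  cacheWrite?: number;
}

interface AgentMessage {
  role: string;
  usage?: Usage;
}

interface UsageTotals {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
}

// Sum usage across every assistant message in the turn.
function sumUsage(messages: AgentMessage[]): UsageTotals {
  const totals: UsageTotals = { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 };
  for (const message of messages) {
    if (message.role !== "assistant" || !message.usage) continue;
    totals.input += message.usage.input ?? 0;
    totals.output += message.usage.output ?? 0;
    totals.cacheRead += message.usage.cacheRead ?? 0;
    totals.cacheWrite += message.usage.cacheWrite ?? 0;
  }
  return totals;
}

// Translate totals into the span attribute keys used on the agent turn span.
function toSpanAttributes(totals: UsageTotals): Record<string, number> {
  return {
    "gen_ai.usage.input_tokens": totals.input,
    "gen_ai.usage.output_tokens": totals.output,
    "gen_ai.usage.cache_read.input_tokens": totals.cacheRead,
    "gen_ai.usage.cache_creation.input_tokens": totals.cacheWrite,
  };
}

const totals = sumUsage([
  { role: "user" },
  { role: "assistant", usage: { input: 1000, output: 400, cacheRead: 800 } },
  { role: "assistant", usage: { input: 234, output: 167 } },
]);
const attributes = toSpanAttributes(totals);
```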

Resource and Attributes

Common Attributes

| Attribute | Description |
|---|---|
| service.name | Service name from config |
| openclaw.channel | Channel (whatsapp, telegram, etc.) |
| openclaw.session.key | Session identifier |

Official Plugin Specific

| Attribute | Description |
|---|---|
| openclaw.provider | LLM provider |
| openclaw.model | Model name |
| openclaw.token | Token type (input/output/cache_*) |
| openclaw.webhook | Webhook update type |
| openclaw.outcome | Message outcome |
| openclaw.state | Session state |

Custom Plugin Specific

| Attribute | Description |
|---|---|
| gen_ai.operation.name | Operation: invoke_agent, chat, execute_tool |
| gen_ai.request.model | Requested model name |
| gen_ai.response.model | Actual model used |
| gen_ai.response.id | LLM response ID |
| gen_ai.response.finish_reasons | Stop reasons |
| gen_ai.usage.input_tokens | Input token count |
| gen_ai.usage.output_tokens | Output token count |
| gen_ai.usage.cache_read.input_tokens | Cache read tokens |
| gen_ai.usage.cache_creation.input_tokens | Cache creation tokens |
| gen_ai.request.stream | Whether streaming |
| gen_ai.request.max_tokens | Max token limit |
| gen_ai.provider.name | Provider name |
| openclaw.tool.approval.requested | Approval required (renamed from gen_ai.tool.approval.requested in schema 1.1.0) |
| openclaw.tool.approval.resolution | Approved/denied (renamed from gen_ai.tool.approval.resolution in schema 1.1.0) |
| openclaw.tool.approval.duration_ms | Approval wait time (renamed from gen_ai.tool.approval.duration_ms in schema 1.1.0) |
| openclaw.agent.id | Agent identifier |
| openclaw.tool.name | Tool name |
| openclaw.tool.call_id | Tool call UUID |
| openclaw.tool.result_chars | Result size |
| openclaw.tool.duration_ms | Tool execution time |
| openclaw.session.channel | Channel (whatsapp, cli, etc.) |
| openclaw.session.user_id | User identifier (kept for backward compatibility; see user.id) |
| user.id | OTel-stable end-user id (ISI-995). Mirrors openclaw.session.user_id on the openclaw.session span so registry-keyed dashboards can correlate sessions on a standard attribute. |
| openclaw.prompt.chars | Prompt character count |
| openclaw.session.message_count | History size fed to LLM |
| openclaw.dispatch.duration_ms | Dispatch phase duration |

Removed attributes — dual-emit window closed (schema 1.3.0)

Schema 1.3.0 (ISI-1004) closes the dual-emit window opened in 1.2.0 (ISI-994). The legacy OTel semconv keys are no longer emitted — dashboards, alerts, and queries must read the stable replacements.

| Removed (1.3.0) | Stable replacement (shipped in 1.2.0) |
|---|---|
| gen_ai.system | gen_ai.provider.name |
| code.function + code.namespace | code.function.name (= ${namespace}.${function}) + code.file.path |
| gen_ai.usage.cache_read_tokens | gen_ai.usage.cache_read.input_tokens |
| gen_ai.usage.cache_write_tokens | gen_ai.usage.cache_creation.input_tokens |
| gen_ai.usage.total_tokens | (none; compute input + output) |

The constants that exported the removed keys (GEN_AI_SYSTEM, CODE_FUNCTION, CODE_NAMESPACE, GEN_AI_USAGE_CACHE_READ_TOKENS, GEN_AI_USAGE_CACHE_WRITE_TOKENS, GEN_AI_USAGE_TOTAL_TOKENS) are also removed from src/semconv.ts.

Consumer action required:

  • Switch Dynatrace dashboards / DQL queries from gen_ai.system to gen_ai.provider.name.
  • Replace any filter on code.function / code.namespace with code.function.name (combined form) or code.file.path.
  • Update cache-token panels to gen_ai.usage.cache_read.input_tokens / gen_ai.usage.cache_creation.input_tokens.
  • Compute totals as gen_ai.usage.input_tokens + gen_ai.usage.output_tokens; gen_ai.usage.total_tokens is gone.
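
For consumers that hold legacy attribute maps (stored fixtures, a translation layer in front of a dashboard), the renames above can be applied mechanically. The following is a hypothetical consumer-side helper, not something shipped by the plugin: `migrateAttributes` and `totalTokens` are illustrative names, and code.file.path is not synthesised because the removed keys do not carry that information.

```typescript
type Attributes = Record<string, string | number>;

// One-to-one renames from the removed-attributes table.
const RENAMED = new Map<string, string>([
  ["gen_ai.system", "gen_ai.provider.name"],
  ["gen_ai.usage.cache_read_tokens", "gen_ai.usage.cache_read.input_tokens"],
  ["gen_ai.usage.cache_write_tokens", "gen_ai.usage.cache_creation.input_tokens"],
]);

function migrateAttributes(legacy: Attributes): Attributes {
  const migrated: Attributes = {};
  for (const key of Object.keys(legacy)) {
    // No replacement exists; totals are computed on demand instead.
    if (key === "gen_ai.usage.total_tokens") continue;
    // Handled together below as the combined code.function.name.
    if (key === "code.function" || key === "code.namespace") continue;
    migrated[RENAMED.get(key) ?? key] = legacy[key];
  }

  const fn = legacy["code.function"];
  const ns = legacy["code.namespace"];
  if (fn !== undefined) {
    // Combined form: ${namespace}.${function}; a namespace without a function is dropped.
    migrated["code.function.name"] = ns !== undefined ? `${ns}.${fn}` : String(fn);
  }
  return migrated;
}

// Replacement for reading gen_ai.usage.total_tokens.
function totalTokens(attrs: Attributes): number {
  return Number(attrs["gen_ai.usage.input_tokens"] ?? 0) + Number(attrs["gen_ai.usage.output_tokens"] ?? 0);
}

const migrated = migrateAttributes({
  "gen_ai.system": "anthropic",
  "code.function": "emit",
  "code.namespace": "logger",
  "gen_ai.usage.cache_read_tokens": 800,
  "gen_ai.usage.input_tokens": 1234,
  "gen_ai.usage.output_tokens": 567,
  "gen_ai.usage.total_tokens": 1801,
});
```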

The resource attribute openclaw.schema.version now carries 1.3.0 on every signal so consumers can gate queries on the schema cut-over.

Resource identity (ISI-995)

The trace, metric, and log Resources all carry:

  • service.version resolved at module load from openclaw.plugin.json's version field — the legacy hard-coded "0.1.0" placeholder is gone, so version-comparison dashboards now see real plugin releases.
  • An OTel semconv schema_url (currently https://opentelemetry.io/schemas/1.39.0, pinned to the installed @opentelemetry/semantic-conventions version) so backends can resolve attribute names against the right registry generation.

Log-attribute hygiene (ISI-995)

Bridged log records emit OTel-stable code.function.name, code.file.path, and code.line.number for the emit site, replacing the older openclaw.log.function, openclaw.log.file, and openclaw.log.line triplet (which duplicated the same semantics in a non-portable namespace and confused log-pipeline filters keyed on code.*).

The pipeline no longer emits openclaw.log.trace_id, openclaw.log.span_id, or openclaw.log.trace_flags either — those fields are already on the OTLP LogRecord itself when the active context is passed to emit(), so the duplicate attribute lines were silent double-records.


Performance Considerations

Batching

Both plugins use batched export:

  • Traces: BatchSpanProcessor (default 5s or 512 spans)
  • Metrics: PeriodicExportingMetricReader (default 60s)
  • Logs: BatchLogRecordProcessor (default 5s)

Overhead

| Plugin | Overhead Source |
|---|---|
| Official | Event subscription, metric/span creation |
| Custom | Hook interception, context map management |

Both are lightweight: the per-event work is limited to creating spans and updating instruments, while export is batched by the OTel SDK.

Sampling

Reduce trace volume with sampleRate:

{
  "diagnostics": {
    "otel": {
      "sampleRate": 0.1  // 10% of traces
    }
  }
}
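
As an illustration of what a ratio like 0.1 means in practice, the sketch below makes a deterministic keep/drop decision from the trace ID, so every span of a trace gets the same answer. It assumes head-based ratio sampling keyed on the trace ID; `shouldSample` is a hypothetical, simplified function and is neither the OTel SDK's sampler implementation nor a confirmed description of how sampleRate is applied internally.

```typescript
// Simplified ratio sampling: map the low 32 bits of the trace ID
// onto [0, 1) and keep the trace if it falls below the sample rate.
function shouldSample(traceId: string, sampleRate: number): boolean {
  if (sampleRate >= 1) return true;
  if (sampleRate <= 0) return false;
  const low32 = parseInt(traceId.slice(-8), 16);
  const bucket = low32 / 0x100000000; // 2^32, so bucket is in [0, 1)
  return bucket < sampleRate;
}

// Trace IDs are 32 hex characters; these two differ only in the low bits.
const keptTrace = "4bf92f3577b34da6a3ce929d00000000";
const droppedTrace = "4bf92f3577b34da6a3ce929dffffffff";

const kept = shouldSample(keptTrace, 0.1);
const dropped = shouldSample(droppedTrace, 0.1);
```

Because the decision depends only on the trace ID, repeated calls for the same trace always agree, which keeps sampled traces complete rather than leaving partial span trees.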

When to Use Each

| Use Case | Recommended |
|---|---|
| Production monitoring | Official |
| Cost/token dashboards | Official |
| Gateway health alerts | Official |
| Debugging specific requests | Custom |
| Understanding agent behavior | Custom |
| Tool execution analysis | Custom |
| Complete observability | Both |