Research: OpenClaw subagent orchestration v2 architecture #120

Open

opened 2026-02-27 00:48:25 +00:00 by nazim · 0 comments

Context

OpenClaw recently shipped a major subagent refactoring across multiple PRs. This issue documents the architecture decisions behind it as a reference for Cobot's own agent orchestration design.

Key Changes

1. Nested Subagent Orchestration (openclaw#14447)

The foundational refactor. Subagents went from fire-and-forget to a proper orchestration system:

subagents tool with list, steer, kill actions — models can manage their own children at runtime
Nested spawning with depth controls: maxSpawnDepth (default 1) and maxChildrenPerAgent caps. Depth-1 orchestrators can spawn; leaf workers cannot
Steer rework — aborts queued work immediately, restarts with steer message in the same session (preserves conversation context across the restart)
Kill cascades — killing a parent kills all descendants
process.poll with timeout — single long-poll replaces rapid poll loops (reduces token churn)
Completion injections tagged [System Message] — prevents model from leaking internal context as user-visible replies
Token reporting split into actual I/O vs prompt/cache cost
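
To make the depth and cascade rules above concrete, here is a minimal Python sketch. It is not OpenClaw's code: `Registry`, `Agent`, and `SpawnDenied` are hypothetical names, the default of 5 for `max_children_per_agent` is an assumption (the source gives no default), and the reading that an agent may spawn only while its depth is below `max_spawn_depth` is an interpretation of "Depth-1 orchestrators can spawn; leaf workers cannot".

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Agent:
    agent_id: str
    depth: int
    children: list[Agent] = field(default_factory=list)
    alive: bool = True


class SpawnDenied(Exception):
    """Raised when a spawn would violate the depth or fan-out policy."""


class Registry:
    def __init__(self, max_spawn_depth: int = 1, max_children_per_agent: int = 5):
        # max_children_per_agent=5 is an illustrative value, not from the source.
        self.max_spawn_depth = max_spawn_depth
        self.max_children_per_agent = max_children_per_agent
        self.root = Agent("main", depth=0)

    def spawn(self, parent: Agent, agent_id: str) -> Agent:
        if not parent.alive:
            raise SpawnDenied(f"{parent.agent_id} is no longer running")
        # Assumed rule: agents at the maximum depth are leaf workers.
        if parent.depth >= self.max_spawn_depth:
            raise SpawnDenied(f"{parent.agent_id} is a leaf worker")
        live_children = [c for c in parent.children if c.alive]
        if len(live_children) >= self.max_children_per_agent:
            raise SpawnDenied(f"{parent.agent_id} reached its child cap")
        child = Agent(agent_id, depth=parent.depth + 1)
        parent.children.append(child)
        return child

    def kill(self, agent: Agent) -> list[str]:
        """Kill an agent and every descendant; return the ids that were stopped."""
        killed = []
        stack = [agent]
        while stack:
            node = stack.pop()
            if node.alive:
                node.alive = False
                killed.append(node.agent_id)
            stack.extend(node.children)
        return killed
```

With `max_spawn_depth=2`, the main agent can spawn an orchestrator, the orchestrator can spawn workers, and the workers are refused. Calling `kill` on the orchestrator stops it and all of its workers in one pass.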

2. Completion Injection into Requester Session (openclaw#26516)

Fixed a gap where direct channel delivery skipped injecting results into the requester session. Orchestration chains would stall because the parent never received the completion signal.
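
The shape of that fix can be sketched as follows, assuming a hypothetical `announce` function and plain lists standing in for the requester session and the channel. The point is only the ordering: the requester session is updated unconditionally, and channel delivery is an extra step rather than a replacement.

```python
def announce(result, requester_session, channel=None):
    """Deliver a subagent result (hypothetical helper, not OpenClaw's API)."""
    # Always inject into the requester session so the parent sees the
    # completion signal, even when the result also goes straight to a channel.
    requester_session.append(f"[System Message] {result}")
    if channel is not None:
        channel.append(result)
```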

3. Subagent Lifecycle Hooks (openclaw#24925)

Typed internal hook events with exactly-once guarantees:

Actions: complete, error, timeout, killed
Run-level dedupe guard (internalHookEmittedRunIds) prevents double-firing when lifecycle listener and agent.wait race
Dedupe entries cleared on all cleanup paths (delete, release, reset, steer-replace)
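
A run-level dedupe guard of this kind can be sketched in a few lines of Python. `LifecycleHooks` and its methods are hypothetical names; only the four actions and the idea of a set of already-emitted run ids come from the description above. The sketch is single-threaded, so a concurrent implementation would need a lock or an atomic check-and-set around the membership test.

```python
class LifecycleHooks:
    ACTIONS = {"complete", "error", "timeout", "killed"}

    def __init__(self):
        self._emitted_run_ids = set()  # plays the role of internalHookEmittedRunIds
        self._listeners = []

    def on(self, listener):
        self._listeners.append(listener)

    def emit(self, run_id, action):
        """Fire the hook once per run; return False if it already fired."""
        if action not in self.ACTIONS:
            raise ValueError(f"unknown action: {action}")
        if run_id in self._emitted_run_ids:
            return False  # a racing caller got here first
        self._emitted_run_ids.add(run_id)
        for listener in self._listeners:
            listener(run_id, action)
        return True

    def clear(self, run_id):
        """Call on every cleanup path (delete, release, reset, steer-replace)."""
        self._emitted_run_ids.discard(run_id)
```

If both the lifecycle listener and the waiter call `emit` for the same run, only the first call reaches the listeners. Clearing the entry on cleanup keeps the set from growing without bound and lets a steer-replaced run emit again.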

4. Thread-bound Subagent Routing (openclaw#23913)

Preserves threadId in nested announce injections so thread-bound subagents route announcements back to the correct thread.

5. Reliability Fixes

Retry with exponential backoff for announce delivery (openclaw#20328)
Suppress spawn-accepted noise for cron sessions (openclaw#27330)
Workspace files (SOUL.md, IDENTITY.md, USER.md) added to subagent bootstrap allowlist (openclaw#24979)
Restore announce chain fix (openclaw#23166)
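
For the first item, a retry loop with exponential backoff might look like the sketch below. The function name, the attempt count, and the delay values are all assumptions chosen for illustration; the source does not state OpenClaw's actual parameters or whether it adds jitter.

```python
import time


def deliver_with_backoff(deliver, attempts=4, base_delay=0.5, max_delay=8.0,
                         sleep=time.sleep):
    """Call deliver() until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return deliver()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            # Delays grow as base_delay * 1, 2, 4, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

The `sleep` parameter is injectable so the schedule can be checked in tests without waiting. A production version would usually add random jitter to avoid synchronized retries.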

6. Orchestration v2 Plan (openclaw#27810)

Internal plan doc proposing the next phase: typed announce outcomes, explicit state transitions, and a reconciliation loop.

Architecture Pattern Summary

spawn → registry → run → completion/error/timeout/killed
                           ↓
                    lifecycle hook (exactly-once)
                           ↓
                    announce (inject into requester session + deliver to channel)
                           ↓
                    cleanup (cascade kills, clear dedupe, release registry)
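
One way to read the flow above is as a small state machine. The sketch below encodes it in Python. The state names and the `advance` helper are hypothetical, derived only from the diagram, and do not reflect the contents of the v2 plan doc or OpenClaw's internal types.

```python
from enum import Enum


class RunState(Enum):
    SPAWNED = "spawned"
    REGISTERED = "registered"
    RUNNING = "running"
    COMPLETE = "complete"
    ERROR = "error"
    TIMEOUT = "timeout"
    KILLED = "killed"
    HOOK_EMITTED = "hook_emitted"
    ANNOUNCED = "announced"
    CLEANED_UP = "cleaned_up"


# The four run outcomes shown in the diagram.
OUTCOMES = {RunState.COMPLETE, RunState.ERROR, RunState.TIMEOUT, RunState.KILLED}

# Each state maps to the states it may move to next.
ALLOWED = {
    RunState.SPAWNED: {RunState.REGISTERED},
    RunState.REGISTERED: {RunState.RUNNING},
    RunState.RUNNING: OUTCOMES,
    RunState.HOOK_EMITTED: {RunState.ANNOUNCED},
    RunState.ANNOUNCED: {RunState.CLEANED_UP},
    RunState.CLEANED_UP: set(),
}
for outcome in OUTCOMES:
    ALLOWED[outcome] = {RunState.HOOK_EMITTED}


def advance(current, target):
    """Return target if the transition is legal, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Making the table explicit means a skipped step, such as announcing before the lifecycle hook has fired, fails loudly instead of silently stalling a chain.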

Key design decisions:

Push-based completion — no polling needed, subagents auto-announce when done
Steer = abort + restart in same session — preserves context
Kill cascades to descendants — no orphaned sub-sub-agents
Depth-aware policy — orchestrators vs leaf workers have different tool access
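
Push-based completion can be illustrated with `asyncio`: the parent blocks on an inbox and is woken by each child's announcement, rather than repeatedly asking whether the child has finished. This is a sketch under stated assumptions. The `subagent` and `orchestrate` functions and the message format are hypothetical, with `asyncio.sleep` standing in for real work.

```python
import asyncio


async def subagent(name, work_seconds, requester_inbox):
    await asyncio.sleep(work_seconds)  # stand-in for the actual task
    # Auto-announce on completion by pushing into the requester's inbox.
    await requester_inbox.put(f"[System Message] subagent {name} complete")


async def orchestrate():
    inbox = asyncio.Queue()
    tasks = [
        asyncio.create_task(subagent("a", 0.02, inbox)),
        asyncio.create_task(subagent("b", 0.01, inbox)),
    ]
    # One await per expected announcement; no polling loop.
    results = [await inbox.get() for _ in tasks]
    await asyncio.gather(*tasks)
    return results


if __name__ == "__main__":
    for message in asyncio.run(orchestrate()):
        print(message)
```

The parent consumes no cycles, and in an LLM setting no tokens, between spawning and the first announcement.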

Relevance to Cobot

Cobot's plugin architecture could adopt similar patterns:

Activity loop as orchestrator with spawn depth limits
Plugin-level lifecycle hooks for agent coordination
Push-based completion instead of polling for long-running tasks

This is a reference document, not a direct implementation plan.
