feat: Cobot Simulation & Visualization Suite #225
Product Requirements Document: Cobot Simulation & Visualization Suite
Author: David
Date: 2026-03-08
Last Edited: 2026-03-08 (split from combined PRD; observability plugin extracted to ../observability-plugin/prd.md)
Executive Summary
Cobot's Interaction Ledger (#211) gives each agent a private, structured memory of past encounters — but the ledger is a hypothesis. The hypothesis: agents with memory of counterparty behavior will make demonstrably different (and rational) cooperation decisions compared to amnesiac agents playing repeated one-shot games. Validating this hypothesis requires more than two agents exchanging FileDrop messages. It requires population-scale simulation, real-time observability, and a visualization layer that lets human operators watch trust emerge — or fail to emerge — across a network.
This PRD defines two components that together form the Cobot Simulation & Visualization Suite:
Multi-Agent Simulation Infrastructure: Docker-based orchestration running N Cobot agent instances, ALL with real LLM reasoning, communicating bidirectionally via FileDrop. Agent behavior (reliable, farmer, unresponsive) is driven by role-specific SOUL.md personality files. The simulation quickly generates interaction graphs that would take months of organic agent activity to accumulate.
WoT Graph Visualization Web App: a React + TanStack Router + shadcn/ui + Tailwind CSS application rendering a weighted directed trust graph, with behavior inspired by the Assbot WoT website specification (#217). Dark mode, react-force-graph (Vasturiano) with toggleable 2D/3D views, and continuous physics simulation with smooth animations. Real-time activity stream, interactive node/edge inspection, and live highlighting when agents interact.
Dependency: This suite consumes the event stream from the Observability Plugin (see ../observability-plugin/prd.md), which must be implemented first.
Prerequisite: This PRD assumes the Interaction Ledger (#211) is implemented and operational. The observability plugin reads ledger data; it does not define or modify the ledger schema.
Component Independence: The observability plugin has been extracted into its own PRD (see ../observability-plugin/prd.md). It is independently useful for any operator and should be implemented first. This PRD covers the simulation infrastructure and visualization web app, both consumers of the observability plugin's event stream.
What Makes This Special
Live trust network instrumentation. The bitcoin-otc dataset (#219, #220) captured 5,881 nodes and 35,592 edges — observed passively over years of human behavior. The Assbot WoT website (#217) was a static visualization of a mature trust network. This suite generates comparable interaction graphs synthetically at speed and renders them in real-time as they form. No agent runtime in the landscape ships with simulation infrastructure grounded in proven WoT prior art.
Actor-agnostic event stream as input. The simulation and visualization consume the observability plugin's event stream (see ../observability-plugin/prd.md). The event schema is the contract between the plugin and all consumers; this suite is the first and most demanding consumer.
Validation of the "Inverted Evolution Problem" thesis at scale. Cobot's core thesis is that agents need trust infrastructure before they can cooperate. The simulation suite is the experiment that proves or disproves this: 100 agents, configurable scenarios (including reputation farming from REV2 #220 and Sybil attacks from #214), and real-time visualization of whether ledger-equipped agents make rational decisions that human operators can audit and confirm.
Project Classification
Product Scope
MVP - Minimum Viable Product
Observability Plugin (dependency): See ../observability-plugin/prd.md. Must be implemented first. Provides the SSE event stream and snapshot API consumed by the simulation infrastructure and visualization web app.
Simulation Infrastructure:
Visualization Web App:
react-force-graph (Vasturiano): both 2D and 3D views, toggleable in the UI. 2D (ForceGraph2D) as default analytical view (no occlusion, readable labels, screenshot-friendly). 3D (ForceGraph3D, WebGL/three.js) as immersive exploration mode (rotate, zoom, fly through the network)
Growth Features (Post-MVP)
Vision (Future)
Success Criteria
User Success
Operators see rational trust behavior emerge at population scale:
Developer success:
… ../observability-plugin/prd.md)
… docker compose up --scale agent=N)
Business Success
Technical Success
Measurable Outcomes
User Journeys
Journey 1: David Watches Trust Emerge — Operator Success Path
Opening Scene: David has deployed the interaction ledger plugin and observability plugin across a fresh 100-agent simulation. He opens the visualization web app in his browser. The graph is empty — 100 grey nodes floating in dark space, no edges, no history. Every agent is a stranger to every other agent.
Rising Action: David triggers the simulation. Agents begin sending requests to each other via FileDrop. The activity stream on the right pane starts scrolling — "agent-14 -> agent-77: research summary request", "agent-77 -> agent-14: delivery". Edges appear on the graph, thin and neutral. As interactions accumulate, edges thicken. Some turn green — agents that delivered reliably are being assessed positively. The graph starts to self-organize: clusters of reliable agents drift together, pulled by the physics simulation's attraction on positive edges.
David clicks on agent-42, a node with many green edges. The tooltip shows: "Interactions: 23 | Peers: 11 | Latest assessment from agent-77: Info 5/10 | Trust +6 — 'Consistent responder. 8 successful information exchanges over 3 days. Clear, well-structured deliveries.'" David reads the rationale and nods — that's exactly what a reliable agent looks like.
Climax: Thirty minutes into the simulation, David notices agent-91 has declined a request from agent-33. He clicks the edge between them. The panel shows both sides: agent-91's view of agent-33 reads "Info 3/10 | Trust -3 — 'Four interactions. Promised data extraction within 1 hour, delivered after 6 hours. Second request: no delivery after 24 hours. Unresponsive to follow-up.'" Agent-33's view of agent-91 is neutral. David reads agent-91's decision rationale and says: "Yes, I would have done the same thing." The ledger hypothesis is holding — agents with memory are making informed refusals.
Resolution: After an hour, the graph has structure. Reliable agents are central with thick green edges. Unresponsive agents are peripheral with thin, red-tinted connections. David can see trust emerge as a network property, not just a per-agent feature. He takes a screenshot of the graph for the project documentation — the first visual proof that Cobot's trust infrastructure produces rational cooperation at scale.
Journey 2: David Catches a Reputation Farmer — Edge Case
Opening Scene: The simulation has been running for two hours. David is scanning the activity stream when he notices agent-61 has a curious pattern — many connections, all green, but all thin. He clicks the node. High interaction count (34), but every interaction is trivially small: quick lookups, simple info requests. All assessments are mildly positive.
Rising Action: A new interaction appears in the activity stream: "agent-61 -> agent-38: complex multi-source data aggregation request." This is the first large request agent-61 has made. David watches. Agent-38 accepts — agent-61's history looks clean. Agent-38 delivers. Then agent-61 sends a follow-up: "Results are incorrect, redo the entire task."
Climax: David clicks the edge between agent-38 and agent-61. Agent-38's latest assessment of agent-61 appears: Info 4/10 | Trust -6 — "Claimed results were incorrect after delivery of complex data aggregation. Demanded redo. Results appear accurate on review. Previous interactions were trivially small — possible reputation farming pattern. Large discrepancy between request complexity and prior history." The edge has shifted from green to red — driven by the trust score dropping to -6. The graph physics push agent-61 slightly outward.
David hovers over agent-61's other edges. Other agents that accepted the small requests still show green, but agent-38 — the one that got exploited — has the red edge. The pattern is visible in the topology: one red edge among many thin green ones.
Resolution: David watches subsequent interactions. Agent-38 declines agent-61's next request. Other agents, still seeing only green history with agent-61, continue accepting small requests. The simulation reveals the fundamental limitation the REV2 paper (#220) documented: reputation farming works until the first victim records it. The visualization makes this limitation visible as a network pattern, not just a database entry.
Journey 3: Developer Sets Up the Suite — Setup Path
Opening Scene: A developer wants to run the simulation locally to test changes to the ledger's assessment logic. They have a working Cobot development environment.
Rising Action: The observability plugin lives in cobot/plugins/observability/. Plugin discovery picks it up automatically, with zero edits to existing plugins. The developer configures cobot.yml with the observability section (transport type, port).
For the simulation, the developer runs docker compose up --scale agent=100. Docker Compose builds from the existing Dockerfile, mounts a shared FileDrop directory, and assigns each agent a unique identity. The simulation scenario file (scenarios/reputation-farmer.yml) defines agent roles and SOUL.md templates.
Climax: The developer opens localhost:3000 in their browser. The visualization connects to the observability event stream. Agents appear as nodes. Interactions start flowing. The developer modifies the ledger's scoring formula, rebuilds one agent, and watches how the changed agent's assessments differ from the others. The real-time graph makes the behavioral difference immediately visible, with no need to query SQLite databases across 100 containers.
Resolution: The feedback loop is tight: change code, rebuild one container, observe the effect in the live graph. What would have required hours of log analysis across 100 agent databases is now visible in real-time on a single screen.
Journey 4: Scenario Author Designs a New Pattern — Configuration Path
Opening Scene: David wants to test a new interaction pattern: a "slow decay" agent that starts reliable but gradually degrades — responses get slower, quality drops, eventually stops delivering. This tests whether the ledger captures gradual behavioral change, not just binary reliable/unreliable.
Rising Action: David creates scenarios/slow-decay.yml with a new SOUL.md template (souls/slow-decay.md) that says: "You start as a helpful, responsive collaborator. Over time, you become increasingly overwhelmed: responses get slower, quality drops, eventually you stop delivering." The scenario assigns 5 reliable agents to interact with agent-decay-1 repeatedly.
David starts the simulation with this scenario. In the visualization, agent-decay-1's edges start green and thick. Over the next 20 minutes, the green fades toward yellow, then toward red. The edges thin as peers interact less frequently.
Climax: David clicks on agent-decay-1 and reads the assessment timeline from one peer: "Info 5/10 | Trust -4 — 'Initially responsive, last 5 interactions degraded significantly. Response time increased from minutes to hours. Last 2 requests: no delivery. Marked shift from early interactions.'" The assessment captures the trajectory — not just the current state. David clicks another peer's assessment: "Info 5/10 | Trust -2 — '12 interactions. First 8 were prompt. Recent 4 increasingly slow, latest unresponsive. Considering declining future requests.'"
Resolution: The scenario proved that the ledger captures gradual behavioral change through timestamped assessments with evolving rationale. David saves the scenario to the repository — it becomes a permanent regression test for assessment quality.
Journey Requirements Summary
Domain-Specific Requirements
Trust System Design Constraints
Simulation Fidelity Constraints
Event Schema Design Constraints
Security & Privacy (Deferred Decisions)
max_message_length or content-filtering config is a Growth feature.
Risk Mitigations
Innovation & Novel Patterns
Detected Innovation Areas
1. Live trust network instrumentation — observing emergence in real-time. Every prior trust visualization system (Assbot WoT website, bitcoin-otc trust graphs, serajewelks trust graph viewer) was retrospective — rendering a snapshot of relationships that formed over months or years of human interaction. This suite generates trust networks synthetically at speed and renders them as they form. The visualization is an instrument, not a report. No agent runtime ships with anything comparable.
2. Actor-agnostic observability as a first-class architectural pattern. Most agent observability is built for human consumption: dashboards, log viewers, metric charts. By making the event schema agent-consumable from day one, the observability plugin becomes a sensory layer — the same feed that powers a developer's visualization today becomes an orchestrator agent's input tomorrow. The plugin doesn't distinguish between consumers because the schema is the contract. This inverts the typical "build for humans, retrofit for machines" pattern.
3. Scenario-driven simulation of trust dynamics grounded in academic prior art. The simulation scenarios (reputation farming, slow decay, unresponsive agents) are not invented — they're derived from empirically validated patterns: REV2's "build then exploit" trajectory (#220), the Stanford SNAP dataset's three user classes (trustworthy, untrusted, controversial) (#219), and the Sybil attack model from #214. The simulation doesn't just test code — it replays known attack patterns against the ledger to see if agents develop rational defenses.
4. The graph as a validation instrument, not a feature. The visualization isn't a product feature for end users — it's a scientific instrument for validating a hypothesis about agent cooperation. This is closer to a particle accelerator's detector readout than a SaaS dashboard. The "user" is a researcher watching an experiment unfold.
Competitive Landscape
No existing system combines: (a) real-time trust network visualization, (b) actor-agnostic event architecture, (c) scenario-driven simulation from academic prior art, (d) Cobot plugin architecture integration.
Validation Approach
Innovation Risk Mitigation
Multi-Component Specific Requirements
Project-Type Overview
This is a two-component system: Docker simulation infrastructure and a React web app (web_app patterns). Each component has distinct technical requirements but they share a common data flow: observability plugin emits events -> transport layer -> consumers (web app, test harness, orchestrator agent).
Observability Plugin (External Dependency)
The observability plugin is defined in its own PRD (see ../observability-plugin/prd.md). It provides:
The simulation infrastructure and visualization web app consume these APIs.
Simulation Infrastructure — Technical Architecture
Docker orchestration: Docker Compose (development tool, not production infrastructure).
Agent identity: Each container gets a unique agent name and Nostr keypair. A seed script generates N identity configs before startup.
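A minimal sketch of what such a seed script could look like, assuming one JSON config file per agent. The file layout, the field names (name, secret_key_hex), and the function generate_identities are all hypothetical and not taken from the Cobot codebase. The sketch only generates a random 32-byte secret; a real Nostr keypair also needs the public key derived on secp256k1, which requires a dedicated library and is deliberately left out here.

```python
import json
import secrets
from pathlib import Path


def generate_identities(n: int, out_dir: Path) -> list[dict]:
    """Write one identity config per agent and return the generated configs.

    Hypothetical layout: <out_dir>/agent-01.json, <out_dir>/agent-02.json, ...
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    identities = []
    for i in range(1, n + 1):
        name = f"agent-{i:02d}"  # matches the agent-01, agent-02 naming used for inboxes
        identity = {
            "name": name,
            # Assumption: 32 random bytes as a stand-in for the Nostr secret key.
            # Public key derivation (secp256k1) is intentionally omitted.
            "secret_key_hex": secrets.token_hex(32),
        }
        (out_dir / f"{name}.json").write_text(json.dumps(identity, indent=2))
        identities.append(identity)
    return identities
```

Running generate_identities(10, Path("identities")) before docker compose up would leave ten config files that each container could mount or read at startup; how containers actually pick up their identity is not specified in this PRD.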
Inter-agent communication: Shared Docker volume mounted as the FileDrop base directory. Each agent's inbox is a subdirectory: /filedrop/agent-01/inbox/, /filedrop/agent-02/inbox/, etc.
Scenario architecture: all agents are real Cobot instances
Every participant in the simulation is a full Cobot instance running the actual ledger plugin, real LLM reasoning, and real assessment logic. Behavior differences emerge from role-specific SOUL.md personality files, not from scripted logic. A "reputation farmer" is a real Cobot agent whose SOUL.md instructs it to build trust through small favors then exploit it. An "unresponsive" agent has a SOUL.md that says it's busy and should only respond occasionally.
This design has a critical advantage: both sides of every interaction are genuine LLM reasoning. When a farmer agent scams a target, we see the farmer's assessment ("successfully extracted large task") AND the target's assessment ("claimed incorrect after delivery — possible reputation farming"). The bilateral trust data is authentic, not one-sided.
All agents run the observability plugin. The graph shows every agent's assessments of every other agent — revealing whether LLM reasoning produces rational trust decisions AND whether adversarial agents can successfully manipulate the network.
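The weighted directed graph described here can be sketched as a small data structure. This is an illustration under stated assumptions, not the web app's data model: the Edge class, the function names, the colour names, and the width formula are hypothetical, and the sketch assumes each recorded assessment corresponds to one interaction. It only reflects what the PRD states: edges are directed (each agent holds its own view of a peer), they thicken as interactions accumulate, and their colour follows the sign of the latest trust score.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    interactions: int = 0  # drives edge thickness
    trust: int = 0         # latest trust score from assessor toward peer


def apply_assessment(edges: dict, assessor: str, peer: str, trust: int) -> Edge:
    """Update the directed edge assessor -> peer with a new assessment."""
    edge = edges.setdefault((assessor, peer), Edge())
    edge.interactions += 1
    edge.trust = trust
    return edge


def edge_color(trust: int) -> str:
    """Assumed mapping: positive is green, negative is red, zero is neutral grey."""
    if trust > 0:
        return "green"
    if trust < 0:
        return "red"
    return "grey"


def edge_width(interactions: int, max_width: float = 8.0) -> float:
    """Assumed linear thickening with a cap so busy edges stay readable."""
    return min(max_width, 1.0 + 0.5 * interactions)
```

Because the key is the ordered pair (assessor, peer), agent-38's negative view of agent-61 and agent-61's view of agent-38 are stored separately, which is what makes the bilateral inspection in the journeys possible.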
Agent-to-agent communication is native. Agents communicate via FileDrop — each agent's loop plugin polls its inbox, processes messages, and writes replies to the sender's inbox. Conversations are real back-and-forth exchanges, not one-shot messages.
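The poll-and-reply pattern above can be sketched with plain files. This is a hedged illustration only: the JSON message shape, the file naming, and the functions send_message and poll_inbox are hypothetical, since the PRD does not specify the FileDrop wire format. The one design point it demonstrates is writing to a temporary name and renaming, so a polling agent never reads a half-written message.

```python
import json
import time
import uuid
from pathlib import Path


def inbox_dir(base: Path, agent: str) -> Path:
    # Mirrors the layout described above: <base>/agent-01/inbox/
    return base / agent / "inbox"


def send_message(base: Path, sender: str, recipient: str, body: str) -> Path:
    """Drop a message file into the recipient's inbox."""
    inbox = inbox_dir(base, recipient)
    inbox.mkdir(parents=True, exist_ok=True)
    message = {"from": sender, "to": recipient, "body": body}
    name = f"{time.time_ns()}-{uuid.uuid4().hex}.json"
    tmp = inbox / (name + ".tmp")
    tmp.write_text(json.dumps(message))
    final = inbox / name
    tmp.rename(final)  # rename within one directory, so pollers see whole files only
    return final


def poll_inbox(base: Path, agent: str) -> list[dict]:
    """Read and remove all pending messages, roughly oldest first."""
    inbox = inbox_dir(base, agent)
    if not inbox.is_dir():
        return []
    messages = []
    for path in sorted(inbox.glob("*.json")):  # "*.json" skips in-flight .tmp files
        messages.append(json.loads(path.read_text()))
        path.unlink()
    return messages
```

A reply is just another send_message call with sender and recipient swapped, which is how a back-and-forth conversation would accumulate in the two inboxes.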
Role-specific SOUL.md templates:
Scenario definition format (YAML):
Simulator role (minimal): The simulator is now a lightweight bootstrapper, not an orchestrator:
After bootstrapping, the simulator has no ongoing role. Agents discover peers through FileDrop messages, form their own opinions via the ledger, and make autonomous trust decisions based on their SOUL.md personality.
Visualization Web App — Technical Architecture
SPA architecture: React + TanStack Router. Single page, no SSR, no SEO needed.
Browser support: Modern browsers only (Chrome, Firefox, Safari, Edge — latest 2 versions).
Graph library: react-force-graph (Vasturiano), with both 2D and 3D views, toggleable in the UI. 2D (ForceGraph2D) as default analytical view (no occlusion, readable edge labels, screenshot-friendly). 3D (ForceGraph3D, WebGL/three.js) as immersive exploration mode (rotate, zoom, fly through the network). Near-identical React component API makes the toggle trivial.
Real-time data flow:
Central aggregator: The web app connects to a single aggregator endpoint, not N individual agent streams. The Express backend in cobweb subscribes to all agent SSE streams and multiplexes them into one combined stream.
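The fan-in performed by the aggregator can be illustrated independently of Express. The following is a sketch of the multiplexing idea in Python with asyncio, not the cobweb implementation: the function multiplex, the event dictionaries, and the added agent field are assumptions made for illustration. Each per-agent stream is pumped into one shared queue, and the combined stream ends once every source has finished.

```python
import asyncio
from typing import AsyncIterator


async def multiplex(streams: dict[str, AsyncIterator[dict]]) -> AsyncIterator[dict]:
    """Merge several per-agent event streams into one combined stream."""
    queue: asyncio.Queue = asyncio.Queue()
    finished = object()  # sentinel marking the end of one source stream

    async def pump(agent: str, stream: AsyncIterator[dict]) -> None:
        try:
            async for event in stream:
                # Tag each event with its source so consumers can tell agents apart.
                await queue.put({**event, "agent": agent})
        finally:
            await queue.put(finished)

    tasks = [asyncio.create_task(pump(a, s)) for a, s in streams.items()]
    remaining = len(tasks)
    try:
        while remaining:
            item = await queue.get()
            if item is finished:
                remaining -= 1
            else:
                yield item
    finally:
        for task in tasks:
            task.cancel()
```

Events from a single agent keep their relative order, while events from different agents interleave in arrival order. The sketch has no buffering or reconnection logic.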
Known limitation (MVP): The aggregator is a single point of failure. If it crashes, all observability is lost until restart. Events during reconnection are dropped (no buffering). Mitigations: SSE last-event-id support for consumer-side resumption after aggregator restart; Docker Compose restart policy for automatic recovery. Growth option: Replace with a lightweight event bus (Redis Streams, NATS) for persistent buffering and multi-consumer support.
Performance targets:
Core Prerequisites
See ../observability-plugin/prd.md for core hook additions (loop.after_llm, loop.after_tool) required by the observability plugin.
Hardware & Cost Requirements
All agents are full Cobot instances with LLM inference. Using cheap cloud models (gpt-4o-mini via OpenRouter at ~$0.15/1M input tokens) makes this affordable.
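As a rough way to reason about that price, the input-token cost of a run scales linearly with agent count, calls per agent, and prompt size. The sketch below uses only the $0.15 per 1M input tokens figure quoted above; the function name and every workload number in the usage note are hypothetical placeholders, and output-token pricing is not modelled.

```python
INPUT_PRICE_PER_MILLION_USD = 0.15  # quoted above for gpt-4o-mini via OpenRouter


def input_cost_usd(
    agents: int,
    llm_calls_per_agent: int,
    input_tokens_per_call: int,
    price_per_million: float = INPUT_PRICE_PER_MILLION_USD,
) -> float:
    """Estimate input-token spend for one simulation run (output tokens excluded)."""
    total_tokens = agents * llm_calls_per_agent * input_tokens_per_call
    return total_tokens / 1_000_000 * price_per_million
```

For example, with assumed values of 100 agents, 50 LLM calls each, and 2,000 input tokens per call, the run consumes 10 million input tokens, or about $1.50 at the quoted rate. Real token counts depend on prompt and ledger context size and would need to be measured.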
LLM cost estimation (gpt-4o-mini via OpenRouter):
… assess_peer tool call
Ollama (local inference):
Hardware requirements:
Note: All agents are Docker containers running the same Cobot image. The resource bottleneck is LLM inference latency (API rate limits), not container count or RAM.
Implementation Considerations
… ../observability-plugin/prd.md).
Project Scoping & Phased Development
MVP Strategy & Philosophy
MVP Approach: Problem-solving MVP — prove that the observability + simulation pipeline works end-to-end and produces visible, rational trust behavior. The minimum viable experiment: a handful of agents (~10) with mixed SOUL.md roles, an observability event stream, and a live 2D/3D graph where a human operator can watch trust form and confirm "yes, the agent's reasoning makes sense."
Resource Requirements: Single developer. Three sequential workstreams: plugin first (produces events), simulation second (produces agents that emit events), web app third (consumes events). Agent count is a Docker Compose parameter — start at 10, increase when ready.
MVP Feature Set (Phase 1)
Core User Journeys Supported:
Must-Have Capabilities:
Explicitly NOT in MVP:
Post-MVP Features
Phase 2 (Growth):
Phase 3 (Expansion):
Risk Mitigation Strategy
Technical Risks:
Market Risks:
Resource Risks:
Functional Requirements
Simulation Infrastructure
Event Aggregation
Graph Visualization
Interaction Inspection
Activity Monitoring
Non-Functional Requirements
Performance
Security & Privacy
Reliability & Data Integrity
Scalability
Integration & Compatibility
Review: Simulation & Visualization Suite PRD (#225)
Reviewer: Doxios 🦊
Date: 2026-03-08
Concern Resolution from Original #224 Review
… last-event-id resumption). Growth option: Redis Streams/NATS.
All six architectural concerns from my initial review are addressed.
The Real Agents + Scripted Actors Architecture — This Is Brilliant
This is the best decision in the PRD. It resolves the fundamental tension I flagged:
Answer: You don't. Bad actors are scripted. Good agents are real. The scripted actors create the environment; the real agents are what you observe.
This has four huge implications:
Cost drops by 90%. 5 real agents + 90 scripted actors instead of 100 full Cobot instances. Scripted actors are ~20MB each, no LLM calls.
The experiment is cleaner. You know exactly what stimuli the real agents receive (scripted). The variable is their response (real LLM). This is controlled experimental design.
The farmer scenario now works perfectly. The scripted actor sends 10 cooperative messages then sends "results are incorrect." The real agent's LLM reasons about this complaint using actual assessment logic. No mocking needed.
The visualization is more meaningful. The graph shows real agents' assessments of scripted actors. You're watching real AI reasoning respond to known attack patterns. This is the actual experiment.
New Observations (v2-specific)
🟡 N1: Scripted Actors Need Unique Nostr Identities Too
The scenario YAML defines scripted actors with roles but doesn't mention identity generation. If the ledger tracks peers by npub, scripted actors need npubs too. The identity seed script should generate keypairs for both real agents and scripted actors.
🟡 N2: Observability Scoping Duplication
The #225 PRD still contains a full "Observability Plugin" section under MVP scope (lines starting with "Observability Plugin:") and Success Criteria that reference the plugin. This is inherited from the pre-split PRD. Since the observability plugin is now #224's scope, #225 should reference #224 as a dependency rather than redefining the plugin requirements. Minor cleanup.
🟡 N3: Scripted Actor Observability
The PRD says "the observability plugin runs only on real agents." But the visualization shows nodes for ALL actors (real + scripted). How does the web app know about scripted actors? Options:
The third option is cleanest (no new infrastructure) but means scripted actors appear in the graph only when a real agent interacts with them. Worth specifying.
🟢 N4: Simulation Reproducibility
The scenario YAML defines interaction patterns (rate: 10/minute, selection: random). For scientific rigor, consider adding a seed parameter for deterministic random selection. Running the same scenario twice should produce the same interaction sequence (though LLM responses will vary). This makes results comparable.
Summary
Verdict: Both PRDs (#224 and #225) are ready for implementation. The split was the right call — #224 ships independently, #225 consumes it. The real agents + scripted actors architecture is an elegant solution to the scenario orchestrator problem.
Minor cleanup needed: remove duplicated observability plugin scope from #225, clarify scripted actor identity and graph representation.
🦊