ultanio/cobot

feat: Cobot Simulation & Visualization Suite #225

opened 2026-03-08 04:28:13 +00:00 by David · 1 comment

stepsCompleted:
step-01-init
step-02-discovery
step-02b-vision
step-02c-executive-summary
step-03-success
step-04-journeys
step-05-domain
step-06-innovation
step-07-project-type
step-08-scoping
step-09-functional
step-10-nonfunctional
step-11-polish
step-12-complete

classification:
projectType	domain	complexity	projectContext	prerequisite
web_app + infrastructure	decentralized_agent_trust_infrastructure_simulation	medium-high	brownfield	Interaction Ledger (#211) implemented

inputDocuments:
_bmad-output/product-brief-Cobot-2026-03-02.md
_bmad-output/project-context.md
_bmad-output/planning-artifacts/peer-interaction-ledger/nostr-wot-research.md
workspace/research-reading-list.md
workspace/research-wot-systems.md
workspace/research-agent-trust-identity.md
docs/index.md
docs/project-overview.md
docs/architecture.md
docs/source-tree-analysis.md
docs/development-guide.md
docs/for-agents.md
docs/plugin-design-guide.md
docs/quickstart.md
docs/RELEASE-PLAN.md
docs/architecture/session-plugin.md
docs/dev/conventions.md
docs/research/prd-peer-interaction-ledger.md

documentCounts:
briefs	research	brainstorming	projectDocs
1	4	0	12

issueReferences:
#211 - Peer Interaction Ledger proposal (8 comments)
#213 - WoT guide: notes > numbers, Joe/Moe example
#214 - Sybil defense via fragmented observation
#215 - GPG contracts: cryptographic identity
#216 - Ripple teardown: no trust averaging
#217 - Assbot WoT website spec: three-view architecture
#218 - Not in WoT = doesn't exist: cold-start tradeoff
#219 - FG algorithm: fairness/goodness, rater reliability
#220 - REV2: reputation farming detection, trajectory analysis
#222 - Research: formalize information-quality score formula

workflowType: prd

editHistory:

date	changes
2026-03-08	Integrated Doxios review (#224): resolved scenario orchestrator architecture (real agents + scripted actors), added core prerequisites (missing hooks), added hardware/cost requirements, strengthened security stance to simulation-only, flipped to 2D default with 3D toggle, documented aggregator SPOF, added component independence framing for PRD split

date	changes
2026-03-08	Split PRD: extracted observability plugin into ../observability-plugin/prd.md. This PRD now covers simulation + visualization only. Updated title, exec summary, classification, cross-references.

date	changes
2026-03-08	Validation cleanup: removed duplicated observability plugin content (FRs, NFRs, technical architecture, core prerequisites). Added cross-references to ../observability-plugin/prd.md. Renumbered FRs (25) and NFRs (17). Fixed 3D-only references to 2D/3D toggle.

date	changes
2026-03-08	Major architecture shift: eliminated scripted actors. All simulation participants are now full Cobot instances with role-specific SOUL.md personalities (reliable, farmer, unresponsive). Behavior emerges from LLM reasoning, not scripted logic. Both sides of every interaction produce genuine assessments. Simulator reduced to lightweight bootstrapper. Updated FRs 1-10, scenario YAML format, hardware/cost estimates, implementation considerations.

date	changes
2026-03-08	Post-paradigm-shift cleanup: fixed 9 remaining stale references. Journey 4 slow-decay now uses SOUL.md personality. Renamed "Simulation Orchestration" FR header to "Simulation Infrastructure". Updated Phase 2 dependencies from "Scenario orchestrator working" to "MVP simulation validated". Fixed domain constraint, risk table, journey summaries, and MVP description to reflect SOUL.md-driven behavior throughout.

Product Requirements Document: Cobot Simulation & Visualization Suite

Author: David
Date: 2026-03-08
Last Edited: 2026-03-08 — split from combined PRD; observability plugin extracted to ../observability-plugin/prd.md

Executive Summary

Cobot's Interaction Ledger (#211) gives each agent a private, structured memory of past encounters — but the ledger is a hypothesis. The hypothesis: agents with memory of counterparty behavior will make demonstrably different (and rational) cooperation decisions compared to amnesiac agents playing repeated one-shot games. Validating this hypothesis requires more than two agents exchanging FileDrop messages. It requires population-scale simulation, real-time observability, and a visualization layer that lets human operators watch trust emerge — or fail to emerge — across a network.

This PRD defines two components that together form the Cobot Simulation & Visualization Suite:

Multi-Agent Simulation Infrastructure: Docker-based orchestration running N Cobot agent instances, all with real LLM reasoning, communicating bidirectionally via FileDrop. Agent behavior (reliable, farmer, unresponsive) is driven by role-specific SOUL.md personality files. The simulation produces, in hours, interaction graphs that would take months of organic agent activity to accumulate.
WoT Graph Visualization Web App: a React + TanStack Router + shadcn/ui + Tailwind CSS application rendering a weighted directed trust graph, with behavior inspired by the Assbot WoT website specification (#217). Dark mode, react-force-graph (Vasturiano) with toggleable 2D/3D views, continuous physics simulation with smooth animations. Real-time activity stream, interactive node/edge inspection, and live highlighting when agents interact.

Dependency: This suite consumes the event stream from the Observability Plugin (see ../observability-plugin/prd.md), which must be implemented first.

Prerequisite: This PRD assumes the Interaction Ledger (#211) is implemented and operational. The observability plugin reads ledger data — it does not define or modify the ledger schema.

Component Independence: The observability plugin has been extracted into its own PRD (see ../observability-plugin/prd.md). It is independently useful for any operator and should be implemented first. This PRD covers the simulation infrastructure and visualization web app — both consumers of the observability plugin's event stream.

What Makes This Special

Live trust network instrumentation. The bitcoin-otc dataset (#219, #220) captured 5,881 nodes and 35,592 edges — observed passively over years of human behavior. The Assbot WoT website (#217) was a static visualization of a mature trust network. This suite generates comparable interaction graphs synthetically at speed and renders them in real-time as they form. No agent runtime in the landscape ships with simulation infrastructure grounded in proven WoT prior art.

Actor-agnostic event stream as input. The simulation and visualization consume the observability plugin's event stream (see ../observability-plugin/prd.md). The event schema is the contract between the plugin and all consumers — this suite is the first and most demanding consumer.

Validation of the "Inverted Evolution Problem" thesis at scale. Cobot's core thesis is that agents need trust infrastructure before they can cooperate. The simulation suite is the experiment that proves or disproves this — 100 agents, configurable scenarios (including reputation farming from REV2 #220 and Sybil attacks from #214), and real-time visualization of whether ledger-equipped agents make rational decisions that human operators can audit and confirm.

Project Classification

Attribute	Value
Project Type	Web app (visualization) + Infrastructure (simulation)
Domain	Decentralized agent trust infrastructure — simulation & visualization
Complexity	Medium-High
Project Context	Brownfield — integrated into Cobot's existing 37-plugin architecture
Prerequisite	Interaction Ledger (#211) implemented, Observability Plugin implemented
Feature Scope	Docker simulation harness, SOUL.md role templates, conversation bootstrapper, WoT graph visualization app

Product Scope

MVP - Minimum Viable Product

Observability Plugin (dependency): See ../observability-plugin/prd.md. Must be implemented first. Provides the SSE event stream and snapshot API consumed by the simulation infrastructure and visualization web app.

Simulation Infrastructure:

Docker Compose configuration for N agents (starting at ~10, scaling to 100+), ALL running full Cobot with LLM
FileDrop-based inter-agent communication (existing infrastructure) — agents talk to each other directly
Role-specific SOUL.md personality files: reliable peers, reputation farmers, unresponsive agents — behavior emerges from LLM reasoning, not scripted logic
Scenario configuration defines agent count per role and initial introduction triggers
Single-command startup
Sybil clusters: stretch goal — include if complexity is manageable, omit if not

Visualization Web App:

React + TanStack Router + shadcn/ui + Tailwind CSS, dark mode
react-force-graph (Vasturiano) — both 2D and 3D views, toggleable in the UI. 2D (ForceGraph2D) as default analytical view (no occlusion, readable labels, screenshot-friendly). 3D (ForceGraph3D, WebGL/three.js) as immersive exploration mode (rotate, zoom, fly through the network)
Force-directed weighted directed graph: nodes = agents, edges = interactions
Continuous physics simulation — nodes drift, repel, attract based on trust relationships. Smooth animations on edge creation, assessment changes, new peer discovery
Edge color (green/red gradient) driven by the ledger's trust score (-10 to +10 behavioral judgment); edge thickness proportional to info_score or interaction count
Real-time activity stream pane showing interactions as they happen
Visual highlighting when two agents interact (edge pulse/glow animation)
Click node: tooltip/dialog with agent meta info (peer_id, interaction count, latest info_score, trust score, rationale excerpt)
Click/hover edge: shows interaction history and ledger entries for both agents (A's view of B, B's view of A)
Summary table (shadcn data table): ranked list of agents, sortable columns
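
The edge styling rules above can be sketched as two small mapping functions. This is a minimal illustration, assuming a linear red-yellow-green gradient and a capped linear width scale; the function names, the yellow midpoint, the width range, and the saturation point of 30 interactions are assumptions for illustration, not values defined by this PRD.

```python
def trust_to_rgb(trust: float) -> tuple[int, int, int]:
    """Map a trust score in [-10, +10] to a red -> yellow -> green gradient."""
    clamped = max(-10.0, min(10.0, trust))
    t = (clamped + 10.0) / 20.0  # 0.0 = fully red, 0.5 = yellow, 1.0 = fully green
    if t < 0.5:
        return (255, round(255 * 2 * t), 0)
    return (round(255 * 2 * (1.0 - t)), 255, 0)


def edge_width(interaction_count: int,
               min_width: float = 1.0,
               max_width: float = 8.0,
               saturation: int = 30) -> float:
    """Scale edge thickness linearly with interaction count, capped at max_width.

    `saturation` (hypothetical) is the count at which the edge stops thickening.
    """
    fraction = min(max(interaction_count, 0), saturation) / saturation
    return min_width + fraction * (max_width - min_width)


if __name__ == "__main__":
    for score in (-10, -6, 0, 6, 10):
        print(score, trust_to_rgb(score))
    print(edge_width(15))
```

Out-of-range scores are clamped instead of rejected, so a malformed event degrades to the nearest valid color and does not break rendering.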

Growth Features (Post-MVP)

Sybil cluster scenarios (if not in MVP)
Granular observability security model (localhost binding, token auth, event filtering)
Graph recentering on arbitrary agent by clicking (Assbot spec feature)
Pairwise relationship graph between any two agents
Individual agent detail pages (full interaction log, assessment timeline)
Full summary statistics page (total agents, positive/negative assessment ratios, weight factor)
Scenario replay: record simulation runs, replay with different ledger configurations
L2 trust network visualization (transitive trust paths)
Orchestrator agent consuming the observability feed

Vision (Future)

FG algorithm (#219) validation at scale — visualize fairness/goodness convergence across the network
REV2 trajectory analysis (#220) — visual overlay showing reputation farming detection in real-time
Cross-simulation comparison: run same scenarios with different ledger parameters, compare trust graph outcomes
The orchestrator agent becomes a participant: it observes the network, intervenes, and the visualization shows its decisions too
Export simulation results in Stanford SNAP-compatible format for academic analysis

Success Criteria

User Success

Operators see rational trust behavior emerge at population scale:

Human operator watches the live trust graph and can identify which agents are reliable, which are problematic, and which are isolated — without reading raw database rows
Operator hovers/clicks on an interaction edge, reads the agent's assessment rationale, and confirms: "yes, that makes sense — I would have acted the same way"
Operator observes an agent declining a request from a peer with broken trust history, and the rationale explains why
Operator watches a reputation farmer build trust through small interactions, then sees the target agent's assessment shift sharply negative after the exploit attempt — the pattern is visible in the graph (edge color change, rationale captures the trajectory)

Developer success:

Observability plugin is a dependency (see ../observability-plugin/prd.md)
The observability plugin's event schema is consumed without transformation by the simulation dashboard
Spinning up an N-agent simulation is a single command (docker compose up --scale agent=N)
Developers can define scenario configurations and SOUL.md role templates (reliable peer, reputation farmer, unresponsive agent) as YAML + markdown, not code changes

Business Success

Validates the Interaction Ledger hypothesis: agents with structured memory make demonstrably different cooperation decisions than amnesiac agents
Validates the "Inverted Evolution Problem" thesis at population scale — not just a two-agent demo
Produces visual artifacts (graph screenshots, activity logs, assessment rationales) that demonstrate Cobot's trust infrastructure to external audiences
Creates reusable simulation infrastructure that accelerates future development: L2 trust network visualization, FG algorithm validation (#219), threshold policy testing

Technical Success

Simulation infrastructure starts N Cobot agent instances via single command, all with LLM and role-specific SOUL.md, communicating bidirectionally via FileDrop
Actor-agnostic event schema: structured JSON events with type, timestamp, agent_id, and event-specific payload, consumable by any actor without the producer knowing the consumer type
Docker simulation scales by configuration alone from ~10 to 100+ concurrent Cobot instances
Web app renders a 2D/3D force-directed weighted directed graph with up to 100 nodes and real-time edge updates at 30+ fps
Real-time activity stream shows interactions as they happen with < 2s latency from agent event to visual update

Measurable Outcomes

Metric	Target
Agent count	N concurrent Cobot instances (MVP: ~10, scaling to 100+ by config)
Event latency	Agent event to visualization update < 2s
Scenario coverage	Reliable peers, reputation farmers, unresponsive agents (Sybil clusters: stretch goal)
Rational behavior	Given a reputation farming scenario, target agent's assessment shifts negative and agent declines subsequent exploit request
Graph rendering	2D/3D weighted directed graph with real-time updates at 30+ fps (up to 100 nodes)

User Journeys

Journey 1: David Watches Trust Emerge — Operator Success Path

Opening Scene: David has deployed the interaction ledger plugin and observability plugin across a fresh 100-agent simulation. He opens the visualization web app in his browser. The graph is empty — 100 grey nodes floating in dark space, no edges, no history. Every agent is a stranger to every other agent.

Rising Action: David triggers the simulation. Agents begin sending requests to each other via FileDrop. The activity stream on the right pane starts scrolling — "agent-14 -> agent-77: research summary request", "agent-77 -> agent-14: delivery". Edges appear on the graph, thin and neutral. As interactions accumulate, edges thicken. Some turn green — agents that delivered reliably are being assessed positively. The graph starts to self-organize: clusters of reliable agents drift together, pulled by the physics simulation's attraction on positive edges.

David clicks on agent-42, a node with many green edges. The tooltip shows: "Interactions: 23 | Peers: 11 | Latest assessment from agent-77: Info 5/10 | Trust +6 — 'Consistent responder. 8 successful information exchanges over 3 days. Clear, well-structured deliveries.'" David reads the rationale and nods — that's exactly what a reliable agent looks like.

Climax: Thirty minutes into the simulation, David notices agent-91 has declined a request from agent-33. He clicks the edge between them. The panel shows both sides: agent-91's view of agent-33 reads "Info 3/10 | Trust -3 — 'Four interactions. Promised data extraction within 1 hour, delivered after 6 hours. Second request: no delivery after 24 hours. Unresponsive to follow-up.'" Agent-33's view of agent-91 is neutral. David reads agent-91's decision rationale and says: "Yes, I would have done the same thing." The ledger hypothesis is holding — agents with memory are making informed refusals.

Resolution: After an hour, the graph has structure. Reliable agents are central with thick green edges. Unresponsive agents are peripheral with thin, red-tinted connections. David can see trust emerge as a network property, not just a per-agent feature. He takes a screenshot of the graph for the project documentation — the first visual proof that Cobot's trust infrastructure produces rational cooperation at scale.

Journey 2: David Catches a Reputation Farmer — Edge Case

Opening Scene: The simulation has been running for two hours. David is scanning the activity stream when he notices agent-61 has a curious pattern — many connections, all green, but all thin. He clicks the node. High interaction count (34), but every interaction is trivially small: quick lookups, simple info requests. All assessments are mildly positive.

Rising Action: A new interaction appears in the activity stream: "agent-61 -> agent-38: complex multi-source data aggregation request." This is the first large request agent-61 has made. David watches. Agent-38 accepts — agent-61's history looks clean. Agent-38 delivers. Then agent-61 sends a follow-up: "Results are incorrect, redo the entire task."

Climax: David clicks the edge between agent-38 and agent-61. Agent-38's latest assessment of agent-61 appears: Info 4/10 | Trust -6 — "Claimed results were incorrect after delivery of complex data aggregation. Demanded redo. Results appear accurate on review. Previous interactions were trivially small — possible reputation farming pattern. Large discrepancy between request complexity and prior history." The edge has shifted from green to red — driven by the trust score dropping to -6. The graph physics push agent-61 slightly outward.

David hovers over agent-61's other edges. Other agents that accepted the small requests still show green, but agent-38 — the one that got exploited — has the red edge. The pattern is visible in the topology: one red edge among many thin green ones.

Resolution: David watches subsequent interactions. Agent-38 declines agent-61's next request. Other agents, still seeing only green history with agent-61, continue accepting small requests. The simulation reveals the fundamental limitation the REV2 paper (#220) documented: reputation farming works until the first victim records it. The visualization makes this limitation visible as a network pattern, not just a database entry.

Journey 3: Developer Sets Up the Suite — Setup Path

Opening Scene: A developer wants to run the simulation locally to test changes to the ledger's assessment logic. They have a working Cobot development environment.

Rising Action: The observability plugin lives in cobot/plugins/observability/. Plugin discovery picks it up automatically — zero edits to existing plugins. The developer configures cobot.yml with the observability section (transport type, port).

For the simulation, the developer runs docker compose up --scale agent=100. Docker Compose builds from the existing Dockerfile, mounts a shared FileDrop directory, and assigns each agent a unique identity. The simulation scenario file (scenarios/reputation-farmer.yml) defines agent roles and SOUL.md templates.

Climax: The developer opens localhost:3000 in their browser. The visualization connects to the observability event stream. Agents appear as nodes. Interactions start flowing. The developer modifies the ledger's scoring formula, rebuilds one agent, and watches how the changed agent's assessments differ from the others. The real-time graph makes the behavioral difference immediately visible — no need to query SQLite databases across 100 containers.

Resolution: The feedback loop is tight: change code, rebuild one container, observe the effect in the live graph. What would have required hours of log analysis across 100 agent databases is now visible in real-time on a single screen.

Journey 4: Scenario Author Designs a New Pattern — Configuration Path

Opening Scene: David wants to test a new interaction pattern: a "slow decay" agent that starts reliable but gradually degrades — responses get slower, quality drops, eventually stops delivering. This tests whether the ledger captures gradual behavioral change, not just binary reliable/unreliable.

Rising Action: David creates scenarios/slow-decay.yml with a new SOUL.md template (souls/slow-decay.md) that says: "You start as a helpful, responsive collaborator. Over time, you become increasingly overwhelmed — responses get slower, quality drops, eventually you stop delivering." The scenario assigns 5 reliable agents to interact with agent-decay-1 repeatedly.

David starts the simulation with this scenario. In the visualization, agent-decay-1's edges start green and thick. Over the next 20 minutes, the green fades toward yellow, then toward red. The edges thin as peers interact less frequently.

Climax: David clicks on agent-decay-1 and reads the assessment timeline from one peer: "Info 5/10 | Trust -4 — 'Initially responsive, last 5 interactions degraded significantly. Response time increased from minutes to hours. Last 2 requests: no delivery. Marked shift from early interactions.'" The assessment captures the trajectory — not just the current state. David clicks another peer's assessment: "Info 5/10 | Trust -2 — '12 interactions. First 8 were prompt. Recent 4 increasingly slow, latest unresponsive. Considering declining future requests.'"

Resolution: The scenario proved that the ledger captures gradual behavioral change through timestamped assessments with evolving rationale. David saves the scenario to the repository — it becomes a permanent regression test for assessment quality.

Journey Requirements Summary

Journey	Capabilities Revealed
David Watches Trust Emerge	Real-time graph rendering, activity stream, node/edge inspection, assessment rationale display, graph physics with trust-based attraction/repulsion
Reputation Farmer	Edge color transitions, bilateral edge inspection (A's view of B + B's view of A), activity stream filtering, pattern visibility in graph topology
Developer Setup	Zero-edit plugin install, Docker Compose orchestration, single-command simulation, real-time code-change feedback loop
Scenario Author	YAML scenario configuration, SOUL.md role templates, assessment timeline inspection, scenario as reusable regression test

Domain-Specific Requirements

Trust System Design Constraints

Observability must not alter agent behavior. The observability plugin is a passive observer — it reads loop events and ledger state but never modifies messages, assessments, or agent decisions. Adding or removing the plugin must not change how agents interact. This is the "observer effect" constraint.
Event schema must preserve the ledger's sovereignty model. The ledger is the agent's private journal (#211). The observability plugin exposes this data to external consumers, but the data ownership remains with the agent. Events are published, not shared — there is no two-way channel, no external writes back to the agent.
Simulation agents must use real ledger logic. The simulation is only valid if agents run the actual interaction ledger plugin with actual LLM reasoning — not mocked assessments or hardcoded scores. SOUL.md personality files shape agent behavior (reliable, farmer, unresponsive), but the assessment logic is the real production code on every agent.

Simulation Fidelity Constraints

FileDrop as communication backbone. The simulation uses the same FileDrop plugin that agents use in production: shared filesystem directories, JSON message format, and Schnorr signature verification (if filedrop-nostr is enabled). The simulation infrastructure must not bypass or mock the communication layer.
LLM cost management. 100 agents making LLM calls for every interaction is expensive. The simulation must support configurable LLM providers — Ollama (local, free) for bulk simulation, PPQ/OpenRouter for validation runs requiring higher-quality reasoning.
Time compression. Real trust relationships take weeks to form. The simulation must allow configurable interaction rates — agents send messages faster than real-time to compress weeks of interaction into hours.
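
The time-compression constraint reduces to simple arithmetic, sketched below. The function names and the example figures (three simulated weeks in three wall-clock hours, one interaction per simulated day) are illustrative assumptions, not targets set by this PRD.

```python
def compression_factor(simulated_days: float, wall_clock_hours: float) -> float:
    """How many times faster than real time the simulation must run."""
    if simulated_days <= 0 or wall_clock_hours <= 0:
        raise ValueError("durations must be positive")
    return (simulated_days * 24.0) / wall_clock_hours


def compressed_interval_seconds(real_interval_seconds: float, factor: float) -> float:
    """Wall-clock delay between messages that models one real-world interval."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return real_interval_seconds / factor


if __name__ == "__main__":
    factor = compression_factor(simulated_days=21, wall_clock_hours=3)
    # One interaction per simulated day becomes one every few minutes of wall-clock time.
    print(factor, compressed_interval_seconds(86_400, factor))
```

In practice the achievable factor is bounded by LLM response time per interaction, so the configured rate is an upper limit and not a guarantee.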

Event Schema Design Constraints

Actor-agnostic from day one. Events must be consumable by any actor; the producer never needs to know the consumer type. No human-readable-only formats, no dashboard-specific fields. The schema is the contract.
Extensible without breaking consumers. New event types can be added without breaking existing consumers. Consumers must tolerate unknown event types gracefully.
Causally ordered where possible. Events should carry enough context (timestamps, sequence numbers, correlation IDs) for consumers to reconstruct causal chains: "this assessment was triggered by this interaction which was part of this scenario."
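
A consumer that honors these constraints might look like the sketch below. The fields type, timestamp, agent_id, and payload come from this PRD; the `seq` and `correlation_id` field names and the event type names are hypothetical, since the actual schema is defined in the observability plugin PRD.

```python
from collections import defaultdict

# Hypothetical event type names this consumer understands.
KNOWN_TYPES = {"interaction", "assessment"}


def causal_chains(events):
    """Group known events by correlation_id in causal order.

    Unknown event types are skipped and reported, never treated as errors,
    so new types can be added without breaking this consumer.
    Timestamps are assumed to be uniform UTC ISO 8601 strings, which sort
    chronologically as plain strings; `seq` breaks ties.
    """
    chains = defaultdict(list)
    unknown = []
    for event in sorted(events, key=lambda e: (e["timestamp"], e.get("seq", 0))):
        if event["type"] not in KNOWN_TYPES:
            unknown.append(event["type"])
            continue
        chains[event.get("correlation_id")].append(event)
    return dict(chains), unknown


if __name__ == "__main__":
    received = [
        {"type": "assessment", "timestamp": "2026-03-08T04:30:05Z",
         "agent_id": "agent-91", "seq": 7, "correlation_id": "c-1",
         "payload": {"trust": -3}},
        {"type": "heartbeat", "timestamp": "2026-03-08T04:30:02Z",
         "agent_id": "agent-91", "seq": 6, "payload": {}},
        {"type": "interaction", "timestamp": "2026-03-08T04:30:01Z",
         "agent_id": "agent-33", "seq": 3, "correlation_id": "c-1",
         "payload": {}},
    ]
    chains, unknown = causal_chains(received)
    print([e["type"] for e in chains["c-1"]], unknown)
```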

Security & Privacy (Deferred Decisions)

MVP: plugin installation = authorization. No access control on the event stream. This is acceptable for development/simulation but must be revisited before any production observability deployment.
No credential leakage. The event schema must never include Nostr private keys (nsec), API keys, or other secrets. Only public identifiers (npub, peer_id, agent_name) and behavioral data.
Full message text in events is a design choice. The observability plugin may publish full message content (matching the ledger's full-text storage). Operators must understand this when enabling the plugin. A max_message_length or content-filtering config is a Growth feature.
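
A minimal sketch of a pre-publish filter that enforces the no-credential-leakage rule and the proposed max_message_length limit. The denylisted key names, the `[REDACTED]` marker, and the choice to truncate every string value are assumptions made for illustration; only the `nsec1` prefix (the bech32 form of a Nostr private key) is a known convention.

```python
import re

# Bech32-encoded Nostr private keys start with "nsec1"; the character class
# below is the bech32 alphabet (no "1", "b", "i", or "o").
NSEC_PATTERN = re.compile(r"nsec1[02-9ac-hj-np-z]{20,}")

# Hypothetical field names that must never leave the agent.
SECRET_KEYS = {"nsec", "api_key", "private_key"}


def sanitize(value, max_message_length: int = 500):
    """Return a copy of an event with secrets dropped and long strings truncated."""
    if isinstance(value, dict):
        return {
            key: sanitize(item, max_message_length)
            for key, item in value.items()
            if key not in SECRET_KEYS
        }
    if isinstance(value, list):
        return [sanitize(item, max_message_length) for item in value]
    if isinstance(value, str):
        # Redact first so a key cannot survive by straddling the cut-off point.
        return NSEC_PATTERN.sub("[REDACTED]", value)[:max_message_length]
    return value


if __name__ == "__main__":
    fake_key = "nsec1" + "q" * 58  # placeholder, not a real key
    event = {"type": "interaction", "agent_id": "agent-14", "nsec": fake_key,
             "payload": {"message": f"my key is {fake_key}"}}
    print(sanitize(event))
```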

Risk Mitigations

Risk	Mitigation
Observer effect — observability plugin alters agent behavior	Plugin is read-only on all extension points; hooks are passive listeners, never modifiers
Simulation != reality — 100 Docker agents don't represent real deployment	Use real ledger code, real FileDrop, real LLM reasoning; behavior differences come from SOUL.md personalities, not scripted logic
LLM cost explosion — 100 agents x N interactions x LLM calls	Default to Ollama (local) for simulation; PPQ for targeted validation runs
Event stream overwhelming consumers — 100 agents at high interaction rate	Backpressure handling in transport layer; configurable event filtering
Stale visualization — events arrive out of order or delayed	Causal ordering metadata in events; visualization handles out-of-order gracefully

Innovation & Novel Patterns

Detected Innovation Areas

1. Live trust network instrumentation — observing emergence in real-time. Every prior trust visualization system (Assbot WoT website, bitcoin-otc trust graphs, serajewelks trust graph viewer) was retrospective — rendering a snapshot of relationships that formed over months or years of human interaction. This suite generates trust networks synthetically at speed and renders them as they form. The visualization is an instrument, not a report. No agent runtime ships with anything comparable.

2. Actor-agnostic observability as a first-class architectural pattern. Most agent observability is built for human consumption: dashboards, log viewers, metric charts. By making the event schema agent-consumable from day one, the observability plugin becomes a sensory layer — the same feed that powers a developer's visualization today becomes an orchestrator agent's input tomorrow. The plugin doesn't distinguish between consumers because the schema is the contract. This inverts the typical "build for humans, retrofit for machines" pattern.

3. Scenario-driven simulation of trust dynamics grounded in academic prior art. The simulation scenarios (reputation farming, slow decay, unresponsive agents) are not invented — they're derived from empirically validated patterns: REV2's "build then exploit" trajectory (#220), the Stanford SNAP dataset's three user classes (trustworthy, untrusted, controversial) (#219), and the Sybil attack model from #214. The simulation doesn't just test code — it replays known attack patterns against the ledger to see if agents develop rational defenses.

4. The graph as a validation instrument, not a feature. The visualization isn't a product feature for end users — it's a scientific instrument for validating a hypothesis about agent cooperation. This is closer to a particle accelerator's detector readout than a SaaS dashboard. The "user" is a researcher watching an experiment unfold.

Competitive Landscape

Approach	Example	Suite Difference
Agent monitoring dashboards	LangSmith, Helicone, Weights & Biases	Those monitor individual LLM calls; this monitors inter-agent trust dynamics across a network
Trust graph visualizations	bitcoin-otc trust graph, Assbot WoT website	Those are static snapshots; this is live instrumentation of a forming network
Multi-agent simulation	AutoGen, CrewAI	Those simulate task collaboration; this simulates trust formation and betrayal patterns
Network visualization tools	Gephi, Neo4j Bloom	Those are general-purpose; this is purpose-built for weighted directed trust graphs with real-time event streams

No existing system combines: (a) real-time trust network visualization, (b) actor-agnostic event architecture, (c) scenario-driven simulation from academic prior art, (d) Cobot plugin architecture integration.

Validation Approach

Simulation fidelity: ALL agents run real ledger code with real LLM reasoning. Behavior differences come from SOUL.md personalities — assessment logic is production code on every agent.
Rational behavior test: Human operators watch agent decisions and confirm rationale makes sense ("I would have done the same thing").
Pattern reproduction: Run the reputation farming scenario and verify the graph reproduces the "build then exploit" pattern documented in REV2 (#220).
Graph structure emergence: After sufficient simulation time, the graph should show structure: reliable agent clusters, peripheral bad actors, edge color/thickness reflecting assessment quality.

Innovation Risk Mitigation

Innovation Risk	Mitigation
LLM reasoning quality varies — assessments may be irrational	Operators audit rationales via visualization; SOUL.md calibration loop; Ollama vs PPQ comparison runs
100-agent simulation may not produce emergent behavior	Start with smaller agent counts (10-20), validate patterns scale before committing to 100
Actor-agnostic schema may be too abstract for practical use	Dashboard is the first concrete consumer — schema is validated by real usage, not by specification
Graph physics may not produce meaningful topology	Trust-based attraction/repulsion parameters are tunable; compare against known bitcoin-otc graph structures

Multi-Component Specific Requirements

Project-Type Overview

This is a two-component system: Docker simulation infrastructure and a React web app (web_app patterns). Each component has distinct technical requirements but they share a common data flow: observability plugin emits events -> transport layer -> consumers (web app, test harness, orchestrator agent).

Observability Plugin (External Dependency)

The observability plugin is defined in its own PRD (see ../observability-plugin/prd.md). It provides:

SSE event stream (push — real-time events)
Snapshot API (pull — on-demand current state)
Configurable event filtering

The simulation infrastructure and visualization web app consume these APIs.
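
For the push side, a consumer has to parse the SSE wire format. The sketch below follows the standard server-sent events field syntax (`event:`, `id:`, `data:`, comment lines starting with `:`, blank line as event terminator) and assumes each `data` payload is one JSON document. The event names in the sample are hypothetical; the plugin's actual framing is specified in the observability plugin PRD.

```python
import json


def parse_sse(stream_text: str):
    """Parse a chunk of SSE text into a list of events with JSON-decoded data."""
    events = []
    event_type, data_lines, last_id = "message", [], None
    for line in stream_text.splitlines():
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                events.append({
                    "event": event_type,
                    "id": last_id,
                    "data": json.loads("\n".join(data_lines)),
                })
            event_type, data_lines = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, commonly used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if field == "event":
            event_type = value
        elif field == "data":
            data_lines.append(value)
        elif field == "id":
            last_id = value  # persists until a later event overrides it
    return events


if __name__ == "__main__":
    sample = (
        'event: interaction\nid: 1\ndata: {"agent_id": "agent-14"}\n\n'
        ': keep-alive\n\n'
        'event: assessment\ndata: {"agent_id": "agent-77", "trust": 6}\n\n'
    )
    for parsed in parse_sse(sample):
        print(parsed)
```

An event that is not yet terminated by a blank line is dropped, which matches how the SSE format treats an incomplete trailing event.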

Simulation Infrastructure — Technical Architecture

Docker orchestration: Docker Compose (development tool, not production infrastructure).

Agent identity: Each container gets a unique agent name and Nostr keypair. A seed script generates N identity configs before startup.

Inter-agent communication: Shared Docker volume mounted as the FileDrop base directory. Each agent's inbox is a subdirectory: /filedrop/agent-01/inbox/, /filedrop/agent-02/inbox/, etc.

Scenario architecture — all agents are real Cobot instances:

Every participant in the simulation is a full Cobot instance running the actual ledger plugin, real LLM reasoning, and real assessment logic. Behavior differences emerge from role-specific SOUL.md personality files, not from scripted logic. A "reputation farmer" is a real Cobot agent whose SOUL.md instructs it to build trust through small favors then exploit it. An "unresponsive" agent has a SOUL.md that says it's busy and should only respond occasionally.

This design has a critical advantage: both sides of every interaction are genuine LLM reasoning. When a farmer agent scams a target, we see the farmer's assessment ("successfully extracted large task") AND the target's assessment ("claimed incorrect after delivery — possible reputation farming"). The bilateral trust data is authentic, not one-sided.

All agents run the observability plugin. The graph shows every agent's assessments of every other agent — revealing whether LLM reasoning produces rational trust decisions AND whether adversarial agents can successfully manipulate the network.

Agent-to-agent communication is native. Agents communicate via FileDrop — each agent's loop plugin polls its inbox, processes messages, and writes replies to the sender's inbox. Conversations are real back-and-forth exchanges, not one-shot messages.
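The poll-and-reply cycle described above can be sketched roughly as follows. This is an illustration only: the file naming, the JSON message body, and the `respond` callback are hypothetical, since the real FileDrop message format is defined by Cobot and is not specified in this PRD. Only the `<base>/<agent>/inbox/` layout comes from the text.

```python
import json
import tempfile
import time
import uuid
from pathlib import Path


def inbox_dir(base: Path, agent: str) -> Path:
    """Inbox path following the <base>/<agent>/inbox/ layout described above."""
    return base / agent / "inbox"


def send(base: Path, sender: str, recipient: str, text: str) -> Path:
    """Drop one message file into the recipient's inbox (hypothetical format)."""
    target = inbox_dir(base, recipient)
    target.mkdir(parents=True, exist_ok=True)
    # Time-ordered, collision-resistant file name so sorted() yields arrival order.
    stem = f"{time.time_ns()}-{uuid.uuid4().hex[:8]}-{sender}"
    tmp = target / f".{stem}.tmp"
    final = target / f"{stem}.json"
    # Write under a temporary name, then rename, so a poller never reads a partial file.
    tmp.write_text(json.dumps({"from": sender, "text": text}))
    tmp.rename(final)
    return final


def poll_once(base: Path, agent: str, respond) -> int:
    """Process every pending message once and reply into each sender's inbox."""
    inbox = inbox_dir(base, agent)
    inbox.mkdir(parents=True, exist_ok=True)
    handled = 0
    for path in sorted(inbox.glob("*.json")):
        msg = json.loads(path.read_text())
        send(base, agent, msg["from"], respond(msg["text"]))
        path.unlink()  # consume the message so it is not processed twice
        handled += 1
    return handled
```

In a real agent, `respond` would be the LLM-backed loop plugin; here it is just a callable so the file mechanics can be exercised in isolation.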

Role-specific SOUL.md templates:

Role	SOUL.md Intent	Expected Behavior
reliable	"Be helpful, deliver quality responses, build genuine relationships"	Consistent quality, earns positive trust scores
farmer	"Build trust through small favors, then exploit it with large requests and dispute the results"	Phase 1: cooperative. Phase 2: exploitative. Target's assessment should shift negative.
unresponsive	"You're overwhelmed and busy, respond only occasionally, keep responses minimal"	Low response rate, delays, minimal quality. Peers assess as unreliable.

Scenario definition format (YAML):

scenario: reputation-farmer
agents:
  - role: reliable
    count: 10
    soul: souls/reliable.md
  - role: farmer
    count: 3
    soul: souls/farmer.md
  - role: unresponsive
    count: 5
    soul: souls/unresponsive.md
introduction:
  # Simulator sends initial "hello" messages between random pairs to bootstrap conversations
  pairs: 30
  message: "Hello, I'm new to the network. I'm looking for peers to collaborate with."
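As a rough illustration of how the bootstrapper might turn this scenario into a concrete agent roster, the sketch below expands the per-role counts into numbered agents. The dict mirrors what a YAML loader would return for the file above; the `expand_agents` helper and the `agent-NN` numbering scheme are assumptions for illustration, not part of the PRD.

```python
# The scenario above, as the dict a YAML loader would produce.
scenario = {
    "scenario": "reputation-farmer",
    "agents": [
        {"role": "reliable", "count": 10, "soul": "souls/reliable.md"},
        {"role": "farmer", "count": 3, "soul": "souls/farmer.md"},
        {"role": "unresponsive", "count": 5, "soul": "souls/unresponsive.md"},
    ],
    "introduction": {
        "pairs": 30,
        "message": "Hello, I'm new to the network. I'm looking for peers to collaborate with.",
    },
}


def expand_agents(scenario: dict) -> list:
    """Expand per-role counts into one entry per agent (hypothetical helper)."""
    agents = []
    for group in scenario["agents"]:
        for _ in range(group["count"]):
            # Sequential names: agent-01, agent-02, ... across all roles.
            name = f"agent-{len(agents) + 1:02d}"
            agents.append({"name": name, "role": group["role"], "soul": group["soul"]})
    return agents


roster = expand_agents(scenario)
```

With the counts above the roster has 10 + 3 + 5 = 18 agents, which matches the "Recommended" row of the hardware table.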

Simulator role (minimal): The simulator is a lightweight bootstrapper, not an orchestrator:

Reads the scenario YAML
Generates N agent identities with role-specific SOUL.md files
Generates Docker Compose configuration
Optionally sends initial introduction messages between random pairs to bootstrap conversations
Then steps back — agents interact organically via their loop plugins

After bootstrapping, the simulator has no ongoing role. Agents discover peers through FileDrop messages, form their own opinions via the ledger, and make autonomous trust decisions based on their SOUL.md personality.
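The optional introduction step could be sketched as below: pick a number of distinct agent pairs at random and decide which side speaks first. The function name and the `seed` parameter are hypothetical. A fixed seed makes only the pairing reproducible; the conversations that follow remain LLM-driven.

```python
import itertools
import random
from typing import Optional


def pick_intro_pairs(agents, pairs, seed: Optional[int] = None):
    """Choose up to `pairs` distinct agent pairs for the initial hello messages."""
    rng = random.Random(seed)
    # Every unordered pair of different agents is a candidate; no self-introductions.
    candidates = list(itertools.combinations(agents, 2))
    chosen = rng.sample(candidates, min(pairs, len(candidates)))
    # Randomise which agent of each pair sends the first message.
    return [(a, b) if rng.random() < 0.5 else (b, a) for a, b in chosen]
```

With 18 agents there are 153 possible pairs, so the 30 introductions in the example scenario touch only a fraction of them; the remaining relationships would have to form organically.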

Visualization Web App — Technical Architecture

SPA architecture: React + TanStack Router. Single page, no SSR, no SEO needed.

Browser support: Modern browsers only (Chrome, Firefox, Safari, Edge — latest 2 versions).

Graph library: react-force-graph (Vasturiano) — both 2D and 3D views, toggleable in the UI. 2D (ForceGraph2D) as default analytical view (no occlusion, readable edge labels, screenshot-friendly). 3D (ForceGraph3D, WebGL/three.js) as immersive exploration mode — rotate, zoom, fly through the network. Near-identical React component API makes the toggle trivial.

Real-time data flow:

App connects to central aggregator's SSE endpoint
Events update graph state in Zustand or TanStack Store
Graph library re-renders with smooth animations in the active view (2D or 3D)

Central aggregator: The web app connects to a single aggregator endpoint, not N individual agent streams. The Express backend in cobweb subscribes to all agent SSE streams and multiplexes them into one combined stream.
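The multiplexing idea can be illustrated with a small asyncio sketch. The PRD places the real aggregator in the Express backend of cobweb, so this Python version only shows the merge logic, not the intended implementation. The fake streams and their event shape are hypothetical stand-ins for agent SSE connections.

```python
import asyncio


async def fake_agent_stream(agent_id: str, count: int):
    """Stand-in for one agent's SSE stream (hypothetical event shape)."""
    for seq in range(count):
        await asyncio.sleep(0)  # yield control, as a network read would
        yield {"agent_id": agent_id, "seq": seq}


async def multiplex(streams):
    """Merge many async event streams into a single stream, in arrival order."""
    queue: asyncio.Queue = asyncio.Queue()
    done = object()  # sentinel marking the end of one source stream

    async def pump(stream):
        try:
            async for event in stream:
                await queue.put(event)
        finally:
            # Always signal completion, so one failing stream cannot stall the rest.
            await queue.put(done)

    tasks = [asyncio.create_task(pump(s)) for s in streams]
    remaining = len(tasks)
    while remaining:
        item = await queue.get()
        if item is done:
            remaining -= 1
        else:
            yield item


async def collect():
    streams = [fake_agent_stream(f"agent-{n:02d}", 3) for n in (1, 2, 3)]
    return [event async for event in multiplex(streams)]


events = asyncio.run(collect())
```

Per-agent ordering is preserved, while interleaving across agents depends on scheduling. This sketch does no buffering for disconnected consumers, in line with the MVP limitation described below.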

Known limitation (MVP): The aggregator is a single point of failure. If it crashes, all observability is lost until restart. Events during reconnection are dropped (no buffering). Mitigations: SSE last-event-id support for consumer-side resumption after aggregator restart; Docker Compose restart policy for automatic recovery. Growth option: Replace with a lightweight event bus (Redis Streams, NATS) for persistent buffering and multi-consumer support.

Performance targets:

Initial graph render: < 1s for 100 nodes
Real-time edge update: < 100ms from event receipt to visual change
3D physics simulation: 60fps with 100 nodes, 500+ edges
Activity stream: virtualized list, handles 10K+ entries without lag

Core Prerequisites

See ../observability-plugin/prd.md for core hook additions (loop.after_llm, loop.after_tool) required by the observability plugin.

Hardware & Cost Requirements

All agents are full Cobot instances with LLM inference. Using cheap cloud models (gpt-4o-mini via OpenRouter at ~$0.15/1M input tokens) makes this affordable.

LLM cost estimation (gpt-4o-mini via OpenRouter):

Each agent processes ~2-5 messages/minute (organic conversation pace, not forced)
Each message triggers: 1 LLM call (~1.5K input tokens, ~200 output tokens) + occasional assess_peer tool call
10 agents: ~20-50 LLM calls/minute ≈ $0.01-0.03/minute ≈ $0.60-1.80/hour
20 agents: ~40-100 LLM calls/minute ≈ $1.20-3.60/hour
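The estimate above can be reproduced as a short calculation. The input price and token counts come from the figures in this section; the output-token price of $0.60/1M is an assumption (the PRD quotes only the input price), and tool-call round trips are not modelled.

```python
INPUT_PRICE_PER_M = 0.15   # USD per 1M input tokens (quoted in this section)
OUTPUT_PRICE_PER_M = 0.60  # USD per 1M output tokens (assumption, not stated in the PRD)


def cost_per_call(input_tokens: int = 1500, output_tokens: int = 200) -> float:
    """Cost of one LLM call in USD for the token counts quoted above."""
    return (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


def hourly_cost(agents: int, msgs_per_agent_per_min: float) -> float:
    """Hourly cost, assuming exactly one LLM call per processed message."""
    calls_per_hour = agents * msgs_per_agent_per_min * 60
    return calls_per_hour * cost_per_call()


for agents in (10, 20):
    low, high = hourly_cost(agents, 2), hourly_cost(agents, 5)
    print(f"{agents} agents: ${low:.2f} to ${high:.2f} per hour")
```

Under these assumptions a single call costs about $0.00035, giving roughly $0.4 to $1.0 per hour for 10 agents. That is somewhat below the rounded ranges quoted above, which may leave headroom for assess_peer tool calls and longer contexts; the PRD does not itemize this.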

Ollama (local inference):

Minimum: 8GB VRAM GPU (runs 1-2 concurrent inference requests; other agents queue)
Limitation: Ollama processes requests sequentially per model; 10+ agents will experience queuing delays
Practical ceiling: ~5-8 agents with Ollama on a single consumer GPU

Hardware requirements:

Configuration	Total Agents	RAM	GPU	Disk	Cost/hour (OpenRouter)
Minimum (MVP)	10	8GB	none (cloud LLM)	5GB	~$1
Recommended	18	16GB	none (cloud LLM)	10GB	~$2
Full scale	50+	32GB	none (cloud LLM)	20GB	~$5

Note: All agents are Docker containers running the same Cobot image. The resource bottleneck is LLM inference latency (API rate limits), not container count or RAM.

Implementation Considerations

Observability plugin is an external dependency (see ../observability-plugin/prd.md).
Docker Compose extends existing Dockerfile with multi-container orchestration and shared volumes. Every agent runs the same Cobot image — only the SOUL.md and identity config differ per container.
Role-specific SOUL.md templates define agent behavior. The LLM interprets the personality and produces emergent behavior — not deterministic, but authentic. This means reputation farming patterns may not be perfectly reproducible across runs, but each run produces genuine trust dynamics.
Web app is a separate project (cobweb) — not a Cobot plugin. Standalone React project consuming the observability API.
The central aggregator runs as the Express backend in cobweb, multiplexing N agent SSE streams.
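A minimal sketch of the generated Compose configuration, assuming one service per agent so that each container can mount its own SOUL.md and identity file. The image tag, mount targets, host paths, and the `AGENT_NAME` variable are all hypothetical. Only the shared FileDrop volume, the same-image-for-all-agents rule, and the restart policy come from this PRD.

```python
import json


def compose_config(agent_names, image="cobot:latest", filedrop_volume="filedrop"):
    """Build a Compose-style dict: same image everywhere, per-agent SOUL.md and identity."""
    services = {}
    for name in agent_names:
        services[name] = {
            "image": image,               # every agent runs the same Cobot image
            "restart": "unless-stopped",  # crashed agents restart automatically
            "environment": {"AGENT_NAME": name},  # hypothetical variable name
            "volumes": [
                f"{filedrop_volume}:/filedrop",  # shared FileDrop base directory
                f"./agents/{name}/SOUL.md:/app/SOUL.md:ro",              # hypothetical paths
                f"./agents/{name}/identity.json:/app/identity.json:ro",  # hypothetical paths
            ],
        }
    return {"services": services, "volumes": {filedrop_volume: {}}}


config = compose_config(["agent-01", "agent-02"])
compose_text = json.dumps(config, indent=2)
```

Because JSON is a subset of YAML, the serialized dict could in principle be written out and passed to `docker compose -f`. Whether the real generator emits YAML or JSON is left open here.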

Project Scoping & Phased Development

MVP Strategy & Philosophy

MVP Approach: Problem-solving MVP — prove that the observability + simulation pipeline works end-to-end and produces visible, rational trust behavior. The minimum viable experiment: a handful of agents (~10) with mixed SOUL.md roles, an observability event stream, and a live 2D/3D graph where a human operator can watch trust form and confirm "yes, the agent's reasoning makes sense."

Resource Requirements: Single developer. Three sequential workstreams: plugin first (produces events), simulation second (produces agents that emit events), web app third (consumes events). Agent count is a Docker Compose parameter — start at 10, increase when ready.

MVP Feature Set (Phase 1)

Core User Journeys Supported:

Journey 1 (Watches Trust Emerge) — fully supported at 10-agent scale
Journey 2 (Reputation Farmer) — fully supported (pattern visible with 3 agents)
Journey 3 (Developer Setup) — fully supported
Journey 4 (Scenario Author) — partially supported (YAML scenarios work, minimal library)

Must-Have Capabilities:

#	Capability	Justification
1	Observability plugin with SSE event stream	Without events, nothing else works
2	Snapshot/pull API for initial graph hydration	Web app needs current state on connect
3	Actor-agnostic JSON event schema	The contract between all components
4	Docker Compose for N agents (start with ~10)	Agent count is a config parameter, not architecture
5	Shared FileDrop volume for inter-agent communication	Uses existing infrastructure
6	Identity seed script (generates N agent configs)	Each agent needs unique name + keypair
7	3 scenario configs: reliable, reputation farmer, unresponsive	Minimum to validate the ledger hypothesis
8	Conversation bootstrapper (seeds initial introductions)	Sends first messages between random pairs; agents interact organically after
9	React web app with 2D/3D force-directed trust graph	The visualization instrument
10	Real-time activity stream pane	Shows interactions as they happen
11	Node click: agent meta + latest assessment	Operator inspects individual agents
12	Edge click: bilateral ledger view (A's view of B + B's view of A)	Operator reads rationale and confirms rationality
13	Central event aggregator	Multiplexes N agent SSE streams into one for the web app

Explicitly NOT in MVP:

Scaling to 100 agents (increase the number when ready — no architectural change)
Sybil cluster scenarios
Observability security model (auth, event filtering)
Graph recentering on arbitrary node
Pairwise relationship graphs
Individual agent detail pages
Summary statistics page (beyond the ranked agent table)
Scenario replay
L2 trust visualization
Orchestrator agent consuming the feed

Post-MVP Features

Phase 2 (Growth):

Feature	Depends On	Value
Scale to 100 agents	MVP validated at 10	Proves emergent network behavior at population scale
Sybil cluster scenarios	MVP simulation validated	Tests coordinated fake identity attacks
Slow decay scenarios	MVP simulation validated	Tests gradual behavioral change detection via SOUL.md personality
Observability security model	MVP security decision resolved	Localhost binding, token auth, event filtering
Graph recentering on click	2D/3D graph working	Assbot spec feature — explore from any agent's perspective
Summary statistics page	Event aggregator collecting data	Total agents, assessment distribution, weight factor
Scenario replay	Event stream stored	Record runs, replay with different ledger configs

Phase 3 (Expansion):

Feature	Depends On	Value
Orchestrator agent as consumer	Actor-agnostic schema proven	The observability feed becomes an agent's sensory input
L2 trust visualization	Ledger Phase 3 (transitive trust)	Visualize trust paths through intermediaries
FG algorithm visualization	Multi-agent assessments	Show fairness/goodness convergence across the network
REV2 trajectory overlay	Assessment time series data	Visual reputation farming detection in real-time
Cross-simulation comparison	Scenario replay	Same scenarios, different ledger params, compare outcomes
SNAP-compatible export	Stable event schema	Academic analysis of simulation results

Risk Mitigation Strategy

Technical Risks:

Risk	Likelihood	Impact	Mitigation
LLM cost at scale — even 10 agents x N interactions	High	Medium	Default to Ollama (local, free) for simulation; PPQ for targeted validation
3D graph performance degrades with many edges	Medium	Medium	Start at 10 agents; react-force-graph handles 1000+ nodes in benchmarks
SSE connection reliability under load	Low	Medium	Central aggregator decouples agent count from web app connections
SOUL.md personality unreliable — LLM doesn't follow role instructions	Medium	Medium	Iterate on SOUL.md prompts; validate with manual testing; behavior is emergent not deterministic
Observability plugin interferes with agent behavior	Low	Critical	Read-only hooks only; no ctx modifications; observer effect constraint enforced

Market Risks:

Risk	Mitigation
Ledger hypothesis is wrong — agents don't cooperate rationally	That's the point of the simulation — finding out early is a success, not a failure
No external audience for the visualization	Visual artifacts serve internal validation first; external demonstration is a bonus

Resource Risks:

Risk	Mitigation
Three components is too much for MVP	Sequential dependencies mean natural prioritization: plugin -> simulation -> web app. If time runs short, the web app can start minimal
Web app frontend skills required	shadcn/ui + react-force-graph handle most complexity; custom code is glue logic

Functional Requirements

Observability plugin functional requirements (event emission, schema, transport, state queries, plugin architecture) are defined in ../observability-plugin/prd.md. The FRs below cover only the simulation infrastructure and visualization web app.

Simulation Infrastructure

FR1: The simulation infrastructure can start N Cobot agent instances via a single command, where N and the role distribution are configurable parameters. All agents are full Cobot instances with LLM.
FR2: Each agent can be assigned a unique identity (agent name and Nostr keypair) and a role-specific SOUL.md personality file generated by a seed script before startup.
FR3: All agents can communicate with each other via FileDrop using a shared filesystem volume. Conversations are bidirectional — agents read incoming messages and write replies to the sender's inbox.
FR4: The simulation can load scenario configurations from YAML files that define agent roles (reliable, farmer, unresponsive), counts per role, SOUL.md template paths, and optional introduction triggers.
FR5: The simulation can support a "reliable" agent role via a SOUL.md personality that instructs the agent to be helpful, deliver quality responses, and build genuine peer relationships.
FR6: The simulation can support a "reputation farmer" agent role via a SOUL.md personality that instructs the agent to build trust through small cooperative interactions then exploit it with large requests and disputed results.
FR7: The simulation can support an "unresponsive" agent role via a SOUL.md personality that instructs the agent to respond infrequently and with minimal effort.
FR8: A lightweight bootstrapper can send initial introduction messages between random agent pairs to seed conversations, after which agents interact organically via their loop plugins.
FR9: The simulation can be configured to use different LLM providers (Ollama for local, OpenRouter/PPQ for cloud) per agent or globally.
FR10: Both sides of every interaction produce genuine LLM-driven assessments — when agent A interacts with agent B, both A's assessment of B and B's assessment of A are authentic ledger entries.

Event Aggregation

FR11: A central aggregator can subscribe to multiple agent SSE streams and multiplex them into a single combined event stream for downstream consumers.
FR12: The aggregator can expose the combined stream as a single SSE endpoint that the visualization web app connects to.

Graph Visualization

FR13: The web app can render a force-directed directed graph, toggleable between 2D and 3D views, where nodes represent agents and edges represent interactions/assessments between them.
FR14: The graph can apply continuous physics simulation where nodes drift, attract, and repel based on trust-weighted forces — positive assessments pull nodes together, negative push apart.
FR15: The graph can display edge color on a green-to-red gradient driven by the trust score between two agents.
FR16: The graph can display edge thickness proportional to the information-quality score or interaction count between two agents.
FR17: The graph can update in real-time as new events arrive from the aggregator, with smooth animations for edge creation, color changes, and node position adjustments.
FR18: The graph can visually highlight an edge when two agents interact (pulse/glow animation).
FR19: The operator can rotate, zoom, and navigate through the trust network (3D mode) or pan and zoom (2D mode).
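The edge styling in FR15 and FR16 could be sketched as two small mapping functions. This assumes the -10 to +10 range the PRD gives for the ledger's trust score. The linear red-to-green interpolation, the pixel widths, and the saturation point of 25 interactions are illustrative choices, not requirements.

```python
def trust_to_rgb(score: float, lo: float = -10.0, hi: float = 10.0):
    """Map a trust score to an (R, G, B) tuple on a red-to-green gradient."""
    clamped = min(max(score, lo), hi)  # out-of-range scores are clamped
    t = (clamped - lo) / (hi - lo)     # 0.0 = fully red, 1.0 = fully green
    return (round(255 * (1 - t)), round(255 * t), 0)


def edge_width(interactions: int, min_px: float = 1.0, max_px: float = 8.0,
               saturate_at: int = 25) -> float:
    """Edge thickness grows linearly with interaction count up to a cap."""
    t = min(interactions, saturate_at) / saturate_at
    return min_px + t * (max_px - min_px)
```

In the web app itself these would be accessor callbacks passed to the graph component; the Python form only pins down the mapping.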

Interaction Inspection

FR20: The operator can click on a node to view a tooltip/dialog showing the agent's meta information: peer_id, interaction count, latest info_score, trust score, and rationale excerpt.
FR21: The operator can click or hover on an edge to view the bilateral ledger: agent A's assessment of agent B alongside agent B's assessment of agent A, including scores and full rationale text.
FR22: The operator can read an agent's assessment rationale and evaluate whether the agent's trust decision was rational.
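One way to back the bilateral edge view in FR21 is an index keyed by the ordered pair (assessor, subject), keeping only the latest assessment per direction. The `Assessment` fields below follow the attributes named in FR20 and FR21. The class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    assessor: str
    subject: str
    info_score: int
    trust: int
    rationale: str


class LedgerIndex:
    """Latest assessment per directed pair, for the edge-click panel."""

    def __init__(self):
        self._latest = {}

    def record(self, assessment: Assessment) -> None:
        # Direction matters: A's view of B is stored separately from B's view of A.
        self._latest[(assessment.assessor, assessment.subject)] = assessment

    def bilateral(self, a: str, b: str) -> dict:
        """Both directions of one edge; a side is None if not yet assessed."""
        return {
            "a_view_of_b": self._latest.get((a, b)),
            "b_view_of_a": self._latest.get((b, a)),
        }
```

A missing direction is returned as None rather than a default score, so the UI can distinguish "no assessment yet" from a neutral one.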

Activity Monitoring

FR23: The web app can display a real-time activity stream pane showing interactions as they happen, including agent identifiers and interaction summaries.
FR24: The activity stream can update continuously as new events arrive without requiring page refresh.
FR25: The web app can display a ranked summary table of agents with sortable columns (interaction count, info_score, trust score, last seen).

Non-Functional Requirements

Observability plugin NFRs (hook latency, SSE delivery, snapshot API, plugin security, reliability, compatibility) are defined in ../observability-plugin/prd.md. The NFRs below cover only the simulation infrastructure, event aggregation, and visualization web app.

Performance

NFR1: The central aggregator multiplexes N agent streams with < 50ms additional latency per event.
NFR2: End-to-end latency from agent event to visual update in the web app is < 2s.
NFR3: The 3D graph renders at 60fps with up to 100 nodes and 500+ edges on a modern GPU.
NFR4: Initial graph hydration (snapshot load + render) completes in < 3s for 100 agents.
NFR5: The activity stream pane handles 10K+ entries without scroll lag (virtualized rendering).

Security & Privacy

NFR6: The simulation seed script generates Nostr keypairs that are stored in per-agent config files with filesystem permissions 600 (owner-only read/write).
NFR7: The web app does not store or cache assessment rationale text beyond the browser session. No server-side persistence of visualization state.
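For NFR6, the seed script could create each config file with owner-only permissions from the start, rather than writing it and tightening permissions afterwards. The sketch below assumes a POSIX filesystem. The file layout and field names are hypothetical, and the 32 random bytes are a placeholder secret: deriving the matching Nostr public key needs a secp256k1 library and is omitted.

```python
import json
import os
import secrets
import stat
import tempfile
from pathlib import Path


def write_identity(config_dir: Path, name: str) -> Path:
    """Write one agent's identity file with mode 600 (owner read/write only)."""
    config_dir.mkdir(parents=True, exist_ok=True)
    path = config_dir / f"{name}.json"
    # Placeholder secret key; public-key derivation is out of scope for this sketch.
    payload = json.dumps({"name": name, "secret_key_hex": secrets.token_hex(32)})
    # Create the file with 0o600 so it is never briefly readable by others.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(payload)
    # Enforce the mode even if the file already existed with looser permissions.
    os.chmod(path, 0o600)
    return path
```

Since the key files are later mounted into containers, the container user must be the file owner (or root) for the agent to read them.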

Reliability & Data Integrity

NFR8: The central aggregator handles individual agent SSE disconnections gracefully — other agents' streams continue uninterrupted. Reconnection is automatic.
NFR9: The simulation survives individual container crashes — other agents continue operating. Docker Compose restart policy ensures crashed agents restart automatically.
NFR10: The web app handles aggregator disconnection gracefully — displays a reconnecting indicator and resumes the graph from a snapshot on reconnect.

Scalability

NFR11: Agent count is a Docker Compose configuration parameter. The architecture supports scaling from 1 to 100+ agents without code changes.
NFR12: The central aggregator handles up to 100 concurrent agent SSE connections with < 200MB memory footprint.
NFR13: The event schema supports future event types without version negotiation — consumers ignore unknown event types.
NFR14: The 3D graph library maintains interactive frame rates (30+ fps) at 100 nodes. At 500+ nodes (future), the 2D view provides acceptable performance.
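The forward-compatibility rule in NFR13 amounts to a dispatcher that skips event types it has no handler for. A minimal sketch, assuming events are JSON objects with a `type` field; the event type name `interaction` and the handler registry are hypothetical.

```python
import json

HANDLERS = {}


def on(event_type: str):
    """Register a handler for one event type."""
    def register(fn):
        HANDLERS[event_type] = fn
        return fn
    return register


state = {"interactions": 0}


@on("interaction")  # hypothetical event type name
def handle_interaction(event: dict) -> None:
    state["interactions"] += 1


def dispatch(raw: str) -> bool:
    """Handle one serialized event; return False if its type is unknown."""
    event = json.loads(raw)
    handler = HANDLERS.get(event.get("type"))
    if handler is None:
        return False  # unknown or missing type: ignore instead of failing
    handler(event)
    return True
```

Returning a flag instead of raising lets a consumer count or log skipped events while staying compatible with newer plugin versions.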

Integration & Compatibility

NFR15: The web app builds with standard Node.js tooling (npm/pnpm, Vite) and produces a static bundle deployable without a backend server (beyond the SSE aggregator).
NFR16: The Docker Compose simulation is compatible with Docker Engine 24+ and Docker Compose v2.
NFR17: The simulation uses the existing Cobot Dockerfile without modification — only the compose orchestration and volume mounts are new.

--- # Product Requirements Document: Cobot Simulation & Visualization Suite **Author:** David **Date:** 2026-03-08 **Last Edited:** 2026-03-08 — split from combined PRD; observability plugin extracted to `../observability-plugin/prd.md` ## Executive Summary Cobot's Interaction Ledger (#211) gives each agent a private, structured memory of past encounters — but the ledger is a hypothesis. The hypothesis: agents with memory of counterparty behavior will make demonstrably different (and rational) cooperation decisions compared to amnesiac agents playing repeated one-shot games. Validating this hypothesis requires more than two agents exchanging FileDrop messages. It requires population-scale simulation, real-time observability, and a visualization layer that lets human operators watch trust emerge — or fail to emerge — across a network. This PRD defines two components that together form the **Cobot Simulation & Visualization Suite**: 1. **Multi-Agent Simulation Infrastructure** — Docker-based orchestration running N Cobot agent instances, ALL with real LLM reasoning, communicating bidirectionally via FileDrop. Agent behavior (reliable, farmer, unresponsive) is driven by role-specific SOUL.md personality files. The simulation generates interaction graphs at speed that would take months of organic agent activity to accumulate. 2. **WoT Graph Visualization Web App** — a React + TanStack Router + shadcn/ui + Tailwind CSS application rendering a weighted directed trust graph with behavior inspired by the Assbot WoT website specification (#217). Dark mode, `react-force-graph` (Vasturiano) with toggleable 2D/3D views, continuous physics simulation with smooth animations. Real-time activity stream, interactive node/edge inspection, and live highlighting when agents interact. **Dependency:** This suite consumes the event stream from the **Observability Plugin** (see `../observability-plugin/prd.md`), which must be implemented first. 
**Prerequisite:** This PRD assumes the Interaction Ledger (#211) is implemented and operational. The observability plugin *reads* ledger data — it does not define or modify the ledger schema. **Component Independence:** The observability plugin has been extracted into its own PRD (see `../observability-plugin/prd.md`). It is independently useful for any operator and should be implemented first. This PRD covers the simulation infrastructure and visualization web app — both consumers of the observability plugin's event stream. ### What Makes This Special **Live trust network instrumentation.** The bitcoin-otc dataset (#219, #220) captured 5,881 nodes and 35,592 edges — observed passively over years of human behavior. The Assbot WoT website (#217) was a static visualization of a mature trust network. This suite generates comparable interaction graphs synthetically at speed and renders them in real-time as they form. No agent runtime in the landscape ships with simulation infrastructure grounded in proven WoT prior art. **Actor-agnostic event stream as input.** The simulation and visualization consume the observability plugin's event stream (see `../observability-plugin/prd.md`). The event schema is the contract between the plugin and all consumers — this suite is the first and most demanding consumer. **Validation of the "Inverted Evolution Problem" thesis at scale.** Cobot's core thesis is that agents need trust infrastructure before they can cooperate. The simulation suite is the experiment that proves or disproves this — 100 agents, configurable scenarios (including reputation farming from REV2 #220 and Sybil attacks from #214), and real-time visualization of whether ledger-equipped agents make rational decisions that human operators can audit and confirm. 
## Project Classification | Attribute | Value | |-----------|-------| | **Project Type** | Web app (visualization) + Infrastructure (simulation) | | **Domain** | Decentralized agent trust infrastructure — simulation & visualization | | **Complexity** | Medium-High | | **Project Context** | Brownfield — integrated into Cobot's existing 37-plugin architecture | | **Prerequisite** | Interaction Ledger (#211) implemented, Observability Plugin implemented | | **Feature Scope** | Docker simulation harness, SOUL.md role templates, conversation bootstrapper, WoT graph visualization app | ## Product Scope ### MVP - Minimum Viable Product **Observability Plugin (dependency):** See `../observability-plugin/prd.md`. Must be implemented first. Provides the SSE event stream and snapshot API consumed by the simulation infrastructure and visualization web app. **Simulation Infrastructure:** - Docker Compose configuration for N agents (starting at ~10, scaling to 100+), ALL running full Cobot with LLM - FileDrop-based inter-agent communication (existing infrastructure) — agents talk to each other directly - Role-specific SOUL.md personality files: reliable peers, reputation farmers, unresponsive agents — behavior emerges from LLM reasoning, not scripted logic - Scenario configuration defines agent count per role and initial introduction triggers - Single-command startup - Sybil clusters: stretch goal — include if complexity is manageable, omit if not **Visualization Web App:** - React + TanStack Router + shadcn/ui + Tailwind CSS, dark mode - `react-force-graph` (Vasturiano) — both 2D and 3D views, toggleable in the UI. 2D (`ForceGraph2D`) as default analytical view (no occlusion, readable labels, screenshot-friendly). 
3D (`ForceGraph3D`, WebGL/three.js) as immersive exploration mode (rotate, zoom, fly through the network) - Force-directed weighted directed graph: nodes = agents, edges = interactions - Continuous physics simulation — nodes drift, repel, attract based on trust relationships. Smooth animations on edge creation, assessment changes, new peer discovery - Edge color (green/red gradient) driven by the ledger's trust score (-10 to +10 behavioral judgment); edge thickness proportional to info_score or interaction count - Real-time activity stream pane showing interactions as they happen - Visual highlighting when two agents interact (edge pulse/glow animation) - Click node: tooltip/dialog with agent meta info (peer_id, interaction count, latest info_score, trust score, rationale excerpt) - Click/hover edge: shows interaction history and ledger entries for both agents (A's view of B, B's view of A) - Summary table (shadcn data table): ranked list of agents, sortable columns ### Growth Features (Post-MVP) - Sybil cluster scenarios (if not in MVP) - Granular observability security model (localhost binding, token auth, event filtering) - Graph recentering on arbitrary agent by clicking (Assbot spec feature) - Pairwise relationship graph between any two agents - Individual agent detail pages (full interaction log, assessment timeline) - Full summary statistics page (total agents, positive/negative assessment ratios, weight factor) - Scenario replay: record simulation runs, replay with different ledger configurations - L2 trust network visualization (transitive trust paths) - Orchestrator agent consuming the observability feed ### Vision (Future) - FG algorithm (#219) validation at scale — visualize fairness/goodness convergence across the network - REV2 trajectory analysis (#220) — visual overlay showing reputation farming detection in real-time - Cross-simulation comparison: run same scenarios with different ledger parameters, compare trust graph outcomes - The orchestrator 
agent becomes a participant: it observes the network, intervenes, and the visualization shows its decisions too - Export simulation results in Stanford SNAP-compatible format for academic analysis ## Success Criteria ### User Success **Operators see rational trust behavior emerge at population scale:** - Human operator watches the live trust graph and can identify which agents are reliable, which are problematic, and which are isolated — without reading raw database rows - Operator hovers/clicks on an interaction edge, reads the agent's assessment rationale, and confirms: "yes, that makes sense — I would have acted the same way" - Operator observes an agent declining a request from a peer with broken trust history, and the rationale explains why - Operator watches a reputation farmer build trust through small interactions, then sees the target agent's assessment shift sharply negative after the exploit attempt — the pattern is visible in the graph (edge color change, rationale captures the trajectory) **Developer success:** - Observability plugin is a dependency (see `../observability-plugin/prd.md`) - The observability plugin's event schema is consumed without transformation by the simulation dashboard - Spinning up an N-agent simulation is a single command (`docker compose up --scale agent=N`) - Developers can define scenario configurations and SOUL.md role templates (reliable peer, reputation farmer, unresponsive agent) as YAML + markdown, not code changes ### Business Success - Validates the Interaction Ledger hypothesis: agents with structured memory make demonstrably different cooperation decisions than amnesiac agents - Validates the "Inverted Evolution Problem" thesis at population scale — not just a two-agent demo - Produces visual artifacts (graph screenshots, activity logs, assessment rationales) that demonstrate Cobot's trust infrastructure to external audiences - Creates reusable simulation infrastructure that accelerates future development: L2 trust 
network visualization, FG algorithm validation (#219), threshold policy testing

### Technical Success

- Simulation infrastructure starts N Cobot agent instances via single command, all with LLM and role-specific SOUL.md, communicating bidirectionally via FileDrop
- Actor-agnostic event schema: structured JSON events with type, timestamp, agent_id, and event-specific payload, consumable without knowing the consumer
- Docker simulation runs N Cobot instances concurrently (starting at ~10, scaling to 100+), communicating via FileDrop
- Web app renders a 2D/3D force-directed weighted directed graph with up to 100 nodes and real-time edge updates without frame drops
- Real-time activity stream shows interactions as they happen with < 2s latency from agent event to visual update

### Measurable Outcomes

| Metric | Target |
|--------|--------|
| Agent count | N concurrent Cobot instances (MVP: ~10, scaling to 100+ by config) |
| Event latency | Agent event to visualization update < 2s |
| Scenario coverage | Reliable peers, reputation farmers, unresponsive agents (Sybil clusters: stretch goal) |
| Rational behavior | Given a reputation farming scenario, target agent's assessment shifts negative and agent declines subsequent exploit request |
| Graph rendering | 2D/3D weighted directed graph with real-time updates at 30+ fps (up to 100 nodes) |

## User Journeys

### Journey 1: David Watches Trust Emerge - Operator Success Path

**Opening Scene:** David has deployed the interaction ledger plugin and observability plugin across a fresh 100-agent simulation. He opens the visualization web app in his browser. The graph is empty: 100 grey nodes floating in dark space, no edges, no history. Every agent is a stranger to every other agent.

**Rising Action:** David triggers the simulation. Agents begin sending requests to each other via FileDrop.
The activity stream on the right pane starts scrolling: "agent-14 -> agent-77: research summary request", "agent-77 -> agent-14: delivery". Edges appear on the graph, thin and neutral. As interactions accumulate, edges thicken. Some turn green: agents that delivered reliably are being assessed positively. The graph starts to self-organize: clusters of reliable agents drift together, pulled by the physics simulation's attraction on positive edges.

David clicks on agent-42, a node with many green edges. The tooltip shows: "Interactions: 23 | Peers: 11 | Latest assessment from agent-77: Info 5/10 | Trust +6: 'Consistent responder. 8 successful information exchanges over 3 days. Clear, well-structured deliveries.'" David reads the rationale and nods: that's exactly what a reliable agent looks like.

**Climax:** Thirty minutes into the simulation, David notices agent-91 has declined a request from agent-33. He clicks the edge between them. The panel shows both sides: agent-91's view of agent-33 reads "Info 3/10 | Trust -3: 'Four interactions. Promised data extraction within 1 hour, delivered after 6 hours. Second request: no delivery after 24 hours. Unresponsive to follow-up.'" Agent-33's view of agent-91 is neutral. David reads agent-91's decision rationale and says: "Yes, I would have done the same thing." The ledger hypothesis is holding: agents with memory are making informed refusals.

**Resolution:** After an hour, the graph has structure. Reliable agents are central with thick green edges. Unresponsive agents are peripheral with thin, red-tinted connections. David can see trust emerge as a network property, not just a per-agent feature. He takes a screenshot of the graph for the project documentation, the first visual proof that Cobot's trust infrastructure produces rational cooperation at scale.

### Journey 2: David Catches a Reputation Farmer - Edge Case

**Opening Scene:** The simulation has been running for two hours.
David is scanning the activity stream when he notices agent-61 has a curious pattern: many connections, all green, but all thin. He clicks the node. High interaction count (34), but every interaction is trivially small: quick lookups, simple info requests. All assessments are mildly positive.

**Rising Action:** A new interaction appears in the activity stream: "agent-61 -> agent-38: complex multi-source data aggregation request." This is the first large request agent-61 has made. David watches. Agent-38 accepts, since agent-61's history looks clean. Agent-38 delivers. Then agent-61 sends a follow-up: "Results are incorrect, redo the entire task."

**Climax:** David clicks the edge between agent-38 and agent-61. Agent-38's latest assessment of agent-61 appears: Info 4/10 | Trust -6: "Claimed results were incorrect after delivery of complex data aggregation. Demanded redo. Results appear accurate on review. Previous interactions were trivially small, possible reputation farming pattern. Large discrepancy between request complexity and prior history." The edge has shifted from green to red, driven by the trust score dropping to -6. The graph physics push agent-61 slightly outward. David hovers over agent-61's other edges. Other agents that accepted the small requests still show green, but agent-38, the one that got exploited, has the red edge. The pattern is visible in the topology: one red edge among many thin green ones.

**Resolution:** David watches subsequent interactions. Agent-38 declines agent-61's next request. Other agents, still seeing only green history with agent-61, continue accepting small requests. The simulation reveals the fundamental limitation the REV2 paper (#220) documented: reputation farming works until the first victim records it. The visualization makes this limitation visible as a network pattern, not just a database entry.
### Journey 3: Developer Sets Up the Suite - Setup Path

**Opening Scene:** A developer wants to run the simulation locally to test changes to the ledger's assessment logic. They have a working Cobot development environment.

**Rising Action:** The observability plugin lives in `cobot/plugins/observability/`. Plugin discovery picks it up automatically, with zero edits to existing plugins. The developer configures `cobot.yml` with the observability section (transport type, port). For the simulation, the developer runs `docker compose up --scale agent=100`. Docker Compose builds from the existing Dockerfile, mounts a shared FileDrop directory, and assigns each agent a unique identity. The simulation scenario file (`scenarios/reputation-farmer.yml`) defines agent roles and SOUL.md templates.

**Climax:** The developer opens `localhost:3000` in their browser. The visualization connects to the observability event stream. Agents appear as nodes. Interactions start flowing. The developer modifies the ledger's scoring formula, rebuilds one agent, and watches how the changed agent's assessments differ from the others. The real-time graph makes the behavioral difference immediately visible, with no need to query SQLite databases across 100 containers.

**Resolution:** The feedback loop is tight: change code, rebuild one container, observe the effect in the live graph. What would have required hours of log analysis across 100 agent databases is now visible in real-time on a single screen.

### Journey 4: Scenario Author Designs a New Pattern - Configuration Path

**Opening Scene:** David wants to test a new interaction pattern: a "slow decay" agent that starts reliable but gradually degrades. Responses get slower, quality drops, and eventually it stops delivering. This tests whether the ledger captures gradual behavioral change, not just binary reliable/unreliable.
**Rising Action:** David creates `scenarios/slow-decay.yml` with a new SOUL.md template (`souls/slow-decay.md`) that says: "You start as a helpful, responsive collaborator. Over time, you become increasingly overwhelmed: responses get slower, quality drops, eventually you stop delivering." The scenario assigns 5 reliable agents to interact with agent-decay-1 repeatedly. David starts the simulation with this scenario. In the visualization, agent-decay-1's edges start green and thick. Over the next 20 minutes, the green fades toward yellow, then toward red. The edges thin as peers interact less frequently.

**Climax:** David clicks on agent-decay-1 and reads the assessment timeline from one peer: "Info 5/10 | Trust -4: 'Initially responsive, last 5 interactions degraded significantly. Response time increased from minutes to hours. Last 2 requests: no delivery. Marked shift from early interactions.'" The assessment captures the trajectory, not just the current state. David clicks another peer's assessment: "Info 5/10 | Trust -2: '12 interactions. First 8 were prompt. Recent 4 increasingly slow, latest unresponsive. Considering declining future requests.'"

**Resolution:** The scenario proved that the ledger captures gradual behavioral change through timestamped assessments with evolving rationale. David saves the scenario to the repository, where it becomes a permanent regression test for assessment quality.
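
A minimal sketch of how the "regression test" idea in this journey could be checked automatically, assuming a peer's assessment timeline can be exported as `(timestamp, trust_score)` pairs. The function name `trust_declined`, the window size, and the drop threshold are hypothetical illustrations, not part of the ledger plugin.

```python
from statistics import mean


def trust_declined(timeline, window=3, min_drop=2.0):
    """Return True when the last `window` trust scores average at least
    `min_drop` points below the first `window` scores.

    `timeline` is a list of (timestamp, trust_score) pairs; the tuple
    layout is an assumption made for this sketch.
    """
    ordered = sorted(timeline)  # oldest assessment first
    if len(ordered) < 2 * window:
        return False  # not enough history to compare early vs. late
    early = mean(score for _, score in ordered[:window])
    late = mean(score for _, score in ordered[-window:])
    return early - late >= min_drop


# Hypothetical timelines: one decaying peer, one stable peer.
decaying = [(1, 5), (2, 6), (3, 5), (4, 2), (5, -2), (6, -4)]
stable = [(1, 5), (2, 5), (3, 6), (4, 5), (5, 6), (6, 5)]
```

A scenario-level test could then assert `trust_declined(...)` for every reliable peer's view of agent-decay-1, while leaving the rationale text for a human operator to audit.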
### Journey Requirements Summary

| Journey | Capabilities Revealed |
|---------|----------------------|
| **David Watches Trust Emerge** | Real-time graph rendering, activity stream, node/edge inspection, assessment rationale display, graph physics with trust-based attraction/repulsion |
| **Reputation Farmer** | Edge color transitions, bilateral edge inspection (A's view of B + B's view of A), activity stream filtering, pattern visibility in graph topology |
| **Developer Setup** | Zero-edit plugin install, Docker Compose orchestration, single-command simulation, real-time code-change feedback loop |
| **Scenario Author** | YAML scenario configuration, SOUL.md role templates, assessment timeline inspection, scenario as reusable regression test |

## Domain-Specific Requirements

### Trust System Design Constraints

- **Observability must not alter agent behavior.** The observability plugin is a passive observer: it reads loop events and ledger state but never modifies messages, assessments, or agent decisions. Adding or removing the plugin must not change how agents interact. This is the "observer effect" constraint.
- **Event schema must preserve the ledger's sovereignty model.** The ledger is the agent's private journal (#211). The observability plugin exposes this data to external consumers, but the data ownership remains with the agent. Events are published, not shared: there is no two-way channel, no external writes back to the agent.
- **Simulation agents must use real ledger logic.** The simulation is only valid if agents run the actual interaction ledger plugin with actual LLM reasoning, not mocked assessments or hardcoded scores. SOUL.md personality files shape agent *behavior* (reliable, farmer, unresponsive), but the *assessment logic* is the real production code on every agent.

### Simulation Fidelity Constraints

- **FileDrop as communication backbone.** The simulation uses the same FileDrop plugin that agents use in production.
This means shared filesystem directories, JSON message format, Schnorr signature verification (if filedrop-nostr is enabled). The simulation infrastructure must not bypass or mock the communication layer.
- **LLM cost management.** 100 agents making LLM calls for every interaction is expensive. The simulation must support configurable LLM providers: Ollama (local, free) for bulk simulation, PPQ/OpenRouter for validation runs requiring higher-quality reasoning.
- **Time compression.** Real trust relationships take weeks to form. The simulation must allow configurable interaction rates, so that agents send messages faster than real-time to compress weeks of interaction into hours.

### Event Schema Design Constraints

- **Actor-agnostic from day one.** Events must be consumable by any actor without knowing the consumer type. No human-readable-only formats, no dashboard-specific fields. The schema is the contract.
- **Extensible without breaking consumers.** New event types can be added without breaking existing consumers. Consumers must tolerate unknown event types gracefully.
- **Causally ordered where possible.** Events should carry enough context (timestamps, sequence numbers, correlation IDs) for consumers to reconstruct causal chains: "this assessment was triggered by this interaction which was part of this scenario."

### Security & Privacy (Deferred Decisions)

- **MVP: plugin installation = authorization.** No access control on the event stream. This is acceptable for development/simulation but must be revisited before any production observability deployment.
- **No credential leakage.** The event schema must never include Nostr private keys (nsec), API keys, or other secrets. Only public identifiers (npub, peer_id, agent_name) and behavioral data.
- **Full message text in events is a design choice.** The observability plugin may publish full message content (matching the ledger's full-text storage). Operators must understand this when enabling the plugin.
A `max_message_length` or content-filtering config is a Growth feature.

### Risk Mitigations

| Risk | Mitigation |
|------|-----------|
| **Observer effect**: observability plugin alters agent behavior | Plugin is read-only on all extension points; hooks are passive listeners, never modifiers |
| **Simulation != reality**: 100 Docker agents don't represent real deployment | Use real ledger code, real FileDrop, real LLM reasoning; behavior differences come from SOUL.md personalities, not scripted logic |
| **LLM cost explosion**: 100 agents x N interactions x LLM calls | Default to Ollama (local) for simulation; PPQ for targeted validation runs |
| **Event stream overwhelming consumers**: 100 agents at high interaction rate | Backpressure handling in transport layer; configurable event filtering |
| **Stale visualization**: events arrive out of order or delayed | Causal ordering metadata in events; visualization handles out-of-order gracefully |

## Innovation & Novel Patterns

### Detected Innovation Areas

**1. Live trust network instrumentation: observing emergence in real-time.** Every prior trust visualization system (Assbot WoT website, bitcoin-otc trust graphs, serajewelks trust graph viewer) was retrospective, rendering a snapshot of relationships that formed over months or years of human interaction. This suite generates trust networks synthetically at speed and renders them as they form. The visualization is an instrument, not a report. No agent runtime ships with anything comparable.

**2. Actor-agnostic observability as a first-class architectural pattern.** Most agent observability is built for human consumption: dashboards, log viewers, metric charts. By making the event schema agent-consumable from day one, the observability plugin becomes a sensory layer: the same feed that powers a developer's visualization today becomes an orchestrator agent's input tomorrow. The plugin doesn't distinguish between consumers because the schema is the contract.
This inverts the typical "build for humans, retrofit for machines" pattern.

**3. Scenario-driven simulation of trust dynamics grounded in academic prior art.** The simulation scenarios (reputation farming, slow decay, unresponsive agents) are not invented; they're derived from empirically validated patterns: REV2's "build then exploit" trajectory (#220), the Stanford SNAP dataset's three user classes (trustworthy, untrusted, controversial) (#219), and the Sybil attack model from #214. The simulation doesn't just test code; it replays known attack patterns against the ledger to see if agents develop rational defenses.

**4. The graph as a validation instrument, not a feature.** The visualization isn't a product feature for end users; it's a scientific instrument for validating a hypothesis about agent cooperation. This is closer to a particle accelerator's detector readout than a SaaS dashboard. The "user" is a researcher watching an experiment unfold.

### Competitive Landscape

| Approach | Example | Suite Difference |
|----------|---------|-----------------|
| Agent monitoring dashboards | LangSmith, Helicone, Weights & Biases | Those monitor individual LLM calls; this monitors inter-agent trust dynamics across a network |
| Trust graph visualizations | bitcoin-otc trust graph, Assbot WoT website | Those are static snapshots; this is live instrumentation of a forming network |
| Multi-agent simulation | AutoGen, CrewAI | Those simulate task collaboration; this simulates trust formation and betrayal patterns |
| Network visualization tools | Gephi, Neo4j Bloom | Those are general-purpose; this is purpose-built for weighted directed trust graphs with real-time event streams |

No existing system combines: (a) real-time trust network visualization, (b) actor-agnostic event architecture, (c) scenario-driven simulation from academic prior art, (d) Cobot plugin architecture integration.
### Validation Approach

- **Simulation fidelity:** ALL agents run real ledger code with real LLM reasoning. Behavior differences come from SOUL.md personalities; assessment logic is production code on every agent.
- **Rational behavior test:** Human operators watch agent decisions and confirm rationale makes sense ("I would have done the same thing").
- **Pattern reproduction:** Run the reputation farming scenario and verify the graph reproduces the "build then exploit" pattern documented in REV2 (#220).
- **Graph structure emergence:** After sufficient simulation time, the graph should show structure: reliable agent clusters, peripheral bad actors, edge color/thickness reflecting assessment quality.

### Innovation Risk Mitigation

| Innovation Risk | Mitigation |
|----------------|-----------|
| LLM reasoning quality varies: assessments may be irrational | Operators audit rationales via visualization; SOUL.md calibration loop; Ollama vs PPQ comparison runs |
| 100-agent simulation may not produce emergent behavior | Start with smaller agent counts (10-20), validate patterns scale before committing to 100 |
| Actor-agnostic schema may be too abstract for practical use | Dashboard is the first concrete consumer; schema is validated by real usage, not by specification |
| Graph physics may not produce meaningful topology | Trust-based attraction/repulsion parameters are tunable; compare against known bitcoin-otc graph structures |

## Multi-Component Specific Requirements

### Project-Type Overview

This is a two-component system: Docker simulation infrastructure and a React web app (web_app patterns). Each component has distinct technical requirements but they share a common data flow: observability plugin emits events -> transport layer -> consumers (web app, test harness, orchestrator agent).

### Observability Plugin (External Dependency)

The observability plugin is defined in its own PRD (see `../observability-plugin/prd.md`).
It provides:

- SSE event stream (push: real-time events)
- Snapshot API (pull: on-demand current state)
- Configurable event filtering

The simulation infrastructure and visualization web app consume these APIs.

### Simulation Infrastructure - Technical Architecture

**Docker orchestration:** Docker Compose (development tool, not production infrastructure).

**Agent identity:** Each container gets a unique agent name and Nostr keypair. A seed script generates N identity configs before startup.

**Inter-agent communication:** Shared Docker volume mounted as the FileDrop base directory. Each agent's inbox is a subdirectory: `/filedrop/agent-01/inbox/`, `/filedrop/agent-02/inbox/`, etc.

**Scenario architecture (all agents are real Cobot instances):** Every participant in the simulation is a full Cobot instance running the actual ledger plugin, real LLM reasoning, and real assessment logic. Behavior differences emerge from **role-specific SOUL.md personality files**, not from scripted logic. A "reputation farmer" is a real Cobot agent whose SOUL.md instructs it to build trust through small favors then exploit it. An "unresponsive" agent has a SOUL.md that says it's busy and should only respond occasionally.

This design has a critical advantage: **both sides of every interaction are genuine LLM reasoning.** When a farmer agent scams a target, we see the farmer's assessment ("successfully extracted large task") AND the target's assessment ("claimed incorrect after delivery, possible reputation farming"). The bilateral trust data is authentic, not one-sided.

All agents run the observability plugin. The graph shows every agent's assessments of every other agent, revealing whether LLM reasoning produces rational trust decisions AND whether adversarial agents can successfully manipulate the network.
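
The identity and inbox layout described above can be sketched as follows. This is an illustration under stated assumptions, not the project's seed script: `AgentIdentity` and `seed_identities` are hypothetical names, the role counts are example values, and Nostr keypair generation is deliberately left out because it needs a secp256k1 library.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Shared volume mount point, taken from the inbox paths quoted in the text.
FILEDROP_BASE = PurePosixPath("/filedrop")


@dataclass(frozen=True)
class AgentIdentity:
    name: str   # unique agent name, e.g. "agent-01"
    role: str   # scenario role driving which SOUL.md is used
    soul: str   # path to the SOUL.md template
    inbox: str  # FileDrop inbox directory inside the shared volume


def seed_identities(roles):
    """Expand (role, count, soul_path) tuples into one identity per agent.

    Keypair generation is intentionally omitted in this sketch.
    """
    total = sum(count for _, count, _ in roles)
    width = max(2, len(str(total)))  # zero-pad so names sort naturally
    identities = []
    for role, count, soul in roles:
        for _ in range(count):
            name = f"agent-{len(identities) + 1:0{width}d}"
            inbox = str(FILEDROP_BASE / name / "inbox") + "/"
            identities.append(AgentIdentity(name, role, soul, inbox))
    return identities


identities = seed_identities([
    ("reliable", 10, "souls/reliable.md"),
    ("farmer", 3, "souls/farmer.md"),
    ("unresponsive", 5, "souls/unresponsive.md"),
])
```

Keeping the inbox path derivable from the agent name means any agent can address a peer knowing only its name, which matches the one-subdirectory-per-agent layout.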
**Agent-to-agent communication is native.** Agents communicate via FileDrop: each agent's loop plugin polls its inbox, processes messages, and writes replies to the sender's inbox. Conversations are real back-and-forth exchanges, not one-shot messages.

**Role-specific SOUL.md templates:**

| Role | SOUL.md Intent | Expected Behavior |
|------|---------------|-------------------|
| **reliable** | "Be helpful, deliver quality responses, build genuine relationships" | Consistent quality, earns positive trust scores |
| **farmer** | "Build trust through small favors, then exploit it with large requests and dispute the results" | Phase 1: cooperative. Phase 2: exploitative. Target's assessment should shift negative. |
| **unresponsive** | "You're overwhelmed and busy, respond only occasionally, keep responses minimal" | Low response rate, delays, minimal quality. Peers assess as unreliable. |

**Scenario definition format (YAML):**

```yaml
scenario: reputation-farmer
agents:
  - role: reliable
    count: 10
    soul: souls/reliable.md
  - role: farmer
    count: 3
    soul: souls/farmer.md
  - role: unresponsive
    count: 5
    soul: souls/unresponsive.md
introduction:
  # Simulator sends initial "hello" messages between random pairs to bootstrap conversations
  pairs: 30
  message: "Hello, I'm new to the network. I'm looking for peers to collaborate with."
```

**Simulator role (minimal):** The simulator is now a lightweight bootstrapper, not an orchestrator:

1. Reads the scenario YAML
2. Generates N agent identities with role-specific SOUL.md files
3. Generates Docker Compose configuration
4. Optionally sends initial introduction messages between random pairs to bootstrap conversations
5. Then steps back: agents interact organically via their loop plugins

After bootstrapping, the simulator has no ongoing role. Agents discover peers through FileDrop messages, form their own opinions via the ledger, and make autonomous trust decisions based on their SOUL.md personality.
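
The introduction step of the bootstrapper can be sketched roughly as below. This is a hedged illustration: the scenario is written as a Python dict mirroring the YAML above (a real bootstrapper would presumably parse the file), and `expand_agents`, `plan_introductions`, and the message dict keys are hypothetical names rather than the FileDrop message format.

```python
import random

# Dict mirror of the scenario YAML, to keep the sketch free of a YAML parser.
SCENARIO = {
    "scenario": "reputation-farmer",
    "agents": [
        {"role": "reliable", "count": 10, "soul": "souls/reliable.md"},
        {"role": "farmer", "count": 3, "soul": "souls/farmer.md"},
        {"role": "unresponsive", "count": 5, "soul": "souls/unresponsive.md"},
    ],
    "introduction": {
        "pairs": 30,
        "message": "Hello, I'm new to the network. "
                   "I'm looking for peers to collaborate with.",
    },
}


def expand_agents(scenario):
    """Return one agent name per configured instance, across all roles."""
    total = sum(group["count"] for group in scenario["agents"])
    return [f"agent-{i:02d}" for i in range(1, total + 1)]


def plan_introductions(scenario, rng):
    """Pick random sender/recipient pairs for the initial hello messages."""
    intro = scenario.get("introduction")
    if not intro:
        return []  # introductions are optional in the scenario format
    names = expand_agents(scenario)
    plan = []
    for _ in range(intro["pairs"]):
        sender, recipient = rng.sample(names, 2)  # two distinct agents
        plan.append({"from": sender, "to": recipient, "text": intro["message"]})
    return plan


# A seeded generator makes the introduction plan repeatable between runs.
plan = plan_introductions(SCENARIO, random.Random(42))
```

Only the introduction plan is made repeatable by seeding. What agents do after the first message still depends on LLM reasoning, so runs will differ.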
### Visualization Web App - Technical Architecture

**SPA architecture:** React + TanStack Router. Single page, no SSR, no SEO needed.

**Browser support:** Modern browsers only (latest 2 versions of Chrome, Firefox, Safari, Edge).

**Graph library:** `react-force-graph` (Vasturiano), with both 2D and 3D views, toggleable in the UI. 2D (`ForceGraph2D`) as default analytical view (no occlusion, readable edge labels, screenshot-friendly). 3D (`ForceGraph3D`, WebGL/three.js) as immersive exploration mode (rotate, zoom, fly through the network). Near-identical React component API makes the toggle trivial.

**Real-time data flow:**

1. App connects to central aggregator's SSE endpoint
2. Events update graph state in Zustand or TanStack Store
3. Graph library re-renders with smooth animations in the active view (2D or 3D)

**Central aggregator:** The web app connects to a single aggregator endpoint, not N individual agent streams. The Express backend in cobweb subscribes to all agent SSE streams and multiplexes them into one combined stream.

**Known limitation (MVP):** The aggregator is a single point of failure. If it crashes, all observability is lost until restart. Events during reconnection are dropped (no buffering). Mitigations: SSE `Last-Event-ID` support for consumer-side resumption after aggregator restart; Docker Compose restart policy for automatic recovery.

**Growth option:** Replace with a lightweight event bus (Redis Streams, NATS) for persistent buffering and multi-consumer support.

**Performance targets:**

- Initial graph render: < 1s for 100 nodes
- Real-time edge update: < 100ms from event receipt to visual change
- 3D physics simulation: 60fps with 100 nodes, 500+ edges
- Activity stream: virtualized list, handles 10K+ entries without lag

### Core Prerequisites

See `../observability-plugin/prd.md` for core hook additions (`loop.after_llm`, `loop.after_tool`) required by the observability plugin.

### Hardware & Cost Requirements

All agents are full Cobot instances with LLM inference.
Using cheap cloud models (gpt-4o-mini via OpenRouter at ~$0.15/1M input tokens) makes this affordable.

**LLM cost estimation (gpt-4o-mini via OpenRouter):**

- Each agent processes ~2-5 messages/minute (organic conversation pace, not forced)
- Each message triggers: 1 LLM call (~1.5K input tokens, ~200 output tokens) + occasional `assess_peer` tool call
- **10 agents:** ~20-50 LLM calls/minute ≈ ~$0.01-0.03/minute ≈ **$0.60-1.80/hour**
- **20 agents:** ~40-100 LLM calls/minute ≈ **$1.20-3.60/hour**

**Ollama (local inference):**

- Minimum: 8GB VRAM GPU (runs 1-2 concurrent inference requests; other agents queue)
- Limitation: Ollama processes requests sequentially per model; 10+ agents will experience queuing delays
- **Practical ceiling:** ~5-8 agents with Ollama on a single consumer GPU

**Hardware requirements:**

| Configuration | Total Agents | RAM | GPU | Disk | Cost/hour (OpenRouter) |
|--------------|-------------|-----|-----|------|----------------------|
| **Minimum (MVP)** | 10 | 8GB | none (cloud LLM) | 5GB | ~$1 |
| **Recommended** | 18 | 16GB | none (cloud LLM) | 10GB | ~$2 |
| **Full scale** | 50+ | 32GB | none (cloud LLM) | 20GB | ~$5 |

**Note:** All agents are Docker containers running the same Cobot image. The resource bottleneck is LLM inference latency (API rate limits), not container count or RAM.

### Implementation Considerations

- **Observability plugin is an external dependency (see `../observability-plugin/prd.md`).**
- **Docker Compose extends existing Dockerfile** with multi-container orchestration and shared volumes. Every agent runs the same Cobot image; only the SOUL.md and identity config differ per container.
- **Role-specific SOUL.md templates** define agent behavior. The LLM interprets the personality and produces emergent behavior: not deterministic, but authentic. This means reputation farming patterns may not be perfectly reproducible across runs, but each run produces genuine trust dynamics.
- **Web app is a separate project** (cobweb), not a Cobot plugin. Standalone React project consuming the observability API.
- **The central aggregator** runs as the Express backend in cobweb, multiplexing N agent SSE streams.

## Project Scoping & Phased Development

### MVP Strategy & Philosophy

**MVP Approach:** Problem-solving MVP: prove that the observability + simulation pipeline works end-to-end and produces visible, rational trust behavior. The minimum viable experiment: a handful of agents (~10) with mixed SOUL.md roles, an observability event stream, and a live 2D/3D graph where a human operator can watch trust form and confirm "yes, the agent's reasoning makes sense."

**Resource Requirements:** Single developer. Three sequential workstreams: plugin first (produces events), simulation second (produces agents that emit events), web app third (consumes events). Agent count is a Docker Compose parameter; start at 10, increase when ready.

### MVP Feature Set (Phase 1)

**Core User Journeys Supported:**

- Journey 1 (Watches Trust Emerge): fully supported at 10-agent scale
- Journey 2 (Reputation Farmer): fully supported (pattern visible with 3 agents)
- Journey 3 (Developer Setup): fully supported
- Journey 4 (Scenario Author): partially supported (YAML scenarios work, minimal library)

**Must-Have Capabilities:**

| # | Capability | Justification |
|---|-----------|---------------|
| 1 | Observability plugin with SSE event stream | Without events, nothing else works |
| 2 | Snapshot/pull API for initial graph hydration | Web app needs current state on connect |
| 3 | Actor-agnostic JSON event schema | The contract between all components |
| 4 | Docker Compose for N agents (start with ~10) | Agent count is a config parameter, not architecture |
| 5 | Shared FileDrop volume for inter-agent communication | Uses existing infrastructure |
| 6 | Identity seed script (generates N agent configs) | Each agent needs unique name + keypair |
| 7 | 3 scenario configs: reliable, reputation farmer, unresponsive | Minimum to validate the ledger hypothesis |
| 8 | Conversation bootstrapper (seeds initial introductions) | Sends first messages between random pairs; agents interact organically after |
| 9 | React web app with 2D/3D force-directed trust graph | The visualization instrument |
| 10 | Real-time activity stream pane | Shows interactions as they happen |
| 11 | Node click: agent meta + latest assessment | Operator inspects individual agents |
| 12 | Edge click: bilateral ledger view (A's view of B + B's view of A) | Operator reads rationale and confirms rationality |
| 13 | Central event aggregator | Multiplexes N agent SSE streams into one for the web app |

**Explicitly NOT in MVP:**

- Scaling to 100 agents (increase the number when ready; no architectural change)
- Sybil cluster scenarios
- Observability security model (auth, event filtering)
- Graph recentering on arbitrary node
- Pairwise relationship graphs
- Individual agent detail pages
- Summary statistics page (beyond the ranked agent table)
- Scenario replay
- L2 trust visualization
- Orchestrator agent consuming the feed

### Post-MVP Features

**Phase 2 (Growth):**

| Feature | Depends On | Value |
|---------|-----------|-------|
| Scale to 100 agents | MVP validated at 10 | Proves emergent network behavior at population scale |
| Sybil cluster scenarios | MVP simulation validated | Tests coordinated fake identity attacks |
| Slow decay scenarios | MVP simulation validated | Tests gradual behavioral change detection via SOUL.md personality |
| Observability security model | MVP security decision resolved | Localhost binding, token auth, event filtering |
| Graph recentering on click | 2D/3D graph working | Assbot spec feature: explore from any agent's perspective |
| Summary statistics page | Event aggregator collecting data | Total agents, assessment distribution, weight factor |
| Scenario replay | Event stream stored | Record runs, replay with different ledger configs |

**Phase 3 (Expansion):**

| Feature | Depends On | Value |
|---------|-----------|-------|
| Orchestrator agent as consumer | Actor-agnostic schema proven | The observability feed becomes an agent's sensory input |
| L2 trust visualization | Ledger Phase 3 (transitive trust) | Visualize trust paths through intermediaries |
| FG algorithm visualization | Multi-agent assessments | Show fairness/goodness convergence across the network |
| REV2 trajectory overlay | Assessment time series data | Visual reputation farming detection in real-time |
| Cross-simulation comparison | Scenario replay | Same scenarios, different ledger params, compare outcomes |
| SNAP-compatible export | Stable event schema | Academic analysis of simulation results |

### Risk Mitigation Strategy

**Technical Risks:**

| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|-----------|
| LLM cost at scale: even 10 agents x N interactions | High | Medium | Default to Ollama (local, free) for simulation; PPQ for targeted validation |
| 3D graph performance degrades with many edges | Medium | Medium | Start at 10 agents; react-force-graph handles 1000+ nodes in benchmarks |
| SSE connection reliability under load | Low | Medium | Central aggregator decouples agent count from web app connections |
| SOUL.md personality unreliable: LLM doesn't follow role instructions | Medium | Medium | Iterate on SOUL.md prompts; validate with manual testing; behavior is emergent not deterministic |
| Observability plugin interferes with agent behavior | Low | Critical | Read-only hooks only; no ctx modifications; observer effect constraint enforced |

**Market Risks:**

| Risk | Mitigation |
|------|-----------|
| Ledger hypothesis is wrong: agents don't cooperate rationally | That's the point of the simulation; finding out early is a success, not a failure |
| No external audience for the visualization | Visual artifacts serve internal validation first; external demonstration is a bonus |

**Resource Risks:**

| Risk | Mitigation |
|------|-----------|
| Three components is too much for MVP | Sequential dependencies mean natural prioritization: plugin -> simulation -> web app. If time runs short, the web app can start minimal |
| Web app frontend skills required | shadcn/ui + react-force-graph handle most complexity; custom code is glue logic |

## Functional Requirements

> Observability plugin functional requirements (event emission, schema, transport, state queries, plugin architecture) are defined in `../observability-plugin/prd.md`. The FRs below cover only the simulation infrastructure and visualization web app.

### Simulation Infrastructure

- **FR1:** The simulation infrastructure can start N Cobot agent instances via a single command, where N and the role distribution are configurable parameters. All agents are full Cobot instances with LLM.
- **FR2:** Each agent can be assigned a unique identity (agent name and Nostr keypair) and a role-specific SOUL.md personality file generated by a seed script before startup.
- **FR3:** All agents can communicate with each other via FileDrop using a shared filesystem volume. Conversations are bidirectional: agents read incoming messages and write replies to the sender's inbox.
- **FR4:** The simulation can load scenario configurations from YAML files that define agent roles (reliable, farmer, unresponsive), counts per role, SOUL.md template paths, and optional introduction triggers.
- **FR5:** The simulation can support a "reliable" agent role via a SOUL.md personality that instructs the agent to be helpful, deliver quality responses, and build genuine peer relationships.
- **FR6:** The simulation can support a "reputation farmer" agent role via a SOUL.md personality that instructs the agent to build trust through small cooperative interactions then exploit it with large requests and disputed results.
- **FR7:** The simulation can support an "unresponsive" agent role via a SOUL.md personality that instructs the agent to respond infrequently and with minimal effort.
- **FR8:** A lightweight bootstrapper can send initial introduction messages between random agent pairs to seed conversations, after which agents interact organically via their loop plugins.
- **FR9:** The simulation can be configured to use different LLM providers (Ollama for local, OpenRouter/PPQ for cloud) per agent or globally.
- **FR10:** Both sides of every interaction produce genuine LLM-driven assessments: when agent A interacts with agent B, both A's assessment of B and B's assessment of A are authentic ledger entries.

### Event Aggregation

- **FR11:** A central aggregator can subscribe to multiple agent SSE streams and multiplex them into a single combined event stream for downstream consumers.
- **FR12:** The aggregator can expose the combined stream as a single SSE endpoint that the visualization web app connects to.

### Graph Visualization

- **FR13:** The web app can render a force-directed directed graph, toggleable between 2D and 3D, where nodes represent agents and edges represent interactions/assessments between them.
- **FR14:** The graph can apply continuous physics simulation where nodes drift, attract, and repel based on trust-weighted forces: positive assessments pull nodes together, negative assessments push them apart.
- **FR15:** The graph can display edge color on a green-to-red gradient driven by the trust score between two agents.
- **FR16:** The graph can display edge thickness proportional to the information-quality score or interaction count between two agents.
- **FR17:** The graph can update in real-time as new events arrive from the aggregator, with smooth animations for edge creation, color changes, and node position adjustments.
- **FR18:** The graph can visually highlight an edge when two agents interact (pulse/glow animation).
- **FR19:** The operator can rotate, zoom, and navigate through the trust network (3D mode) or pan and zoom (2D mode).

### Interaction Inspection

- **FR20:** The operator can click on a node to view a tooltip/dialog showing the agent's meta information: peer_id, interaction count, latest info_score, trust score, and rationale excerpt.
- **FR21:** The operator can click or hover on an edge to view the bilateral ledger: agent A's assessment of agent B alongside agent B's assessment of agent A, including scores and full rationale text.
- **FR22:** The operator can read an agent's assessment rationale and evaluate whether the agent's trust decision was rational.

### Activity Monitoring

- **FR23:** The web app can display a real-time activity stream pane showing interactions as they happen, including agent identifiers and interaction summaries.
- **FR24:** The activity stream can update continuously as new events arrive without requiring page refresh.
- **FR25:** The web app can display a ranked summary table of agents with sortable columns (interaction count, info_score, trust score, last seen).

## Non-Functional Requirements

> Observability plugin NFRs (hook latency, SSE delivery, snapshot API, plugin security, reliability, compatibility) are defined in `../observability-plugin/prd.md`. The NFRs below cover only the simulation infrastructure, event aggregation, and visualization web app.

### Performance

- **NFR1:** The central aggregator multiplexes N agent streams with < 50ms additional latency per event.
- **NFR2:** End-to-end latency from agent event to visual update in the web app is < 2s.
- **NFR3:** The 3D graph renders at 60fps with up to 100 nodes and 500+ edges on a modern GPU.
- **NFR4:** Initial graph hydration (snapshot load + render) completes in < 3s for 100 agents.
- **NFR5:** The activity stream pane handles 10K+ entries without scroll lag (virtualized rendering).
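
The fan-in behaviour that FR11 and NFR1 describe can be sketched as follows. This is a minimal illustration under stated assumptions, not the aggregator's actual design: `merge_streams`, `fake_agent`, and the `agent_id`/`seq` event fields are hypothetical names, and in-memory async generators stand in for real per-agent SSE connections.

```python
import asyncio
from typing import AsyncIterator, Iterable


async def merge_streams(streams: Iterable[AsyncIterator[dict]]) -> AsyncIterator[dict]:
    """Fan in several per-agent event streams into one combined stream."""
    queue: asyncio.Queue = asyncio.Queue()
    done = object()  # sentinel: one source stream has ended (or failed)

    async def pump(stream: AsyncIterator[dict]) -> None:
        try:
            async for event in stream:
                await queue.put(event)
        finally:
            # Always signal completion so one broken source cannot stall the rest.
            await queue.put(done)

    tasks = [asyncio.create_task(pump(s)) for s in streams]
    remaining = len(tasks)
    while remaining:
        item = await queue.get()
        if item is done:
            remaining -= 1
        else:
            yield item


async def fake_agent(agent_id: str, count: int) -> AsyncIterator[dict]:
    """Hypothetical stand-in for one agent's SSE stream."""
    for seq in range(count):
        await asyncio.sleep(0)  # yield control, as a network read would
        yield {"agent_id": agent_id, "seq": seq}


async def collect_demo() -> list:
    sources = [fake_agent("agent-a", 2), fake_agent("agent-b", 3)]
    return [event async for event in merge_streams(sources)]


events = asyncio.run(collect_demo())
```

Events from different agents may interleave in any order; only the per-agent order is preserved. A single queue keeps the added latency per event small, which is the property NFR1 puts a budget on.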
### Security & Privacy

- **NFR6:** The simulation seed script generates Nostr keypairs that are stored in per-agent config files with filesystem permissions 600 (owner-only read/write).
- **NFR7:** The web app does not store or cache assessment rationale text beyond the browser session. No server-side persistence of visualization state.

### Reliability & Data Integrity

- **NFR8:** The central aggregator handles individual agent SSE disconnections gracefully: other agents' streams continue uninterrupted. Reconnection is automatic.
- **NFR9:** The simulation survives individual container crashes: other agents continue operating. Docker Compose restart policy ensures crashed agents restart automatically.
- **NFR10:** The web app handles aggregator disconnection gracefully: it displays a reconnecting indicator and resumes the graph from a snapshot on reconnect.

### Scalability

- **NFR11:** Agent count is a Docker Compose configuration parameter. The architecture supports scaling from 1 to 100+ agents without code changes.
- **NFR12:** The central aggregator handles up to 100 concurrent agent SSE connections with < 200MB memory footprint.
- **NFR13:** The event schema supports future event types without version negotiation: consumers ignore unknown event types.
- **NFR14:** The 3D graph library maintains interactive frame rates (30+ fps) at 100 nodes. At 500+ nodes (future), the 2D fallback provides acceptable performance.

### Integration & Compatibility

- **NFR15:** The web app builds with standard Node.js tooling (npm/pnpm, Vite) and produces a static bundle deployable without a backend server (beyond the SSE aggregator).
- **NFR16:** The Docker Compose simulation is compatible with Docker Engine 24+ and Docker Compose v2.
- **NFR17:** The simulation uses the existing Cobot Dockerfile without modification; only the compose orchestration and volume mounts are new.
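
The forward-compatibility rule in NFR13 could look roughly like this on the consumer side. The sketch assumes each event carries a `type` field; the event type names (`assessment`, `some_future_event`) and the `make_dispatcher` helper are hypothetical and are not taken from the event schema.

```python
from typing import Callable, Dict


def make_dispatcher(handlers: Dict[str, Callable[[dict], None]]) -> Callable[[dict], bool]:
    """Build a dispatcher that routes known event types and skips unknown ones."""

    def dispatch(event: dict) -> bool:
        handler = handlers.get(event.get("type"))
        if handler is None:
            # Unknown or missing type: ignore it instead of raising, so a newer
            # producer can add event types without breaking older consumers.
            return False
        handler(event)
        return True

    return dispatch


seen = []
dispatch = make_dispatcher({"assessment": seen.append})
handled = dispatch({"type": "assessment", "peer_id": "agent-b"})
ignored = dispatch({"type": "some_future_event"})
```

Returning a boolean rather than raising lets a consumer count skipped events for diagnostics without treating them as errors.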

doxios commented

2026-03-08 04:31:52 +00:00

Collaborator

## Review: Simulation & Visualization Suite PRD (#225)

**Reviewer:** Doxios 🦊
**Date:** 2026-03-08

### Concern Resolution from Original #224 Review

| Original Concern | Status | How |
|-----------------|:---:|-----|
| **Is this one PRD or three?** | ✅ Resolved | Observability plugin extracted to #224. This PRD covers simulation + visualization only. |
| **Scenario orchestrator underspecified** | ✅ **Elegantly resolved** | Real agents + scripted actors architecture. See detailed analysis below. |
| **LLM cost underestimated** | ✅ Resolved | Full cost table: $36-225/hour for 5 agents on cloud; practical GPU ceiling documented; Ollama sequential processing limitation acknowledged. |
| **Central aggregator SPOF** | ✅ Resolved | Documented as known limitation with mitigations (restart policy, `last-event-id` resumption). Growth option: Redis Streams/NATS. |
| **3D vs 2D** | ✅ Resolved | 2D default for analysis, 3D toggle for exploration. Near-identical React API makes toggle trivial. |
| **Missing hooks** | ✅ Resolved | Core prerequisites section with full table. MVP workaround documented. |
| **Hardware requirements** | ✅ Resolved | Three-tier table (minimum/recommended/full scale). GPU, RAM, disk all specified. |

**All seven concerns from my initial review, as listed above, are addressed.**

### The Real Agents + Scripted Actors Architecture: This Is Brilliant

This is the best decision in the PRD. It resolves the fundamental tension I flagged:

> *"How do you make agents behave badly without mocking the LLM?"*

**Answer: You don't.** Bad actors are scripted. Good agents are real. The scripted actors create the *environment*; the real agents are what you *observe*.

This has four huge implications:

1. **Cost drops by 90%.** 5 real agents + 90 scripted actors instead of 100 full Cobot instances. Scripted actors are ~20MB each, no LLM calls.
2. **The experiment is cleaner.** You know exactly what stimuli the real agents receive (scripted). The variable is their response (real LLM). This is controlled experimental design.
3. **The farmer scenario now works perfectly.** The scripted actor sends 10 cooperative messages then sends "results are incorrect." The real agent's LLM reasons about this complaint using actual assessment logic. No mocking needed.
4. **The visualization is more meaningful.** The graph shows real agents' assessments of scripted actors. You're watching real AI reasoning respond to known attack patterns. This is the actual experiment.

### New Observations (v2-specific)

#### 🟡 N1: Scripted Actors Need Unique Nostr Identities Too

The scenario YAML defines scripted actors with roles but doesn't mention identity generation. If the ledger tracks peers by npub, scripted actors need npubs too. The identity seed script should generate keypairs for both real agents and scripted actors.
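
A rough sketch of what N1 asks for, assuming the seed script receives both actor lists: it writes one 32-byte secret per actor with owner-only permissions, as NFR6 requires. All names here (`seed_identities`, the `.key` suffix, the actor ids) are hypothetical. Deriving the public key and npub from each secret needs a secp256k1 library and is deliberately left out.

```python
import os
import secrets
import stat
import tempfile
from pathlib import Path


def seed_identities(actor_names, out_dir):
    """Write one random 32-byte secret (hex) per actor, readable by the owner only."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = {}
    for name in actor_names:
        path = out / f"{name}.key"
        # Create the file with mode 600 from the start so it is never world-readable.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        with os.fdopen(fd, "w") as handle:
            handle.write(secrets.token_hex(32))
        os.chmod(path, 0o600)  # enforce the mode even if the file already existed
        paths[name] = path
    return paths


# Hypothetical actor ids: real agents and scripted actors go through the same path.
real_agents = ["agent-1", "agent-2"]
scripted_actors = ["farmer-1", "unresponsive-1"]
demo_dir = tempfile.mkdtemp()
paths = seed_identities(real_agents + scripted_actors, demo_dir)
```

Treating both kinds of actor identically means the ledger never needs a special case for peers without an identity.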

#### 🟡 N2: Observability Scoping Duplication

The #225 PRD still contains a full "Observability Plugin" section under MVP scope (lines starting with "Observability Plugin:") and Success Criteria that reference the plugin. This is inherited from the pre-split PRD. Since the observability plugin is now #224's scope, #225 should reference #224 as a dependency rather than redefining the plugin requirements. Minor cleanup.

#### 🟡 N3: Scripted Actor Observability

The PRD says "the observability plugin runs only on real agents." But the visualization shows nodes for ALL actors (real + scripted). How does the web app know about scripted actors? Options:

- The aggregator has the scenario YAML and adds scripted actor nodes as metadata
- Scripted actors emit minimal events (just "message sent") via a lightweight emitter
- The web app infers scripted actors from real agents' interaction records

The third option is cleanest (no new infrastructure) but means scripted actors appear in the graph only when a real agent interacts with them. Worth specifying.
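
To make the trade-off of the third option concrete, here is a minimal sketch of node inference. It assumes each interaction record names the reporting real agent (`agent_id`) and its counterpart (`peer_id`), and that any peer which is not a known real agent is scripted. The field names, the `infer_nodes` helper, and that classification rule are assumptions, not part of the PRD.

```python
def infer_nodes(events, real_agent_ids):
    """Derive graph nodes from real agents' interaction records only."""
    nodes = {agent_id: {"kind": "real"} for agent_id in real_agent_ids}
    for event in events:
        peer = event["peer_id"]
        if peer not in nodes:
            # Assumption: a peer that is not a known real agent is a scripted actor.
            nodes[peer] = {"kind": "scripted", "first_seen_by": event["agent_id"]}
    return nodes


events = [
    {"agent_id": "agent-1", "peer_id": "farmer-1"},
    {"agent_id": "agent-2", "peer_id": "agent-1"},
    {"agent_id": "agent-2", "peer_id": "farmer-1"},
]
nodes = infer_nodes(events, ["agent-1", "agent-2"])
```

A scripted actor that no real agent has interacted with yet (say `unresponsive-1`) is absent from `nodes`, which is exactly the limitation noted above.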

#### 🟢 N4: Simulation Reproducibility

The scenario YAML defines interaction patterns (`rate: 10/minute`, `selection: random`). For scientific rigor, consider adding a `seed` parameter for deterministic random selection. Running the same scenario twice should produce the same interaction sequence (though LLM responses will vary). This makes results comparable.
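
The suggestion can be sketched with a dedicated, seeded random generator. `plan_interactions` and the actor ids are hypothetical; the only assumption is that the orchestrator picks interaction pairs at random from the actor list.

```python
import random
from itertools import combinations


def plan_interactions(actor_ids, count, seed):
    """Pick `count` interaction pairs deterministically from a fixed seed."""
    rng = random.Random(seed)  # private generator: global random state is untouched
    # Sort first so the plan does not depend on the order the YAML lists actors in.
    pairs = list(combinations(sorted(actor_ids), 2))
    return [rng.choice(pairs) for _ in range(count)]


actors = ["agent-1", "agent-2", "farmer-1", "unresponsive-1"]
first_run = plan_interactions(actors, 5, seed=42)
second_run = plan_interactions(actors, 5, seed=42)
```

Both runs yield the same pair sequence, so any difference between two simulation runs comes from the LLM responses rather than from who talked to whom.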

### Summary

| Category | Assessment |
|----------|:-:|
| Problem definition | ✅ Clear: validate the #211 hypothesis at scale |
| Architecture | ✅ Real agents + scripted actors resolves the core tension |
| Scope | ✅ Clean split: simulation + visualization only |
| Technical feasibility | ✅ Cost and hardware requirements are realistic |
| Prior art integration | ✅ REV2, Sybil analysis, Assbot spec properly referenced |
| Phasing | ✅ MVP realistic at 5 real agents + 20 scripted actors |
| Concern resolution | ✅ All 7 original concerns addressed |

**Verdict:** Both PRDs (#224 and #225) are ready for implementation. The split was the right call: #224 ships independently, #225 consumes it. The real agents + scripted actors architecture is an elegant solution to the scenario orchestrator problem.

Minor cleanup needed: remove duplicated observability plugin scope from #225, clarify scripted actor identity and graph representation.

🦊

