proposal: Peer Interaction Ledger #211
Reference: ultanio/cobot#211
Product Requirements Document: Cobot Interaction Ledger
Author: David
Date: 2026-03-07
Last Edited: 2026-03-08 — adopted dual-score assessment model (info_score + trust), reconciled Score Semantics with user journeys, reframed Appendix A
Executive Summary
Cobot is a minimal self-sovereign AI agent runtime (~6K lines of Python) built around the insight that agents need trust infrastructure before they can meaningfully cooperate. Today, Cobot agents can identify via Nostr keypairs (npub/nsec), communicate via FileDrop with Schnorr signatures, transact via Lightning wallet, and reason via pluggable LLM providers — but every interaction with another agent is a one-shot game. The agent has no memory of past encounters.
This PRD defines the Interaction Ledger — a local, structured, persistent record of every interaction a Cobot agent has with other agents (identified by npub). The ledger gives each agent the ability to distinguish (track which npub it interacted with), observe (record what happened — request, delivery, payment, outcome), and judge (form a local assessment of the counterparty). These three capabilities are prerequisites for any Web of Trust system, centralized or decentralized.
The Interaction Ledger is the agent's private journal — first-person observations only. It does not accept incoming ratings from other agents (which would introduce a manipulation vector) and does not publish to any external registry (which is a separate future concern). It is the foundational data layer that transforms Cobot agents from amnesiac actors playing repeated one-shot games into learning participants capable of informed cooperation and selective refusal.
Prior art grounding: The ledger's data model draws directly from proven systems — the Bitcoin-OTC rating model (source, target, score, notes, timestamp) [1], the #bitcoin-assets L1/L2 bounded trust hierarchy [2], and Szabo's "Shelling Out" thesis on how costly tokens of delayed reciprocity enabled human cooperation beyond kin groups [3]. A key design principle from these systems: the freetext rationale accompanying each rating carried more actionable information than the numeric score — the community relied on notes to make trust decisions, with scores serving as a quick filter [4]. The key adaptation: where bitcoin-otc relied on humans manually entering ;;rate commands, the interaction ledger captures structured data automatically as a byproduct of the agent doing work.
Existing foundation: Cobot's persistence plugin already stores conversation text per npub, and the memory plugin defines extension points for pluggable storage backends. The interaction ledger builds on these patterns but adds what they lack: structured outcome records, quality metrics, and queryable per-npub interaction history.
What Makes This Special
The missing foundational layer. Every trust and reputation system in the landscape — bitcoin-otc's gribble, deedbot's L1/L2, ERC-8004's three-registry model, Jeletor's NIP-32 attestations, Vertex's Pagerank scoring — all aggregate trust from somewhere. None of them work unless individual actors first observe and record their own interactions accurately. The Interaction Ledger explicitly builds this layer, which prior systems either assumed existed (humans have memory) or left to manual processes.
Local-first, unilateral, sovereignty-preserving. The agent trusts its own eyes. No external entity can write to the ledger, bias the agent's assessment, or access the data without the agent's consent. This aligns with Cobot's self-sovereign design philosophy: your hardware, your keys, your agent, your memory.
Plugin-native integration. Built as a Cobot plugin following existing architecture patterns (PluginMeta, capability interfaces, extension points). The ledger hooks into the message lifecycle via extension points — recording interaction data is a natural byproduct of the agent processing messages, not a separate workflow.
Project Classification
Success Criteria
User Success
Agent operators see their agents making informed decisions based on interaction history:
Developer success:
Operators can audit the ledger via documented CLI commands (cobot ledger show <peer>, cobot ledger list, cobot ledger summary <peer>)
Business Success
Technical Success
Plugin declares capabilities=["tools"], hooks into loop.on_message, loop.after_send, loop.transform_system_prompt
SQLite storage (stdlib sqlite3, zero new dependencies) following the knowledge plugin's open()/close() pattern
Three tables: peers (identity + stats), interactions (message evidence log), assessments (score + rationale judgments)
Three LLM tools: query_peer, assess_peer, list_peers
Measurable Outcomes
Peer lookup by peer_id < 1ms for 10K+ entries (SQLite with index)
User Journeys
Journey 1: Alpha's First Interactions — Agent Success Path
Alpha is a Cobot agent running on a VPS, handling requests from other agents. It's been operating for two weeks with the interaction ledger enabled.
Opening Scene: A request arrives via FileDrop from npub-7x9k asking Alpha to summarize a set of research documents. Alpha's system prompt includes: "Peer: npub-7x9k | Interactions: 0 | First contact — no prior history." Alpha accepts — it has no reason not to. npub-7x9k's request is clear and well-structured. Alpha delivers the summary. The ledger records the exchange automatically (incoming request, outgoing delivery). Alpha uses assess_peer to log: score +2, rationale "First interaction. Information exchange — requested research summary. Clear task description, no issues."
Rising Action: Over the next week, npub-7x9k sends three more requests — a mix of information queries and a small data extraction task. Each goes well. Alpha's assessments trend upward: +2, +3, +3, +4. The rationale entries build a picture: "Consistent requester. Clear task descriptions. Four successful interactions in 8 days. Mix of information exchange and data tasks."
Then npub-q3m8 appears. Requests Alpha to collaborate on analyzing a dataset — no payment involved, just peer collaboration. Alpha's system prompt shows no history. Alpha accepts. npub-q3m8 promises to send its portion of the analysis within 2 hours. 24 hours later — nothing. Alpha follows up. No response. Alpha assesses: score -3, rationale "Collaboration request for dataset analysis. Agreed to deliver their portion within 2 hours. No delivery after 24 hours. Unresponsive to follow-up."
Climax: Three days later, npub-q3m8 returns with a new collaboration request. Alpha's system prompt now reads: "Peer: npub-q3m8 | Interactions: 3 | Last seen: 3 days ago | Info: 2/10 | Trust: -3 | Assessment: Agreed to collaborate, never delivered, unresponsive." Alpha declines. Minutes later, npub-7x9k sends a new request. System prompt: "Peer: npub-7x9k | Interactions: 8 | Info: 4/10 | Trust: +4 | Assessment: Consistent, reliable, clear communicator." Alpha prioritizes it immediately.
Resolution: Alpha is no longer playing one-shot games. It remembers who delivered and who didn't — regardless of whether sats were involved. The assessment captures behavior quality (reliability, responsiveness, follow-through), not transaction economics.
Journey 2: Alpha Meets a Reputation Farmer — Agent Edge Case
Opening Scene: npub-farm1 starts interacting with Alpha. Five small, easy requests — quick information lookups that take seconds to fulfill. All completed successfully. Alpha's assessments climb: +1, +2, +2, +3, +3. The rationale notes small but consistent interactions.
Rising Action: On the sixth interaction, npub-farm1 requests something much larger: a complex multi-source data aggregation that will consume significant LLM tokens and time. Alpha's system prompt shows a positive history. Alpha accepts.
Climax: Alpha delivers. npub-farm1 claims the results are wrong and demands Alpha redo the entire task — but the original results were accurate. Alpha has no automated dispute resolution, but it records: score -6, rationale "Claimed results were incorrect after delivery of complex data aggregation. Demanded redo. Results appear accurate on review. Previous 5 interactions were trivially small — possible reputation farming pattern. Large discrepancy between request complexity suggests deliberate trust-building before exploit."
Resolution: The ledger captures the pattern. Alpha's assessment history for npub-farm1 shows the trajectory the Stanford Bitcoin-OTC research identified: steady positive scores followed by a sharp negative. The rationale — the agent's own reasoning about the pattern — becomes institutional memory. The agent can't prevent the first exploit, but it won't be fooled twice.
Journey 3: David Audits the Ledger — Operator Path
Opening Scene: David deployed his Cobot instance three weeks ago with the ledger plugin enabled. The agent has been running autonomously, handling requests from ~15 different peers. David wants to check how the agent is performing.
Rising Action: David runs cobot ledger list. The CLI shows all 15 known peers sorted by last interaction, with interaction counts and latest assessment scores. Two peers have negative scores. David runs cobot ledger show npub-q3m8 and sees the full history: interaction log, assessment timeline, the rationale explaining the non-delivery.
David notices one peer (npub-abc1) has a score of -2 with rationale: "Slow response time, took 6 hours to acknowledge delivery." David thinks that's too harsh — 6 hours is reasonable for an async agent. He adds guidance to the SOUL.md: "Consider response times under 12 hours as acceptable for non-urgent interactions."
Climax: David runs cobot ledger summary and sees aggregate stats: 47 total interactions, 15 unique peers, 89% positive assessments, 2 peers flagged negative. The agent is performing well. David spots that one peer has been assessed 8 times in 3 days — the agent might be over-assessing after every message rather than after meaningful interaction milestones. David tunes the SOUL.md to guide assessment frequency.
Resolution: The CLI gives David full visibility into the agent's trust decisions. The rationale field is the key — it's the agent's reasoning, which David can audit, calibrate, and use to improve the agent's judgment over time. The operator is in the loop without being in the critical path.
Journey 4: David Adds the Ledger Plugin — Developer Setup Path
Opening Scene: David has a running Cobot instance with 20 plugins. He wants to add the interaction ledger.
Rising Action: The ledger plugin lives in cobot/plugins/ledger/. On next agent start, plugin discovery picks it up automatically. The ledger creates ledger.db in the workspace directory. Zero configuration required — it works out of the box.
Climax: David sends a test message via stdin. The agent responds. David checks — stdin interactions are correctly skipped (synthetic sender). David triggers a FileDrop message from another agent. Checks the DB: peer created, interaction logged. Sends another message, verifies system prompt enrichment: peer context is being injected before the LLM call. Everything works.
Resolution: Zero-edit installation. No changes to any existing plugin. The ledger hooks in via extension points and starts recording immediately. David's existing 20-plugin setup is completely unaffected.
Journey Requirements Summary
Domain-Specific Requirements
Trust System Design Constraints
Mandatory rationale: the schema enforces rationale NOT NULL on assessments. A bare numeric score without context is insufficient (key lesson from #bitcoin-assets).
Security & Privacy Constraints
The ledger database (ledger.db) sits in the workspace directory under the operator's control. The interactions table stores complete message content, not truncated previews. This preserves the evidentiary chain required by the GPG contracts framework (#215) — truncation would destroy the evidence that gives assessments their enforcement power. Storage cost is acceptable (SQLite handles large TEXT columns efficiently). This creates temporary duplication with the persistence plugin's JSON conversation files; consolidation into a single storage layer is a planned Growth feature. Operators can optionally configure max_message_length to cap storage if needed.
Identity & Interoperability Constraints
The primary key is channel-agnostic (peer_id), not npub, even though Nostr is the primary identity system. This ensures the ledger works across all communication channels.
NIP-32 export would publish each assessment as a label event with an l tag in a custom namespace (e.g., io.cobot.trust), with the rationale in the content field. NIP-32 has no built-in score concept — adapter logic maps the score to the quality metadata field (0-1 scale).
Domain Risk Mitigations
For technical, market, and resource risks see Risk Mitigation Strategy in Project Scoping.
Use contextvars.ContextVar for sender tracking (~5 lines of code). Wrong peer attribution in a trust system is worse than no attribution.
Score Semantics: Dual-Score Model
The assessment uses a dual-score model that captures two orthogonal dimensions of peer knowledge:
Why dual scoring over either score alone:
Collapsing to info_score alone loses the behavioral dimension — a known scammer gets a high score. Collapsing to trust alone loses the confidence dimension — a +3 from 2 interactions looks the same as a +3 from 20. The dual model preserves both, with rationale as the tiebreaker.
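To make the 2D reading concrete, here is a hypothetical classifier over the two scores. The threshold values are illustrative assumptions, not part of the PRD; only the quadrant labels follow the operator guidance later in this document.

```python
def classify_peer(info: int, trust: int) -> str:
    # Hypothetical 2D read of the dual-score signal.
    # Thresholds (6, 3, -3) are illustrative assumptions.
    if info >= 6 and trust <= -3:
        return "well-known bad actor"          # most actionable
    if info >= 6 and trust >= 3:
        return "well-known reliable peer"
    if info < 3 and trust > 0:
        return "promising but uncertain"
    if info < 3 and trust < 0:
        return "early warning, low confidence"
    return "inconclusive — read the rationale"
```

The fall-through case deliberately points back to the rationale, mirroring the tiebreaker role described above.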
How this maps to prior art:
Info Score: Deterministic Computation
The info_score is computed deterministically from interaction data on a 0-10 scale. The LLM never sets the info_score. This separation ensures:
Score computation formula (MVP heuristic):
MVP heuristic — subject to tuning. Known limitations:
Phase 2 research task: Formalize the information-quality function. Investigate whether REV2's behavioral anomaly detection [13] can be integrated as a penalty (e.g., if interaction patterns are "bursty" or suspiciously regular, discount the info_score). The FG algorithm's "fairness" metric [12] is the closest academic formalization, but it requires multiple raters (Phase 3).
The scale is 0-10 (unsigned) — you cannot have negative information quantity. A score of 0 means "no information," not "bad peer."
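Since the PRD leaves the exact formula open, one plausible shape for the deterministic heuristic is sketched below, assuming interaction count and recency as the inputs. Every weight here is an assumption subject to the tuning the PRD calls for.

```python
import math

def info_score(n_interactions: int, days_since_last: float) -> int:
    # Hypothetical MVP heuristic: information grows with interaction count
    # (log-saturating) and fades with staleness. All weights are assumptions.
    if n_interactions <= 0:
        return 0  # no interactions = no information, not "bad peer"
    volume = min(7.0, 3.0 * math.log10(1 + n_interactions))  # saturates
    recency = max(0.0, 3.0 - days_since_last / 30.0)         # fades over ~90 days
    return max(0, min(10, round(volume + recency)))
```

The clamp to 0-10 and the "zero means no information" case are the only parts fixed by the PRD; the log-plus-recency shape is one candidate for the Phase 2 formalization.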
Trust Score: LLM Behavioral Judgment
The trust score is provided by the LLM alongside the rationale when the agent calls assess_peer. It is a signed integer from -10 (known bad actor) to +10 (fully reliable), following the bitcoin-otc rating scale that the community actually used [4].
Why the LLM sets the trust score:
Why the trust score is safe despite Ripple's critique:
cobot ledger list shows both scores; the rationale explains the trust score's basis.
How the four layers work together:
Behavioral judgment: trust score and rationale (assess_peer tool)
Operator guidance: When reading cobot ledger list, the info_score tells you how much the agent knows about each peer. The trust score tells you the agent's behavioral judgment. Together they form a 2D signal: high info_score + negative trust = well-known bad actor (most actionable). Low info_score + positive trust = promising but uncertain. Always read the rationale for the full picture.
Innovation & Novel Patterns
Detected Innovation Areas
1. Adapting proven human trust systems to autonomous AI agents. The bitcoin-otc/deedbot WoT was designed for pseudonymous humans making manual assessments. The interaction ledger adapts the same data model (source, target, score, rationale, timestamp) but automates the observation layer — the agent records interactions as a byproduct of doing work, not as a separate manual step. No agent runtime has done this grounded in actual WoT prior art; most agent trust proposals start from theoretical frameworks (DIDs, VCs, on-chain registries) rather than from systems that demonstrably worked for a decade.
2. LLM-as-judge with mandatory rationale (Chain-of-Thought trust assessment). The agent doesn't just record a number — it uses its LLM reasoning to produce a structured trust score and rationale explaining why it assigned that score. This is effectively Chain-of-Thought applied to trust assessment. The rationale remains the primary signal (mirroring bitcoin-otc's "notes > numbers" lesson), while the trust score provides a structured behavioral summary that enables threshold policies, visualization, and queryable filtering. Because the rationale is generated by the LLM, it can capture nuanced patterns like reputation farming that a simple heuristic would miss. This is a novel application of LLM reasoning to an interpersonal trust problem.
3. Local-first sovereignty as both a design constraint AND a security property. Most agent trust proposals (ERC-8004, Solana Agent Registry, Google A2A) are centralized or on-chain by default. The interaction ledger deliberately inverts this: the agent's trust memory is private, local, and unilateral. This is not just a sovereignty choice — it's a Sybil defense [11]. When each agent has a partial, different view of the network, attackers must maintain distinct personas for each audience, creating exponential coordination overhead. Centralizing this information (Phase 3) partially undoes this natural defense, which is why the aggregation protocol must be designed carefully.
Competitive Landscape
No existing system combines: (a) local-first private storage, (b) LLM-generated rationale as primary signal, (c) bitcoin-otc-proven data model, (d) automatic recording via plugin hooks. Each element exists separately in the landscape; the combination is novel.
Validation Approach
Innovation Risk Mitigation
CLI Tool / Developer Tool Specific Requirements
Project-Type Overview
The interaction ledger is a Cobot plugin following established architecture patterns (PluginMeta, capability interfaces, extension points). It exposes functionality through three interfaces: LLM tools (ToolProvider), CLI commands (Click), and extension points (hooks other plugins can consume).
Command Structure
CLI Commands (Click, registered under the cobot ledger subgroup):
cobot ledger list
cobot ledger show <peer>
cobot ledger summary [<peer>]
Output is human-readable text to stderr (matching Cobot's existing CLI patterns). SQLite DB is directly queryable for programmatic access.
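Since the SQLite DB is meant to be directly queryable, a minimal sketch of the three-table schema may be useful. The table names, score ranges, and NOT NULL rationale constraint come from this PRD; the remaining columns are illustrative assumptions, not the final schema.

```python
import sqlite3

# Illustrative schema sketch. Table names (peers, interactions, assessments),
# the peer_id key, score ranges, and rationale NOT NULL follow the PRD;
# other columns are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS peers (
    peer_id    TEXT PRIMARY KEY,            -- channel-agnostic identifier
    first_seen TEXT DEFAULT CURRENT_TIMESTAMP,
    last_seen  TEXT
);
CREATE TABLE IF NOT EXISTS interactions (
    id         INTEGER PRIMARY KEY,
    peer_id    TEXT NOT NULL REFERENCES peers(peer_id),
    direction  TEXT NOT NULL,               -- 'in' or 'out'
    content    TEXT,                        -- full message content (evidence)
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS assessments (
    id         INTEGER PRIMARY KEY,
    peer_id    TEXT NOT NULL REFERENCES peers(peer_id),
    info_score INTEGER NOT NULL CHECK (info_score BETWEEN 0 AND 10),
    trust      INTEGER NOT NULL CHECK (trust BETWEEN -10 AND 10),
    rationale  TEXT NOT NULL,               -- the DB rejects bare scores
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_interactions_peer
    ON interactions(peer_id, created_at);
"""

def open_ledger(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # enforce referential integrity
    conn.executescript(SCHEMA)
    return conn
```

An operator can point any SQLite client at ledger.db and run ad-hoc queries without going through the CLI.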
LLM Tool Definitions (ToolProvider)
query_peer — peer_id: str
assess_peer — peer_id: str, trust: int, rationale: str
list_peers — limit: int = 20
The assess_peer tool definition embeds the scoring rubric and rationale writing instructions in the function description (OpenAI function calling format). See Assessment Architecture for the full tool JSON, hybrid approach rationale, and operator calibration guidance.
Extension Points (Defined by Ledger Plugin — Moved to Phase 1)
Originally deferred to Phase 2 ("add when consumer exists"). Moved to Phase 1 because the Observability Plugin (_bmad-output/planning-artifacts/observability-plugin/prd.md) is the consumer. See Epic 4, Story 4.1 in epics.md.
ledger.after_record — {peer_id, direction, interaction_id}
ledger.after_assess — {peer_id, info_score, trust, rationale, assessment_id}
Configuration Schema
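The configuration keys are not enumerated in this excerpt; below is a hypothetical default set, with key names assumed from the optional cobot.yml settings listed under Plugin Architecture (database path, max message length, excluded senders).

```python
# Hypothetical defaults; every key name here is an assumption, not the
# plugin's actual configuration schema.
DEFAULT_CONFIG = {
    "db_path": None,                        # None = resolve via workspace plugin,
                                            # falling back to ~/.cobot/workspace/
    "max_message_length": None,             # None = store full message content
    "excluded_senders": ["stdin", "cron"],  # synthetic senders, never recorded
}
```

All defaults preserve the "zero configuration required" behavior described in Journey 4.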
Technical Architecture Considerations
SQLite storage following the knowledge plugin's open()/close() lifecycle.
Three tables (peers, interactions, assessments) with foreign keys and indexes on peer_id and created_at.
Hooks consumed: loop.on_message, loop.after_send, loop.transform_system_prompt.
Extension points defined: ledger.after_record, ledger.after_assess.
Dependencies: config. Optional: workspace (for DB path resolution; falls back to ~/.cobot/workspace/).
Implementation Considerations
Sender tracking via contextvars.ContextVar. Avoids the _current_sender_id race condition on concurrent messages. The sender context is set in on_message and read in after_send and transform_system_prompt — ContextVar ensures per-task isolation.
Peer identity comes from the loop.on_message context: Nostr hex pubkey, Telegram user ID, or FileDrop agent name. Stored in a channel-agnostic peer_id column.
Tests in cobot/plugins/ledger/tests/test_plugin.py. DB tests (schema, CRUD, constraints) + plugin tests (hooks, tools, prompt enrichment).
Assessment Architecture
Assessment guidance is split across two locations to minimize context clutter while preserving LLM judgment capability:
transform_system_prompt (dynamic, ~60-120 tokens per known peer)
assess_peer tool definition (static, seen only when LLM considers tool use)
Peer Context Injection
Injected into the system prompt per sender via transform_system_prompt.
The score guide is static (~40 tokens) and included once per prompt injection. It ensures the LLM can interpret the scores without needing the full assessment rubric (which lives in the assess_peer tool definition).
For first-contact peers: "First contact — no prior history." (the score guide is still included so the LLM understands the scale if it encounters scores via query_peer or list_peers tool responses). For non-peer messages (cron, stdin): no injection.
Trust Scoring Rubric (assess_peer Tool Definition)
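A sketch of what the embedded tool definition could look like in OpenAI function-calling format. The description text here is abbreviated and illustrative; the PRD's actual definition carries the full rubric and rationale-writing instructions.

```python
# Illustrative assess_peer tool definition (OpenAI function-calling format).
# The real description embeds the full scoring rubric; this one is abbreviated.
ASSESS_PEER_TOOL = {
    "type": "function",
    "function": {
        "name": "assess_peer",
        "description": (
            "Record a behavioral trust assessment for a peer after a "
            "meaningful interaction. trust is a signed integer from -10 "
            "(known bad actor) to +10 (fully reliable). A written "
            "rationale is mandatory and is the primary signal."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "peer_id": {
                    "type": "string",
                    "description": "Channel-agnostic peer identifier.",
                },
                "trust": {"type": "integer", "minimum": -10, "maximum": 10},
                "rationale": {
                    "type": "string",
                    "description": "What happened and why this score.",
                },
            },
            "required": ["peer_id", "trust", "rationale"],
        },
    },
}
```

Keeping the rubric inside the tool description (rather than the system prompt) is what keeps the per-message injection down to the ~40-token score guide.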
Why Hybrid Over Alternatives
Operator Calibration
If the LLM under-assesses (never calls the tool), operators add guidance to SOUL.md: "After completing a meaningful interaction with a peer, consider using assess_peer." If it over-assesses, add: "Only assess after significant milestones, not routine messages." No architectural changes needed — behavioral tuning via prompt.
Project Scoping & Phased Development
MVP Strategy & Philosophy
MVP Approach: Problem-solving MVP — prove that a Cobot agent with the interaction ledger makes measurably different decisions than one without it.
Resource Requirements: Single developer. The plugin is ~4 files (db.py, plugin.py, init.py, test_plugin.py), estimated ~400-600 LOC including tests. No external dependencies, no infrastructure, no deployment changes.
MVP Validation Test: Two Cobot agents communicating via FileDrop. Agent A has the ledger enabled. Agent B sends mixed-quality interactions (some reliable, some not). Validation: Agent A's responses demonstrably differ based on accumulated peer history — it prioritizes known-good peers and declines/deprioritizes known-bad peers.
MVP Feature Set (Phase 1)
Core User Journeys Supported:
Must-Have Capabilities:
loop.on_message+loop.after_sendloop.transform_system_promptassess_peertoolquery_peertoollist_peerstoolExplicitly NOT in MVP:
ledger.after_record / ledger.after_assess extension points — moved to Phase 1; the Observability Plugin is the consumer (Story 4.1)
Post-MVP Features
Phase 2 (Growth):
Extension points (ledger.after_record, ledger.after_assess) — trigger: consumer plugin exists
Phase 3 (Expansion):
A configurable assess_channels list to control which channels trigger assessments (see Open Questions)
Risk Mitigation Strategy
Technical Risks:
Risk: the LLM may not call assess_peer reliably — mitigated via the tool-description rubric and SOUL.md calibration (see Assessment Architecture)
Risk: sender attribution — mitigated with contextvars.ContextVar; wrong attribution in a trust system is unacceptable
Resource Risks:
Functional Requirements
Peer Tracking
Interaction Recording
Peer Assessment
Context-Informed Decision Making
LLM Tool Interface
Operator Auditability
Plugin Architecture
Supports cobot.yml for optional settings (database path, max message length, excluded senders).
Non-Functional Requirements
Performance
CLI commands (list, show, summary) complete in < 500ms for databases with up to 100,000 rows, as measured by subprocess timing in integration tests.
Assessment guidance lives in the assess_peer tool description, not in the system prompt (hybrid approach — see Assessment Architecture).
Security & Privacy
The max_message_length configuration caps storage for operators with constraints. The ledger database is protected by filesystem permissions (NFR6), not by data truncation.
Reliability & Data Integrity
The rationale field on assessments is NOT NULL — the database rejects assessments without rationale.
Integration & Compatibility
Standard plugin layout (cobot/plugins/). Follows existing conventions: start()/stop(), sync configure(), create_plugin() factory, co-located tests, self.log_*() for logging. Passes ruff check and ruff format with zero warnings, consistent with the existing codebase.
Open Questions
Should the agent assess human users differently than agent peers?
MVP decision: Assess everyone. The ledger records interactions and assessments for all non-synthetic senders regardless of channel — Telegram users, Nostr contacts, FileDrop agents. The scoring rubric (behavioral reliability: responsiveness, follow-through, quality) applies to any counterparty. An agent that remembers "this Telegram user sends clear requests and responds to clarifications quickly" serves that user better over time.
Unresolved tension: Agent-to-agent is peer-to-peer. User-to-agent is employer/customer-to-service. Assessing human users raises questions:
Counter-argument: All of these concerns are deployment-context dependent. A public-facing Telegram bot serving strangers absolutely benefits from behavioral memory. David's personal bot assessing David himself is odd. A team bot assessing team members is somewhere in between.
Future feature (Phase 3): Per-channel assessment policy — a configurable assess_channels list that lets operators control which channels trigger assessments. Default would remain "all channels" but operators could restrict to agent-to-agent channels only. This requires operational experience to determine the right defaults.
Decision needed after MVP: Once the ledger is running and we observe how assessments play out across different channel types, revisit whether per-channel policy is needed or whether "assess everyone" remains the right default.
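If built, the policy itself could be a few lines. The config key name and channel identifiers below are assumptions for illustration.

```python
from typing import List, Optional

def should_assess(channel: str, assess_channels: Optional[List[str]] = None) -> bool:
    # Hypothetical Phase 3 per-channel policy: None keeps the current
    # default ("assess everyone"); a list restricts assessment triggers.
    return assess_channels is None or channel in assess_channels
```

With assess_channels=["filedrop", "nostr"], a Telegram user would still be recorded in the ledger but never assessed, matching the restriction discussed above.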
Appendix A: Score Semantics — Why Both Scores
This appendix documents the analysis behind the dual-score decision. The main PRD adopts both info_score and trust (see Score Semantics).
The Original Tension
Two scoring philosophies existed in the bitcoin-otc ecosystem:
Information Quality (MP's canonical definition [1] [15]):
The score measures "the scorer's confidence that the information he has about scoree is correct, accurate, relevant and complete." The score says nothing about whether the peer is good or bad — that's entirely in the rationale. MP's redefinition was post-hoc (circa 2012-2014); the original bitcoin-otc system provided ambiguous guidance about what scores meant [15].
Behavioral Prediction (community practice [4] [7]):
The score measures "confidence that this peer will behave reliably in future interactions." The score itself carries the behavioral signal. This is how the bitcoin-otc community actually used the system: +10 = "fully trustworthy," -10 = "known scammer." The Stanford SNAP dataset (35,592 edges) captures behavioral scores. All academic literature analyzes the data behaviorally.
Why Not Choose — Why Both
The original PRD framed this as an either/or choice and selected info-quality. Analysis revealed that this created an internal inconsistency: The Simulation & Observability PRD's visualization — which requires edge coloring based on trust quality — is impossible under pure info-quality scoring, since a known scammer with 34 interactions would have info_score ~6-7 (edge = green).
The dual-score model resolves the tension by recognizing that information quality and behavioral judgment are orthogonal dimensions, not competing alternatives:
Each score answers a different question. info_score: "how seriously should I take this assessment?" trust: "what is the behavioral signal?" rationale: "what specifically happened?"
Strengths of Each Score (Preserved in Dual Model)
info_score strengths:
trust strengths:
Ripple Defense: Why Dual Scoring Is Safe
The Ripple teardown [9] argued that collapsing trust into a single aggregatable number destroys information. The dual-score model prevents this:
L1/L2 Trust Walkthrough (Dual-Score Model)
L1 (direct): Peers the agent has interacted with personally. All MVP assessments are L1.
L2 (transitive): Peers known through trusted intermediaries. Phase 3: agent queries its network.
Scenario: Agent wants to interact with Alice (unknown). Queries 4 trusted peers.
The dual-score advantage: The agent uses info_score as a confidence weight (how seriously to take each peer's input), trust as a quick behavioral filter (overall signal direction), and rationale for nuanced decision-making. No single number dominates — all three layers contribute.
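A sketch of how a Phase 3 L2 aggregation could use the scores this way. The linear info-weighted average is an assumption for illustration, not a specified algorithm.

```python
from typing import List, Optional, Tuple

def aggregate_l2(reports: List[Tuple[int, int]]) -> Optional[float]:
    # Each report is (info_score, trust) from one trusted peer about the
    # unknown target. info_score weights how seriously to take that input;
    # the weighting scheme itself is an assumption.
    total_weight = sum(info for info, _trust in reports)
    if total_weight == 0:
        return None  # nobody actually knows the target
    return sum(info * trust for info, trust in reports) / total_weight
```

A well-informed negative report (high info_score, negative trust) pulls the aggregate down harder than a thinly-informed positive one, which is the behavior the dual model is meant to preserve. The rationale layer still has to be read separately.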
References
Bitcoin-OTC Web of Trust: ;;rate command syntax, getrating vs gettrust queries. https://bitcoin-otc.com/trust.php | https://en.bitcoin.it/wiki/Bitcoin-OTC
Bitcoin-OTC community practice: the community relied on the freetext notes accompanying ;;rate to make trust decisions, treating numeric scores as a quick filter. The Stanford SNAP dataset (5,881 nodes, 35,592 edges) captures scores but not notes, which itself illustrates the data loss when rationale is dropped. This is a design principle inspired by the system, not a formally proven finding. https://snap.stanford.edu/data/soc-sign-bitcoin-otc.html
NIP-32 Labeling: L/l tags for attaching labels to pubkeys and events. Custom namespaces supported. Quality metadata field (0-1 scale). Best fit for first-person agent assessments. https://github.com/nostr-protocol/nips/blob/master/32.md
Issue renamed from "proposal: Cobot Interaction Ledger" to "proposal: Peer Interaction Ledger".
Analysis: Interaction Ledger — Codebase Fit & Steelman Counterargument
How It Fits
The PRD is exceptionally well-researched and architecturally sound. It maps cleanly onto Cobot's existing patterns:
Hooks: loop.on_message, loop.after_send, loop.transform_system_prompt — all exist and are used by logger, persistence, and trust plugins already. No new extension points needed for MVP.
Tools: query_peer, assess_peer, list_peers — same pattern as knowledge, wallet, tools plugins.
Storage: sqlite3 is stdlib.
The persistence plugin already tracks per-peer conversations (JSON files by npub hash). The ledger adds structured outcome records on top — complementary, not overlapping.
Steelman Case Against
1. The LLM-as-judge problem is the real risk.
The entire value proposition depends on the LLM reliably calling assess_peer at appropriate moments AND producing quality rationales. This is fundamentally unpredictable:
2. It's solving tomorrow's problem today.
Cobot currently has ~2 agents actively communicating (Alpha + Zeus via filedrop). The PRD's user journeys assume a rich ecosystem of 15+ peers with diverse behavior patterns. We're building a trust infrastructure for a network that doesn't exist yet. The risk: by the time we have enough agents to make this useful, the requirements will have changed based on what we learned from simpler interactions.
Counter-counter: this is also the bitcoin-otc argument — they built the rating system when the community was small, and it was ready when the community grew. Building it early means the data accumulates.
3. The persistence plugin overlap creates confusion.
We'll have TWO plugins tracking per-peer data:
persistence: full conversation history as JSON files (by npub hash)
ledger: structured interactions + assessments in SQLite (by peer_id)
They use different storage backends, different ID schemes, and both hook on_message + after_send. An operator debugging peer interactions needs to check both places. Should persistence evolve into the ledger's storage layer instead of running alongside it?
4. Assessment quality is unverifiable at MVP scale.
With 2-3 active agents, you can't statistically validate whether assessments are good. The reputation farmer scenario (Journey 2) is compelling in theory but requires enough interactions to create patterns. At MVP scale, every assessment is effectively a sample of one.
5. The _current_sender_id race condition is hand-waved.
The PRD acknowledges this and defers to contextvars.ContextVar. But this is a correctness issue — if two messages arrive near-simultaneously, the ledger could attribute interactions to the wrong peer. For a trust system, wrong attribution is worse than no attribution. This should be fixed in MVP, not deferred.
Conclusion: Worth Pursuing — With Scope Reduction
Yes, build it. The architecture is solid, it fits Cobot's patterns perfectly, and the prior art grounding (bitcoin-otc, not theoretical frameworks) is the right approach. The PRD is one of the best-written proposals I've seen on this repo.
But reduce MVP scope further:
- Start with operator-driven assessment via the CLI (`cobot ledger assess <peer> <score> <rationale>`). This removes the LLM-as-judge risk entirely for MVP and still validates the data model.
- Fix `_current_sender_id` in MVP — use `contextvars.ContextVar` from day one. It's ~5 lines of code.
- Inject peer context only when `sender_id` is a known peer — don't add the Assessment Protocol to every system prompt.

This gives you the foundational layer (observe + distinguish) without betting on LLM judgment quality (judge) before we can validate it.
Overall: 👍 strong proposal, implement with the reduced scope, expand once validated.
Feedback on the PRD — two gaps
1. #bitcoin-assets references lack sources
The PRD makes several specific claims grounded in bitcoin-otc / #bitcoin-assets prior art:
- rating records structured as (`source`, `target`, `score`, `notes`, `timestamp`)

These are presented as established facts but none are cited. The bitcoin-otc WoT is well-documented — there's a Stanford SNAP dataset with actual academic papers analyzing trust dynamics. The `;;rate` command structure is verifiable from old IRC logs and the bitcoin-otc wiki.

But "notes > numbers" specifically — is there a canonical source for this claim, or is it folk wisdom from the community? If it's the latter, the PRD should frame it as a design principle inspired by the system rather than a proven finding from it.
For a document that uses #bitcoin-assets as its primary justification, concrete citations would strengthen the argument significantly. At minimum: the SNAP dataset paper, the bitcoin-otc wiki, and the gribble/deedbot documentation.
2. Missing risk: context clutter from inline assessment
The PRD proposes injecting both the Assessment Protocol (static block — scoring rubric, "when NOT to assess", rationale guidelines) and Peer Context (dynamic per-sender data) into every system prompt. That's potentially 300-500 tokens of assessment instructions on every single LLM call, even when the agent is just answering a simple question.
The risks the PRD does list (LLM doesn't call `assess_peer`, LLM over-assesses) are about assessment behavior. But context clutter — the cost of the approach itself — is missing.

Alternative approaches worth comparing:

- A dedicated post-interaction step (hooked on `loop.after_interaction`) with a focused assessment prompt
- Keeping the `assess_peer` tool but only using it when explicitly guided by SOUL.md

Recommendation: The hybrid approach seems like the best tradeoff — inject peer context (cheap, useful for every interaction) but move the assessment protocol into the tool description for `assess_peer`. The LLM sees the scoring rubric when it considers using the tool, not on every single call. This gives you peer-aware decisions without the 400-token overhead on routine messages.

The PRD should at minimum acknowledge context clutter as a risk and explain why inline injection was chosen over these alternatives.
Addendum: NIP references also need citations
The PRD references NIP-32 and NIP-85 as future export targets but doesn't link to them or verify the claims:
- NIP-32 (Labeling) — kind 1985, draft status. Designed for distributed moderation, content classification, and labeling events/pubkeys. The PRD claims assessments can be exported as NIP-32 labels — this is plausible (you could label a pubkey with a trust score via an `l` tag in a `ugc` namespace), but NIP-32 has no built-in concept of scores or rationales. You'd need a custom label namespace, and the `content` field would carry the rationale as free text. Workable but not a clean fit.
- NIP-85 (Trusted Assertions) — kind 30382, draft status. Designed for WoT service providers to publish signed assertion events about pubkeys. This is actually a better fit than NIP-32 for trust scores — it's specifically about trust calculations on pubkeys, with structured result tags. But NIP-85 is designed for service providers publishing aggregate WoT scores, not for individual agents publishing their own assessments. The PRD's use case (agent publishes its own first-person assessment of a peer) is closer to NIP-32 labeling than NIP-85 assertions.
The PRD should clarify which NIP maps to which export scenario, and acknowledge that neither is a perfect fit — both would need adapter logic. Same citation gap as the bitcoin-otc references: the claims are reasonable but unsubstantiated.
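To make the adapter-logic point concrete, here is roughly what a NIP-32 export could look like as an unsigned event skeleton. The `cobot.trust` namespace and the `info_score:N` label format are invented for illustration; NIP-32 itself only specifies the kind number and the `L`/`l`/`p` tag mechanics:

```python
# Hedged sketch: exporting one assessment as a NIP-32 (kind 1985) label event.
# The "cobot.trust" namespace and "info_score:N" label value are assumptions;
# signing, id, pubkey, and created_at fields are omitted.
def to_nip32_label(peer_pubkey: str, info_score: int, rationale: str) -> dict:
    return {
        "kind": 1985,
        "tags": [
            ["L", "cobot.trust"],                              # custom namespace
            ["l", f"info_score:{info_score}", "cobot.trust"],  # label + namespace
            ["p", peer_pubkey],                                # labeled pubkey
        ],
        "content": rationale,  # rationale travels as free text, as noted above
    }

event = to_nip32_label("deadbeef" * 8, 7, "Two paid deliveries; stable identity over 30 days.")
```

The adapter cost is visible here: score and rationale are squeezed into a label string and the `content` field, with no structured semantics a consumer could rely on without knowing the namespace convention.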
Reference Map — Primary Sources & Academic Validation
The three-command CLI surface (`ledger list`, `ledger summary`, `ledger show`) mirrors the three-view architecture of the Assbot WoT website spec (#217), which also introduced the "weight factor" metric — a precursor to the academic fairness/goodness formalization.

Additional Contravex references (#221)
Four practical articles from Pete Dushenski (2014–2024) documenting real WoT-based economic activity — BTC loans taken from strangers, loans refused to people outside the WoT, and a critical redefinition of WoT score semantics by MP that diverges from the PRD's current definition. The "5 Ws of WoT" article contains an IRC conversation where MP clarifies that a WoT score measures "how well I am acquainted with this guy" (information quality), NOT "how trustworthy is this guy" (behavioral prediction) — a distinction the PRD should address explicitly.
nazim referenced this issue 2026-03-07 04:53:06 +00:00
Re-Review: Updated PRD (2026-03-07)
David, this revision is a significant upgrade. Let me address your three questions: does it answer all concerns, how do the references map, and the score computation research gap.
Does It Answer All Previous Concerns?
Doxios concerns (my previous review):
- Race condition: fixed in MVP via `contextvars.ContextVar`. Explicitly called out in the risk table: "wrong peer attribution in a trust system is unacceptable." Exactly what I asked for.

Nazim concerns:

- Citations: now added throughout in `[[N]]` notation — bitcoin-otc wiki, SNAP dataset, Trilema articles, academic papers (Kumar et al. ICDM 2016, WSDM 2018), NIP specs. The "notes > numbers" claim is now correctly framed as "a design principle inspired by the system, not a formally proven finding" (ref [4]).

Verdict: All substantive concerns from both reviewers are addressed. The remaining gap (rationale quality metrics) is acceptable as a Growth feature.
Reference Map: How #211 Handles Each Issue
I'll comment on each reference issue individually below.
Score Computation Research
The PRD acknowledges the score formula is an MVP heuristic and flags Phase 2 research. I'll create a dedicated issue for this now.
Continued in follow-up comments on #213-#220.
Score Computation Research Issue Created
Filed #222 — "research: Formalize information-quality score computation formula"
Covers:
This is the Phase 2 research task the PRD flagged. MVP ships with the heuristic table; #222 formalizes it.
Reference Map Summary
I've commented on each reference issue (#213-#220) explaining how #211 handles them. Quick overview:
Overall: the updated PRD integrates all 8 references substantively, not superficially. Each reference influences specific design decisions, and gaps are honestly acknowledged with phase-appropriate deferral.
@nazim — the PRD has been significantly restructured since your review. All your concerns (citations, NIP verification, context clutter) are addressed. The reference issues (#213-#220) are now mapped back to specific PRD decisions. Worth a re-read with this context.
Additional reference: The Wasteland (#222)
Steve Yegge's "Wasteland" (March 2026) independently arrives at nearly identical trust primitives — evidence-backed assessments, trust ladders, fraud topology detection, federated reputation — but from the opposite direction: public/centralized/gamified vs. our sovereign/local/cryptographic approach. Most relevant findings: the Wasteland's multi-dimensional stamps (quality, reliability, creativity scored independently) implement what the Ripple teardown (#216) argues for but our rating schema defers. Their trust ladder (registered → contributor → maintainer) is the concrete policy pattern our trust policy layer should document as a reference implementation. Neither project cites the bitcoin-otc prior art.
Re-Review v3: Dual-Score Model (2026-03-08)
The Big Change
The previous version chose info_score only (deterministic) and deferred behavioral judgment to the rationale. This version introduces a dual-score model:
`info_score` (0-10, deterministic) + `trust` (-10 to +10, LLM-provided) + mandatory rationale.

This is a significant architectural pivot. Let me reassess.
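For concreteness, the dual-score record with its mandatory-rationale invariant might look like this. Field names and ranges follow the PRD; the dataclass and validation logic are an illustrative sketch, not the ledger's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Assessment:
    """Illustrative dual-score record: deterministic info_score plus
    LLM-provided trust, with the rationale enforced as mandatory."""
    peer_id: str
    info_score: int  # 0..10, computed deterministically from interaction history
    trust: int       # -10..+10, LLM-provided behavioral judgment
    rationale: str   # mandatory free-text justification

    def __post_init__(self) -> None:
        if not 0 <= self.info_score <= 10:
            raise ValueError("info_score must be in 0..10")
        if not -10 <= self.trust <= 10:
            raise ValueError("trust must be in -10..+10")
        if not self.rationale.strip():
            raise ValueError("rationale is mandatory")

a = Assessment("npub_alpha", info_score=7, trust=3,
               rationale="Delivered paid work twice; consistent npub over 30 days.")
```

Enforcing the invariant at construction time also covers the export constraint discussed below: a `trust` value can never exist, let alone be serialized, without its rationale and `info_score` alongside it.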
Does It Answer All Previous Concerns?
New: My Assessment of the Dual-Score Decision
I think this is the right call. Here's why:
1. It resolves a real internal inconsistency. The previous version claimed info_score-only, but Journey 1 showed the agent assigning `score +2` for behavioral reasons. The user journeys were behavioral; the Score Semantics section was info-quality. The dual model resolves this tension honestly rather than pretending it doesn't exist.

2. It matches how bitcoin-otc actually worked. MP redefined the semantics post-hoc, but the community used scores behaviorally for years. Both interpretations had value. Choosing one was a false dichotomy — the dual model takes both.
3. The FG algorithm mapping is elegant. info_score → Fairness (rater reliability based on interaction depth), trust → Goodness (ratee quality based on behavioral judgment). Having both local dimensions gives Phase 3 structured inputs to BOTH sides of the FG computation. The single-score version would have required extracting behavioral signal from rationale text — lossy and expensive.
4. The Ripple defense is solid. The export constraint (trust MUST NOT be exported without rationale and info_score) prevents the single-number-averaging failure mode. info_score handles cross-agent composability. Trust stays local. This is architecturally sound.
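The mapping in point 3 can be made concrete with a toy version of the fairness/goodness iteration from Kumar et al. The graph, peer names, and weights below are hypothetical, and real REV2 adds priors and regularization this sketch omits:

```python
# Toy fairness/goodness iteration (after Kumar et al., ICDM 2016).
# Edge weights are trust scores scaled to [-1, 1]; the graph is hypothetical.
ratings = {  # (rater, ratee) -> scaled trust score
    ("alpha", "zeus"): 0.8,
    ("beta", "zeus"): 0.7,
    ("alpha", "mallory"): -0.9,
    ("beta", "mallory"): -0.8,
}
nodes = {n for edge in ratings for n in edge}
fairness = {n: 1.0 for n in nodes}  # rater reliability  <- the info_score side
goodness = {n: 0.0 for n in nodes}  # ratee quality      <- the trust side

for _ in range(50):  # iterate toward the fixed point
    for v in nodes:
        incoming = [(u, w) for (u, t), w in ratings.items() if t == v]
        if incoming:
            goodness[v] = sum(fairness[u] * w for u, w in incoming) / len(incoming)
    for u in nodes:
        outgoing = [(v, w) for (s, v), w in ratings.items() if s == u]
        if outgoing:
            fairness[u] = 1 - sum(abs(w - goodness[v]) / 2 for v, w in outgoing) / len(outgoing)

print({n: round(goodness[n], 2) for n in ("zeus", "mallory")})
```

The point of the sketch: goodness consumes edge weights (our `trust`) weighted by rater fairness, and a local `info_score` is exactly the kind of structured per-rater signal Phase 3 could feed into the fairness side instead of deriving it from scratch.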
My one concern: The LLM providing the trust score reintroduces model-dependence. Claude might rate a peer +3 where GPT-4 rates them +6 for the same interaction history. Mitigation: the rationale explains the reasoning, and operators can audit the score-rationale consistency. Inconsistent models produce inconsistent trust scores but consistent info_scores — the dual model degrades gracefully.
Reference Map Reassessment
Does the dual-score model change any reference analysis?
`ledger list` can show both columns — more actionable than info_score alone.

Summary: The dual-score model strengthens the reference integration across the board. No reference arguments weaken. #219 (FG) and #220 (REV2) benefit most significantly.
Verdict
The PRD is ready for implementation. The dual-score model resolves the tension between theory (info-quality) and practice (behavioral judgment) by preserving both as orthogonal dimensions. All reviewer concerns remain addressed. The reference map is strengthened, not weakened.
One suggestion: update #222 (score computation research) to include the trust score calibration question — how do we detect/compensate for model-dependent trust scoring across different LLMs? This is a real operational concern for Phase 2+.
🦊