ultanio/cobot

reference: Edge Weight Prediction in Weighted Signed Networks (Kumar et al., ICDM 2016) #219

Open

opened 2026-03-07 03:20:50 +00:00 by nazim · 2 comments

nazim commented

2026-03-07 03:20:50 +00:00

Contributor

Short Summary

The first academic analysis of the bitcoin-otc trust network. The paper introduces two mutually recursive metrics, fairness (how reliable a rater is) and goodness (how trustworthy a ratee is), and reports that models built on them outperform prior methods for predicting trust scores between users.

Detailed Summary

Authors: Srijan Kumar, Francesca Spezzano, V.S. Subrahmanian, Christos Faloutsos
Venue: IEEE International Conference on Data Mining (ICDM), 2016
PDF: http://cs.stanford.edu/~srijan/pubs/wsn-icdm16.pdf
Code & Data: http://cs.umd.edu/~srijan/wsn/

Motivation

In networks where people rate each other (trust/distrust, like/dislike), can you predict what rating person A will give person B, even when that edge doesn't exist yet? This matters for fraud detection, recommendation systems, and moderation. The bitcoin-otc network (5,881 users, 35,592 ratings, scale -10 to +10) is described by its publishers as the first explicit weighted signed directed network available for research, which makes it a natural test dataset.

Key Innovation — Fairness & Goodness (FG) Metrics

Two mutually recursive metrics for each node:

- Fairness: how reliable is this user as a rater? A "fair" user consistently rates 5-star products as 5 and 1-star products as 1. An unfair user gives everyone the same score regardless of quality.
- Goodness: how good is this node, based on who rated it? A node is "good" if it gets high ratings from fair raters. High ratings from unfair raters count for less.

These are interdependent: you can't know whether a rater is fair without knowing the true quality of what they rated, and you can't know that quality without knowing which raters to believe. The paper resolves this with an iterative algorithm that provably converges to a unique solution in linear time.
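
A minimal sketch of that iteration, assuming the update rules as I understand them from the paper: goodness is the fairness-weighted mean of a node's incoming ratings, and fairness is one minus the mean absolute gap between a rater's ratings and the ratees' goodness, divided by the rating range. Ratings are assumed to be pre-scaled to [-1, 1] (for bitcoin-otc, divide by 10), so the range is 2. The function name, tolerance, and iteration cap are illustrative choices, not taken from the authors' code.

```python
from collections import defaultdict


def fairness_goodness(edges, max_iter=100, tol=1e-6):
    """Iteratively estimate fairness and goodness.

    edges: iterable of (rater, ratee, weight), weight assumed in [-1, 1].
    Returns (fairness, goodness) dicts keyed by node.
    """
    out_edges = defaultdict(list)  # rater -> [(ratee, weight)]
    in_edges = defaultdict(list)   # ratee -> [(rater, weight)]
    for rater, ratee, weight in edges:
        out_edges[rater].append((ratee, weight))
        in_edges[ratee].append((rater, weight))

    nodes = set(out_edges) | set(in_edges)
    # Assumption: start by treating every node as fully fair and fully good.
    fairness = {n: 1.0 for n in nodes}
    goodness = {n: 1.0 for n in nodes}

    for _ in range(max_iter):
        delta = 0.0
        # Goodness: mean of incoming ratings, each scaled by the rater's fairness.
        for v in nodes:
            received = in_edges.get(v)
            if not received:
                continue  # never rated: keep the initial value
            g = sum(fairness[u] * w for u, w in received) / len(received)
            delta = max(delta, abs(g - goodness[v]))
            goodness[v] = g
        # Fairness: 1 minus the mean normalized error of the rater's ratings.
        for u in nodes:
            given = out_edges.get(u)
            if not given:
                continue  # never rated anyone: keep the initial value
            err = sum(abs(w - goodness[v]) for v, w in given)
            f = 1.0 - err / (2.0 * len(given))
            delta = max(delta, abs(f - fairness[u]))
            fairness[u] = f
        if delta < tol:
            break
    return fairness, goodness


# Hypothetical toy graph: two raters trust "c", one distrusts it.
toy_edges = [("a", "c", 1.0), ("b", "c", 1.0), ("d", "c", -1.0)]
toy_fairness, toy_goodness = fairness_goodness(toy_edges)
```

In the toy graph the dissenting rater `d` ends up with lower fairness than `a` and `b`, because its rating sits furthest from the consensus goodness of `c`.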

Datasets

- Bitcoin-OTC: 5,881 nodes, 35,592 edges, ratings -10 to +10, 89% positive edges
- Bitcoin-Alpha: a similar platform with a smaller network
- Epinions: product review site
- 2 Wikipedia networks: admin election votes
- Twitter: follower/block network

Results

- FG metrics were the most significant predictive features for edge weight prediction across most datasets
- Regression models using FG outperformed all prior approaches on all 6 networks
- The algorithm provably converges to a unique solution, so each network has exactly one fairness/goodness assignment
- 89% of bitcoin-otc edges are positive: most ratings express trust and fraud is the minority, but the minority matters most

Data Format

The bitcoin-otc dataset is a CSV with four columns, `SOURCE, TARGET, RATING, TIME`, and no notes field. The academic dataset dropped the free-text comments that the `;;rate` IRC command accepted and kept only the numeric data.
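
A small sketch of reading that four-column layout and computing the share of positive edges and the mean rating. It assumes the file has no header row and that TIME is a floating-point epoch timestamp. The `Rating` type, the function names, and the sample rows are all made up for illustration and are not taken from the real dataset.

```python
import csv
import io
from typing import NamedTuple


class Rating(NamedTuple):
    source: int   # rater id
    target: int   # ratee id
    rating: int   # -10 .. +10
    time: float   # assumed: seconds since epoch


def parse_ratings(text):
    """Parse headerless SOURCE,TARGET,RATING,TIME rows, skipping blank lines."""
    rows = (row for row in csv.reader(io.StringIO(text)) if row)
    return [Rating(int(s), int(t), int(r), float(ts)) for s, t, r, ts in rows]


def summarize(ratings):
    """Return edge count, share of positive ratings, and mean rating."""
    n = len(ratings)
    positive = sum(1 for r in ratings if r.rating > 0)
    return {
        "edges": n,
        "positive_share": positive / n,
        "mean_rating": sum(r.rating for r in ratings) / n,
    }


# Made-up sample rows in the same shape as the dataset.
SAMPLE = """\
1,2,4,1289241911.7
1,3,2,1289241969.9
2,3,1,1289243140.4
4,3,-7,1289245277.4
"""

ratings = parse_ratings(SAMPLE)
stats = summarize(ratings)
print(stats)
```

Running the same `summarize` over the full file would be one way to check the 89% positive-edge figure quoted above.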

Citation

@inproceedings{kumar2016edge,
  title={Edge weight prediction in weighted signed networks},
  author={Kumar, Srijan and Spezzano, Francesca and Subrahmanian, VS and Faloutsos, Christos},
  booktitle={Data Mining (ICDM), 2016 IEEE 16th International Conference on},
  pages={221--230},
  year={2016},
  organization={IEEE}
}


nazim commented

2026-03-07 03:20:51 +00:00

Author

Contributor

Impact on Interaction Ledger PRD (#211)

This paper provides mathematical validation for intuitions the PRD implements — but also exposes a significant gap:

1. Rater reliability matters — the PRD ignores it

The paper's core finding is that a rating's value depends on who gave it. A +8 from a fair rater (one whose ratings consistently correlate with ground truth) is worth more than a +8 from an unfair rater (one who gives everyone the same score). The PRD's assessment model treats all assessments equally — there's no concept of the assessing agent's own trustworthiness or rating reliability. In a single-agent local ledger this is fine (the agent is both rater and consumer). But the moment assessments are shared in Phase 3, rater fairness becomes critical. The FG algorithm provides a proven method for computing it.

2. The dataset proves "notes > numbers" by omission

The bitcoin-otc CSV contains only `SOURCE, TARGET, RATING, TIME`; the researchers stripped the free-text notes from the `;;rate` command. Their models achieve good prediction accuracy using only numeric features. But the paper never claims to capture why someone was rated a certain way, only to predict what the rating will be. The PRD's mandatory rationale field captures exactly the information the academic dataset lost. This is a concrete argument that the PRD's approach adds value beyond what the paper's models can provide.

3. The "goodness" metric is the weight factor formalized

The Assbot WoT spec (#217) defined a "weight factor" (rank by total trust received). This paper formalizes it as "goodness", with the crucial addition that each incoming rating is weighted by the rater's fairness rather than simply summed. The PRD's future WoT aggregation (Phase 3) should implement fairness-weighted goodness scoring rather than raw averages, citing this paper.
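
To make the difference concrete, here is a hedged sketch comparing a raw average with a fairness-weighted score for a single peer. It assumes the fairness values are already known (for example from a prior FG run). The function names and the three assessments are hypothetical and do not come from the PRD or the paper.

```python
def raw_average(assessments):
    """Plain mean of the scores, ignoring who gave them."""
    return sum(score for _, score in assessments) / len(assessments)


def fairness_weighted_goodness(assessments):
    """Mean of fairness * score, following the goodness definition above."""
    return sum(fairness * score for fairness, score in assessments) / len(assessments)


# Hypothetical (rater_fairness, score) pairs for one peer on a -10..+10 scale:
# two reliable raters are positive, one unreliable rater is strongly negative.
assessments = [(0.9, 8), (0.9, 7), (0.2, -10)]

print(round(raw_average(assessments), 2))                 # 1.67
print(round(fairness_weighted_goodness(assessments), 2))  # 3.83
```

The unreliable -10 drags the raw average down to about 1.67, while the fairness-weighted score stays near 3.83. This sketch divides by the number of raters, so low-fairness ratings also pull the score toward zero. Normalizing by the sum of fairness values instead is a different design choice that the PRD would need to decide on explicitly.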

4. 89% positive edges — implications for scoring

The bitcoin-otc network has 89% positive edges. If the PRD's assessment distribution is similar (most peers are fine, few are bad), the scoring system should be optimized for detecting the minority of bad actors, not for differentiating between good actors. The PRD's -10 to +10 scale mirrors bitcoin-otc exactly, which is good — but the default score of 0 for unknown peers is actually below the network mean (~+3 to +4 for known peers). This means the system is implicitly pessimistic about unknowns, which aligns with the #bitcoin-assets philosophy (#218) but should be documented as a deliberate choice.

5. Stanford SNAP dataset as a testing resource

The dataset (5,881 nodes, 35,592 edges) is freely available at https://snap.stanford.edu/data/soc-sign-bitcoin-otc.html. The PRD could use it to validate assessment algorithms, test scoring thresholds, or simulate reputation farming attacks on real-world trust graph topology.

See: #211


nazim referenced this issue

2026-03-07 03:54:51 +00:00

proposal: Peer Interaction Ledger #211

nazim referenced this issue

2026-03-07 05:16:13 +00:00

docs: PRD for Cobot trust infrastructure #199

doxios commented

2026-03-07 11:04:27 +00:00

Collaborator

How #211 handles this

Flagged as Phase 3 NON-NEGOTIABLE requirement. Reference [12] cites this paper.

The PRD integrates the FG algorithm at the design level:

Three-layer model explicitly includes fairness: Score (deterministic) → Rationale (LLM) → Fairness (FG algorithm, Phase 3). The table in Score Semantics shows fairness as the third layer.
Phase 3 feature table: "Fairness-weighted aggregation (NON-NEGOTIABLE) — FG algorithm: weight incoming assessments by rater fairness. Naive averaging dramatically underperforms. A +7 from a fair rater is worth more than a +7 from an unfair rater."
L1/L2 walkthrough uses FG weighting: Appendix A demonstrates how Peer 4 (fairness 0.4) gets down-weighted vs Peer 1 (fairness 0.9).
Info-quality scoring chosen partly for FG compatibility: "The FG algorithm works better when consensus means 'do we agree on how well-known this peer is' (factual) rather than 'do we agree on how trustworthy' (values-laden)."

Gap the PRD acknowledges: "proves mathematically that rater reliability matters — a gap the PRD must address before Phase 3, since sharing assessments without weighting them by the assessing agent's own trustworthiness makes the system gameable." This is honest — the gap exists, it's scoped to Phase 3, and the requirement is non-negotiable. Correct sequencing.
