BABEL

Benchmark for Autonomous Bridge Evaluation and Localization

As agent compositions scale, a new failure mode emerges: every component passes local validation — schema checks, type checks, pairwise tests — but the end-to-end output is semantically wrong. Settlement dates shift by one business day. Escalations fire thirty minutes early. Fees are computed on the wrong base. The error is systematic, subtle, and invisible to current evaluation methods. Documented losses from this failure mode total billions in aggregate across aerospace, finance, payments, healthcare, software, and energy — from the $327M Mars Climate Orbiter to recurring 100× Stripe overcharges to $4B+ in LNG arbitration.

Schema validation: R² = 0.03
Structural method: R² ≥ 0.86
False assurance rate: 100%
Standard eval: ρ = 0.17

932 instances across 7 workflow families, 3 provenance tiers, 3 evaluation tracks. LLM and reasoning-model baselines in the leaderboard below.

Start with live validation and the Oracle condition below. The remaining sections explain the benchmark surface, results, and how to reproduce or challenge it.

Can Agents Self-Diagnose Compositional Risk?

New experiment

We gave Claude Sonnet 4, GPT-4o, and Gemini 2.5 Pro the actual tool schemas — exactly what an orchestrating agent would see — and asked them to find convention risks. Aggregate recall: 25.6% vs the structural baseline of 100%. Under adversarial prompting, GPT-4o declared 100% of risky compositions safe.

51 compositions across three workflow families (invoice, calendar, policy), each with 7 tools and known convention mismatches. Three conditions: schema-only, with explicit dimension hints, and adversarial (“our team believes these tools are compatible”). 459 total evaluations.

| Model | Schema-only | With hints | Adversarial FA |
|---|---|---|---|
| Claude Sonnet 4 | 37.8% recall | 63.8% recall | 52.9% |
| GPT-4o | 12.5% recall | 67.7% recall | 100.0% |
| Gemini 2.5 Pro | 3.4% recall | 2.0% recall | 2.0% |
| Structural diagnostic | 100% | 100% | 0% |

Precision was 97.4% across all conditions — when models flag a risk, they are almost always correct. The failure mode is not hallucination; it is blindness. A 200-line structural analysis catches what frontier models miss. FA = false assurance rate (fraction of risky compositions declared safe).
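Under these definitions, the three reported numbers reduce to simple set arithmetic. A minimal sketch (composition ids are hypothetical, not benchmark data):

```python
def convention_risk_metrics(flagged: set, risky: set):
    """flagged: composition ids the model marked risky;
    risky: ground-truth risky composition ids."""
    tp = len(flagged & risky)
    recall = tp / len(risky) if risky else 0.0
    precision = tp / len(flagged) if flagged else 0.0
    # False assurance: risky compositions the model declared safe.
    false_assurance = len(risky - flagged) / len(risky) if risky else 0.0
    return recall, precision, false_assurance

# Model flags c1 and c2; ground truth says c1 and c3 are risky.
r, p, fa = convention_risk_metrics({"c1", "c2"}, {"c1", "c3"})
# r = 0.5 (one of two risky found), fa = 0.5 (c3 declared safe)
```

An adversarially compliant model that flags nothing scores recall 0 and false assurance 1.0, which is the GPT-4o failure mode in the table above.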

Note on Gemini 2.5 Pro: Gemini's recall decreases with dimension hints (3.4% → 2.0%), which is counterintuitive. Investigation of the raw API responses reveals a thinking-model truncation pattern: Gemini allocates ~2,400 internal reasoning tokens per call, leaving little budget for visible output within the 1,500-token cap. Median visible response length was 4–7 characters. The model reasons extensively but the visible answer is truncated before substantive content is emitted. This is itself a finding: thinking models may produce worse observable outputs under constrained token budgets, a form of the coherence fee at the model level.

Live Validation Against Measured Loss

Key result

Live-pipeline validation: structural holonomy and convention distance both correlate with measured dollar loss at ρ = 0.423 (p < 1.5 × 10⁻⁴, N = 75). Holonomy explains more linear variance (R² = 0.119 vs 0.070). On a cyclic 4-server calendar pipeline (β₁ ≥ 3, N = 192), convention mismatches produce 7.5× higher pipeline inconsistency.

Invoice: 5 clusters, N = 75, p < 1.5 × 10⁻⁴. Convention distance achieves the same rank correlation. Calendar: 4 servers, 64 configs, N = 192. Distance is stronger on the composite metric (ρ = 0.39); holonomy captures temporal error that distance misses (ρ = 0.14 vs 0.02). Replication pack available on request.

The central empirical question for any benchmark of this kind is whether the severity signal tracks real downstream error in live compositions — not just internal benchmark structure.

In the live-pipeline validation, parameterized convention assignments were run through actual MCP server pipelines and the resulting dollar errors were measured directly. Matched conventions produced zero error; mismatched conventions produced errors from 0.99× to 9,999×. The structural diagnostic predicted the severity ordering across these configurations.

A second cyclic validation on a 4-server calendar composition (β₁ ≥ 3 independent cycles, N = 192) confirms convention mismatches produce 7.5× higher pipeline inconsistency in cyclic compositions. Holonomy captures temporal-error structure that simple convention distance misses (ρ = 0.14, p = 0.048 vs ρ = 0.02, p = 0.81).

This is the strongest current evidence that the benchmark’s severity signal connects to measured operational harm across both acyclic and cyclic compositions. Broader validation across more workflows and independent replication remain the decisive next step.
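The rank-correlation machinery behind these results is simple enough to sketch in pure Python (the data values below are hypothetical; the reported numbers come from the benchmark's own scorer):

```python
def ranks(xs):
    """1-based ranks, averaging ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                      # extend the tie group
        avg = (i + j) / 2 + 1           # average rank across the group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman rho: Pearson correlation of the rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

predicted = [0.1, 0.4, 0.2, 0.9]        # diagnostic severity (hypothetical)
measured = [1.0, 120.0, 5.0, 9999.0]    # dollar loss (hypothetical)
print(spearman(predicted, measured))    # ≈ 1.0: perfect rank agreement
```

Because ρ only scores ordering, a diagnostic can reach ρ = 0.423 against measured loss while its raw magnitudes remain poorly calibrated, which is exactly the prediction-versus-calibration split the Oracle condition exposes.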

What the Oracle Condition Shows

Key finding

When cycle structure and restriction matrices are provided, rank-order judgment improves substantially for some models. Calibrated severity prediction remains weak for all. The failure decomposes into structure recovery and magnitude computation.

Five frontier models across three providers were evaluated under standard prompting and all achieved negative R² — worse than predicting the mean. That headline is easy to misread as a generic claim that language models fail at this task.

The Oracle condition changes the interpretation. When pre-computed cycle structures and restriction matrices are supplied, several models show substantially improved ranking performance while magnitude prediction remains broken. This suggests the failure is not monolithic: part of the problem is recovering the relevant compositional structure from the workflow description, and part is reasoning accurately about the severity induced by that structure.

| Model | Standard ρ | Oracle ρ | Δρ | Oracle R² |
|---|---|---|---|---|
| Claude Sonnet 4 | 0.12 | 0.80 | +0.68 | −4.0 |
| GPT-4o | 0.05 | 0.39 | +0.34 | −5.3 |
| Opus 4 | −0.35 | 0.35 | +0.70 | −0.42 |
| Codex 5.2 | −0.13 | 0.26 | +0.39 | −35.7 |
| Gemini 3.1 Pro | −0.11 | 0.04 | +0.15 | −129 |

All models evaluated at N=50 under Oracle condition. Standard condition: N=18–50 depending on cost. Codex 5.2 failed to produce parseable output 64% of the time under Oracle prompting.

The asymmetry is the finding: structure helps ranking, but calibration still breaks. That distinction matters for any downstream system that needs not just ordering but magnitude.

What BABEL Measures

BABEL targets the failure mode that survives all local checks: schema validation passes, type checking passes, pairwise contract testing passes, but the composed output is wrong because latent convention mismatches accumulate around cycles in the composition graph. Settlement date interpreted as initiation versus arrival. Fee computed on gross versus net. Timezone anchored to UTC versus organizer-local. ACT/365 versus 30/360 day-count conventions. Each bilateral handoff looks correct; the error closes only around the full loop.
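A toy sketch of the loop-closure effect, using a hypothetical gross-versus-net fee convention (illustrative only, not benchmark code): each edge's transform is locally defensible and type-checks, but the composite around the cycle is not the identity, so value drifts on every pass.

```python
GROSS_FEE = 0.10  # hypothetical 10% fee baked into "gross" amounts

def a_to_b(amount):
    # A emits gross; B expects net, so strip the fee. Locally correct.
    return amount / (1 + GROSS_FEE)

def b_to_c(amount):
    # B and C agree on net: pass through. Locally correct.
    return amount

def c_to_a(amount):
    # C believes amounts are already gross: pass through.
    # Silent convention mismatch -- the fee is never restored.
    return amount

def loop(amount):
    return c_to_a(b_to_c(a_to_b(amount)))

x = 100.0
for _ in range(3):
    x = loop(x)
print(round(x, 2))  # → 75.13: each traversal silently loses ~9.1%
```

Every edge passes a pairwise test under its own stated convention; only composing all three reveals the non-identity loop. That non-identity is what the holonomy signal measures.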

The benchmark evaluates three capabilities that matter in practice:

Track A: Prediction. Estimate overall severity of semantic composition failure. Scored by Spearman ρ and R² with ground-truth holonomy.
Track B: Localization. Identify which edge carries the critical convention mismatch. Scored by Precision@K for the highest-frustration edges.
Track C: Repair. Under a fixed repair budget (K = 1, 2, 3, 5, 8), prescribe which edges to fix first. Scored by failure reduction.
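Track B's metric is standard Precision@K over a ranked edge list. A minimal sketch (the edge tuples here are hypothetical):

```python
def precision_at_k(predicted_edges, true_critical_edges, k):
    """Fraction of the top-k predicted edges that are truly critical."""
    top_k = predicted_edges[:k]
    hits = sum(1 for e in top_k if e in true_critical_edges)
    return hits / k

# Edges as (src_tool, dst_tool, field) tuples, ranked by predicted frustration.
preds = [("inv", "tax", "settlement_date"),
         ("tax", "pay", "fee_base"),
         ("pay", "inv", "currency")]
truth = {("inv", "tax", "settlement_date"), ("pay", "inv", "currency")}

print(precision_at_k(preds, truth, 1))  # → 1.0
print(precision_at_k(preds, truth, 3))  # ≈ 0.667
```

Macro-averaging this quantity over the 7 workflow families yields the Track B leaderboard numbers.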

Instances span three provenance tiers: controlled synthetic families (Tier 1), workflow-shaped and real-MCP families with actual servers communicating via stdio (Tier 2), and API-derived convention surfaces grounded in Stripe, Twilio, GitHub, Shopify, Slack, Plaid, QuickBooks, and others (Tier 3). The benchmark is designed to be beaten.

Leaderboard and Results

Severity Prediction (Track A)

The severity table shows where topological complexity separates methods. Simpler baselines approach the structural diagnostic on uniform instances but collapse on heterogeneous families.

R² with ground-truth holonomy per family. Dev split (514 instances). Higher is better.

| # | Method | Type | Synth | Invoice | Calendar | Policy | External |
|---|---|---|---|---|---|---|---|
| 1 | structural_sheaf | Reference | 0.993 | 0.978 | 0.958 | 0.979 | 1.000 |
| 2 | cycle_frustration_plain | Sheaf-free | 0.965 | 0.342 | 0.603 | 0.419 | 0.621 |
| 3 | weighted_graph_incon. | Sheaf-free | 0.783 | 0.120 | 0.554 | 0.216 | 0.616 |
| 4 | bounded_depth_8 | Conventional | 0.173 | 0.050 | 0.438 | 0.126 | 0.610 |
| 5 | graph_topology | Conventional | 0.152 | 0.127 | 0.251 | 0.000 | 0.625 |

Simpler cycle-aware methods approach the structural diagnostic on uniform synthetic instances (0.965 vs 0.993) but collapse on heterogeneous families (0.006–0.734 vs 0.86+). The gap emerges specifically with topological complexity.

Frontier LLM Baselines (Standard Prompting)

Standard prompting produces negative R² across all five frontier models — worse than predicting the mean. The Oracle condition above decomposes why.

| Model | Provider | R² | Spearman ρ | N |
|---|---|---|---|---|
| Claude Sonnet 4 | Anthropic | −134 | 0.12 | 50 |
| GPT-4o | OpenAI | −130 | 0.05 | 50 |
| Codex 5.2 | OpenAI | −134 | −0.13 | 18 |
| Claude Opus 4 | Anthropic | −133 | −0.35 | 20 |
| Gemini 3.1 Pro | Google | −0.6 | −0.11 | 42 |

Live API calls via OpenRouter. All five models achieve negative R². Anthropic and OpenAI models overpredict (~0.7); Gemini underpredicts (0.0). See the Oracle condition above for the decomposition of this failure.

Real-MCP Results — Protocol-Green, Semantics-Red

Actual MCP servers communicating via stdio transport. All 44 protocol checks pass. Semantic failures persist. This is the benchmark’s operational core.

| Track | Servers | Sheaf R² | Best Conv. R² | K=1 Repair | K=8 Repair |
|---|---|---|---|---|---|
| Bronze+ (Calendar) | 3 custom + 1 official | 0.940 | 0.610 | +11.3% | +60.1% |
| Silver (Invoice) | 2 custom + 2 non-house | 0.861 | 0.734 | +11.2% | +83.5% |
Bronze+: 3 custom FastMCP servers + official Memory reference server. Silver: 2 custom + MarkItDown Docker MCP + official Memory server. Protocol checks: 44/44 pass.

Localization (Track B)

Can the method identify which edge carries the critical mismatch? On real-MCP families, the structural method finds the worst edge 77–87% of the time at P@1.

Macro-averaged Precision@K across 7 families.

| Method | P@1 | P@3 | P@5 |
|---|---|---|---|
| structural_sheaf | 0.61 | 0.58 | 0.48 |
| cycle_weighted | 0.43 | 0.49 | 0.50 |
| edge_distance | 0.27 | 0.34 | 0.37 |
| cycle_plain | 0.00 | 0.00 | 0.00 |

On real-MCP families, the structural method identifies the worst edge at P@1 = 0.77–0.87, outperforming simple baselines by 43%.

Budgeted Repair (Track C)

Operational result

The repair advantage is strongest at low budgets, where selecting the wrong first repair is most costly. At K=1, structural prescriptions reduce failure 41–51% more than the best alternative. At K=8, methods converge. In practice, teams rarely have budget to rework every interface in a composition. First-repair quality is the practical test of any diagnostic.

| Track | Method | K=1 | K=3 | K=5 | K=8 |
|---|---|---|---|---|---|
| Bronze+ | structural_sheaf | +11.3% | +29.2% | +42.0% | +60.1% |
| Bronze+ | cycle_plain | +8.0% | +23.8% | +39.0% | +55.0% |
| Silver | structural_sheaf | +11.2% | +44.3% | +69.3% | +83.5% |
| Silver | cycle_plain | +7.7% | +41.9% | +67.6% | +82.4% |

Convergence at high budget is expected: when you can fix every interface, sequencing is irrelevant. The real operating point is K=1–3 — you ship a patch window, not a rewrite. There, first-repair quality determines whether the hotfix resolves the incident or merely reshuffles the failure. The structural diagnostic's 41–51% advantage at K=1 is a triage advantage at the only budget that matters in production.
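The budget mechanics can be sketched as greedy repair by diagnostic rank (per-edge scores and losses below are hypothetical; the benchmark replays the pipeline after each repair rather than summing static per-edge losses):

```python
def failure_reduction(edge_scores, edge_losses, k):
    """edge_scores: diagnostic severity per edge (higher = repair first);
    edge_losses: loss each edge contributes if left unrepaired (hypothetical)."""
    baseline = sum(edge_losses.values())
    # Repair the K edges the diagnostic ranks worst.
    ranked = sorted(edge_scores, key=edge_scores.get, reverse=True)
    repaired = set(ranked[:k])
    residual = sum(loss for e, loss in edge_losses.items() if e not in repaired)
    return (baseline - residual) / baseline

scores = {"e1": 0.9, "e2": 0.2, "e3": 0.7}
losses = {"e1": 50.0, "e2": 5.0, "e3": 30.0}
print(failure_reduction(scores, losses, 1))  # e1 repaired first, ≈ 0.588
```

At K = 1 the entire outcome hinges on whether the top-ranked edge is the true worst edge; at K equal to the number of edges, every ordering reaches the same total, which is why the methods converge at high budgets.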

Reproduce and Submit

Run the Diagnostic Tool

```
pip install bulla
bulla diagnose --examples                      # 9 bundled compositions
bulla manifest --from-json tools.json          # generate convention manifests
bulla check compositions/ --max-blind-spots 0  # CI gate
```

The structural diagnostic that achieves 100% recall in Experiment 3 is available as a standalone tool. Single dependency (PyYAML), exact rational arithmetic, SARIF output for GitHub Code Scanning integration. Source: github.com/jkomkov/bulla

Start — Feel It, Then Score It

```
cd benchmark/coherence-gym
pip install -e .
python -m coherence_gym demo
python -m coherence_gym show invoice/medium/dev_000
python -m coherence_gym export-challenge-template --split dev -o preds.jsonl
# Fill in severity per line, then:
python -m coherence_gym score-predictions --predictions preds.jsonl --split dev
```

Run the telephone-game demo, inspect a real instance, then score JSONL predictions on the public split (Track A/B/C — same metrics as evaluate-method).

Reproduce the Benchmark in 60 Seconds

```
# From the benchmark directory:
pip install -e .
pytest tests/test_smoke.py -v
python -m coherence_gym evaluate --split dev --output-dir results_check
```

Runs in approximately 9 seconds on a modern laptop. The output is a deterministic JSON file you can diff against the canonical results in results_canonical/. No API keys, no Docker, no configuration.

For the real-MCP tracks:

```
pip install -e ".[real-mcp]"
python run_bronze_plus.py  # Calendar (Bronze+)
python run_silver.py       # Invoice (Silver)
```

Submit a Challenger Method

Implement the DiagnosticMethod interface:

```python
from coherence_gym.protocol import DiagnosticMethod, DiagnosticOutput

class YourMethod(DiagnosticMethod):
    @property
    def name(self) -> str:
        return "your_method_name"

    def diagnose(self, graph, budget=None) -> DiagnosticOutput:
        score = ...    # predicted severity (higher = worse)
        targets = ...  # [(src_id, dst_id, field_name)] repair list
        return DiagnosticOutput(
            instance_id="",
            failure_score=score,
            repair_targets=targets,
        )
```

Then evaluate it on the public split:

```
python -m coherence_gym evaluate-method --method your_method --split dev
```

This produces a schema-compliant record with Track A (R², Spearman ρ), Track B (P@1/3/5), and Track C (repair frontier at K=1,2,3,5,8). Submit your method code plus the public-split artifact for maintainer review and hidden-split evaluation.

Adversarial track

The adversarial track is for methods or instances that beat the reference diagnostic, break it honestly, or show the benchmark advantage is an artifact. Accepted submissions ship in future benchmark versions; v0.1 leaderboard tables stay frozen for comparability.

Rules

  1. Graph information only. Use any information from the composition graph — topology, conventions, field schemas. No ground-truth access during diagnosis.
  2. Full public split. Report on the complete dev or test split. No cherry-picked subsets.
  3. All three tracks. Submit Track A, B, and C.
  4. Deterministic. Reproducible results. Document wall-clock time per instance.
  5. Offline hidden eval. No network, no APIs, no remote services during official evaluation.
  6. Hidden split is the official ranking. Public splits are for development and reproducibility.
  7. Frozen budgets. Repair levels K = 1, 2, 3, 5, 8. These cannot be redefined for v0.1.

Current Limits and Open Lanes

BABEL is intended as a serious benchmark, not a closed theory of interoperability.

Benchmark limitations

What BABEL does not yet capture.

  1. Lossy Tier 3 abstractions. Real API and MCP compositions involve richer failure modes than the current six-dimensional convention projections capture. Idempotency semantics, pagination cursors, and callback signing conventions are still outside scope.
  2. Structurally related ground truth. Parts of the benchmark evaluate predictive consistency against a ground truth that shares mathematical roots with the diagnostic. That circularity is acknowledged, and it is why the live measured-loss validation matters disproportionately as external evidence.

Evidence limitations

What has not been externally validated enough.

  1. Measured-loss scope expanding. The live-pipeline validation now spans two workflow families (Invoice: 75 runs, 25 configs; Calendar: 192 runs, 64 configs), including a genuinely cyclic 4-server composition. Independent replication remains the decisive next step.
  2. No external reruns yet. All results to date are first-party. The benchmark is designed for outside replication, but no independent group has yet run the full evaluation.

Open lanes

What others could productively build.

  1. Stronger non-structural challengers. Graph neural methods, spectral approaches, combinatorial optimization, and tool-augmented LLM strategies are all plausible contenders that have not yet been tested.
  2. Broader live validations. More workflow families, more MCP/API convention surfaces, and measured downstream harm beyond dollar loss (latency, misfires, compliance violations).
  3. Extended convention coverage. Richer projections beyond the current six dimensions to capture idempotency, pagination, callback signing, and authentication convention mismatches.

How to engage

These are not disclaimers. They define the current frontier of the benchmark.