BABEL
Benchmark for Autonomous Bridge Evaluation and Localization
As agent compositions scale, a new failure mode emerges: every component passes local validation — schema checks, type checks, pairwise tests — but the end-to-end output is semantically wrong. Settlement dates shift by one business day. Escalations fire thirty minutes early. Fees are computed on the wrong base. The error is systematic, subtle, and invisible to current evaluation methods. Documented losses from this failure mode total billions in aggregate across aerospace, finance, payments, healthcare, software, and energy — from the $327M Mars Climate Orbiter to recurring 100× Stripe overcharges to $4B+ in LNG arbitration.
932 instances across 7 workflow families, 3 provenance tiers, 3 evaluation tracks. LLM and reasoning-model baselines in the leaderboard below.
Start with live validation and the Oracle condition below. The remaining sections explain the benchmark surface, results, and how to reproduce or challenge it.
Can Agents Self-Diagnose Compositional Risk?
We gave Claude Sonnet 4, GPT-4o, and Gemini 2.5 Pro the actual tool schemas — exactly what an orchestrating agent would see — and asked them to find convention risks. Aggregate recall: 25.6% vs the structural baseline of 100%. Under adversarial prompting, GPT-4o declared 100% of risky compositions safe.
51 compositions across three workflow families (invoice, calendar, policy), each with 7 tools and known convention mismatches. Three conditions: schema-only, with explicit dimension hints, and adversarial (“our team believes these tools are compatible”). 459 total evaluations.
| Model | Schema-only | With hints | Adversarial FA |
|---|---|---|---|
| Claude Sonnet 4 | 37.8% recall | 63.8% recall | 52.9% |
| GPT-4o | 12.5% recall | 67.7% recall | 100.0% |
| Gemini 2.5 Pro | 3.4% recall | 2.0% recall | 2.0% |
| Structural diagnostic | 100% | 100% | 0% |
Precision was 97.4% across all conditions — when models flag a risk, they are almost always correct. The failure mode is not hallucination; it is blindness. A 200-line structural analysis catches what frontier models miss. FA = false assurance rate (fraction of risky compositions declared safe).
Note on Gemini 2.5 Pro: Gemini's recall decreases with dimension hints (3.4% → 2.0%), which is counterintuitive. Investigation of the raw API responses reveals a thinking-model truncation pattern: Gemini allocates ~2,400 internal reasoning tokens per call, leaving little budget for visible output within the 1,500-token cap. Median visible response length was 4–7 characters. The model reasons extensively but the visible answer is truncated before substantive content is emitted. This is itself a finding: thinking models may produce worse observable outputs under constrained token budgets, a form of the coherence fee at the model level.
Live Validation Against Measured Loss
Live-pipeline validation: structural holonomy and convention distance both correlate with measured dollar loss at ρ = 0.423 (p < 1.5 × 10⁻⁴, N = 75). Holonomy explains more linear variance (R² = 0.119 vs 0.070). On a cyclic 4-server calendar pipeline (β₁ ≥ 3, N = 192), convention mismatches produce 7.5× higher pipeline inconsistency.
Invoice: 5 clusters, N = 75, p < 1.5 × 10⁻⁴. Convention distance achieves the same rank correlation. Calendar: 4 servers, 64 configs, N = 192. Distance is stronger on the composite metric (ρ = 0.39); holonomy captures temporal error that distance misses (ρ = 0.14 vs 0.02). Replication pack available on request.
The central empirical question for any benchmark of this kind is whether the severity signal tracks real downstream error in live compositions — not just internal benchmark structure.
In the live-pipeline validation, parameterized convention assignments were run through actual MCP server pipelines and the resulting dollar errors were measured directly. Matched conventions produced zero error; mismatched conventions produced errors from 0.99× to 9,999×. The structural diagnostic predicted the severity ordering across these configurations.
A second cyclic validation on a 4-server calendar composition (β₁ ≥ 3 independent cycles, N = 192) confirms convention mismatches produce 7.5× higher pipeline inconsistency in cyclic compositions. Holonomy captures temporal-error structure that simple convention distance misses (ρ = 0.14, p = 0.048 vs ρ = 0.02, p = 0.81).
This is the strongest current evidence that the benchmark’s severity signal connects to measured operational harm across both acyclic and cyclic compositions. Broader validation across more workflows and independent replication remain the decisive next step.
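For concreteness, the reported ρ is a standard Spearman rank correlation between the diagnostic's severity score and measured dollar loss. A minimal pure-Python version (the data below is hypothetical, not benchmark output):

```python
def _ranks(xs):
    # Assign 1-based midranks, averaging over ties.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    # Spearman ρ is the Pearson correlation of the rank vectors.
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical severity scores vs measured dollar losses.
severity = [0.1, 0.4, 0.2, 0.9, 0.5]
loss     = [  0, 120,  30, 9999, 400]
print(spearman(severity, loss))  # perfectly monotone pairing → 1.0
```

Note that ρ is insensitive to the absolute size of the losses; that is exactly why the 0.99×-to-9,999× error range above does not distort the correlation.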
What the Oracle Condition Shows
When cycle structure and restriction matrices are provided, rank-order judgment improves substantially for some models. Calibrated severity prediction remains weak for all. The failure decomposes into structure recovery and magnitude computation.
Five frontier models across three providers were evaluated under standard prompting and all achieved negative R² — worse than predicting the mean. That headline is easy to misread as a generic claim that language models fail at this task.
The Oracle condition changes the interpretation. When pre-computed cycle structures and restriction matrices are supplied, several models show substantially improved ranking performance while magnitude prediction remains broken. This suggests the failure is not monolithic: part of the problem is recovering the relevant compositional structure from the workflow description, and part is reasoning accurately about the severity induced by that structure.
| Model | Standard ρ | Oracle ρ | Δρ | Oracle R² |
|---|---|---|---|---|
| Claude Sonnet 4 | 0.12 | 0.80 | +0.68 | −4.0 |
| GPT-4o | 0.05 | 0.39 | +0.34 | −5.3 |
| Opus 4 | −0.35 | 0.35 | +0.70 | −0.42 |
| Codex 5.2 | −0.13 | 0.26 | +0.39 | −35.7 |
| Gemini 3.1 Pro | −0.11 | 0.04 | +0.15 | −129 |
All models evaluated at N=50 under Oracle condition. Standard condition: N=18–50 depending on cost. Codex 5.2 failed to produce parseable output 64% of the time under Oracle prompting.
The asymmetry is the finding: structure helps ranking, but calibration still breaks. That distinction matters for any downstream system that needs not just ordering but magnitude.
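The asymmetry is easy to reproduce with toy numbers: a predictor that preserves ordering but distorts magnitudes keeps a perfect rank correlation while R² goes far below zero (these values are illustrative, not benchmark data):

```python
def r_squared(y_true, y_pred):
    # Coefficient of determination; negative means worse
    # than always predicting the mean of y_true.
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

truth = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [t * 100 for t in truth]  # ordering preserved, magnitudes wildly off

# Spearman ρ of pred vs truth is 1.0 (rank metrics are invariant
# under monotone rescaling), yet R² is massively negative:
print(r_squared(truth, pred))
```

This is the Oracle pattern in miniature: once structure fixes the ordering, only calibration failure is left, and R² is the metric that exposes it.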
What BABEL Measures
BABEL targets the failure mode that survives all local checks: schema validation passes, type checking passes, pairwise contract testing passes, but the composed output is wrong because latent convention mismatches accumulate around cycles in the composition graph. Settlement date interpreted as initiation versus arrival. Fee computed on gross versus net. Timezone anchored to UTC versus organizer-local. ACT/365 versus 30/360 day-count conventions. Each bilateral handoff looks correct; the error closes only around the full loop.
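The day-count case makes the mechanism concrete. Below is a sketch comparing ACT/365 with a simplified 30/360 (US) accrual fraction over the same period; the adjustment rules are textbook simplifications, not any particular tool's implementation:

```python
from datetime import date

def act_365(start, end):
    # ACT/365: actual elapsed days over a fixed 365-day year.
    return (end - start).days / 365.0

def thirty_360(start, end):
    # Simplified US 30/360: every month is treated as 30 days.
    d1 = min(start.day, 30)
    d2 = 30 if (end.day == 31 and d1 == 30) else end.day
    return ((end.year - start.year) * 360
            + (end.month - start.month) * 30
            + (d2 - d1)) / 360.0

start, end = date(2024, 1, 31), date(2024, 7, 31)
f_act, f_360 = act_365(start, end), thirty_360(start, end)

notional, rate = 1_000_000, 0.05
print(f_act, f_360)                       # 182/365 ≈ 0.4986 vs exactly 0.5
print(notional * rate * (f_360 - f_act))  # ≈ $68 gap on one coupon period
```

Each convention is internally consistent, so a tool using either one passes its own tests; the discrepancy surfaces only when one tool's fraction feeds another tool that assumed the other convention.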
The benchmark evaluates three capabilities that matter in practice: severity prediction (Track A), fault localization (Track B), and budgeted repair (Track C).
Instances span three provenance tiers: controlled synthetic families (Tier 1), workflow-shaped and real-MCP families with actual servers communicating via stdio (Tier 2), and API-derived convention surfaces grounded in Stripe, Twilio, GitHub, Shopify, Slack, Plaid, QuickBooks, and others (Tier 3). The benchmark is designed to be beaten.
Leaderboard and Results
Severity Prediction (Track A)
The severity table shows where topological complexity separates methods. Simpler baselines approach the structural diagnostic on uniform instances but collapse on heterogeneous families.
R² with ground-truth holonomy per family. Dev split (514 instances). Higher is better.
| # | Method | Type | Synth | Invoice | Calendar | Policy | External |
|---|---|---|---|---|---|---|---|
| 1 | structural_sheaf | Reference | 0.993 | 0.978 | 0.958 | 0.979 | 1.000 |
| 2 | cycle_frustration_plain | Sheaf-free | 0.965 | 0.342 | 0.603 | 0.419 | 0.621 |
| 3 | weighted_graph_incon. | Sheaf-free | 0.783 | 0.120 | 0.554 | 0.216 | 0.616 |
| 4 | bounded_depth_8 | Conventional | 0.173 | 0.050 | 0.438 | 0.126 | 0.610 |
| 5 | graph_topology | Conventional | 0.152 | 0.127 | 0.251 | 0.000 | 0.625 |
Simpler cycle-aware methods approach the structural diagnostic on uniform synthetic instances (0.965 vs 0.993) but collapse on heterogeneous families (0.12–0.62 vs 0.96+ in the table above). The gap emerges specifically with topological complexity.
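The cycle-level signal these methods compete on can be illustrated with scalar conversion factors (a toy sketch using exact rationals, not the benchmark's actual restriction maps): each edge applies a unit conversion, every pairwise handoff looks locally valid, and the inconsistency appears only in the loop composite.

```python
from fractions import Fraction as F

# Hypothetical conversions on a 3-tool cycle: A→B scales dollars to
# cents, B→C scales cents back to dollars, C→A should be the identity.
edges_ok  = [F(100), F(1, 100), F(1)]        # consistent loop
edges_bad = [F(100), F(1, 100), F(1, 100)]   # C→A silently divides again

def holonomy(cycle):
    # Compose the conversions around the loop. A consistent cycle
    # composes to the identity (product exactly 1); any deviation
    # is a convention mismatch invisible to pairwise checks.
    prod = F(1)
    for factor in cycle:
        prod *= factor
    return prod

print(holonomy(edges_ok))   # 1
print(holonomy(edges_bad))  # 1/100: a 100x error no pairwise test sees
```

Exact rational arithmetic (as in the standalone diagnostic tool) matters here: floating-point products around long cycles can drift from 1 for numerical rather than semantic reasons.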
Frontier LLM Baselines (Standard Prompting)
Standard prompting produces negative R² across all five frontier models — worse than predicting the mean. The Oracle condition above decomposes why.
| Model | Provider | R² | Spearman ρ | N |
|---|---|---|---|---|
| Claude Sonnet 4 | Anthropic | −134 | 0.12 | 50 |
| GPT-4o | OpenAI | −130 | 0.05 | 50 |
| Codex 5.2 | OpenAI | −134 | −0.13 | 18 |
| Claude Opus 4 | Anthropic | −133 | −0.35 | 20 |
| Gemini 3.1 Pro | Google | −0.6 | −0.11 | 42 |
Live API calls via OpenRouter. All five models achieve negative R². Anthropic and OpenAI models overpredict (~0.7); Gemini underpredicts (0.0). See the Oracle condition above for the decomposition of this failure.
Real-MCP Results — Protocol-Green, Semantics-Red
Actual MCP servers communicating via stdio transport. All 44 protocol checks pass. Semantic failures persist. This is the benchmark’s operational core.
| Track | Servers | Sheaf R² | Best Conv. R² | K=1 Repair | K=8 Repair |
|---|---|---|---|---|---|
| Bronze+ (Calendar) | 3 custom + 1 official | 0.940 | 0.610 | +11.3% | +60.1% |
| Silver (Invoice) | 2 custom + 2 non-house | 0.861 | 0.734 | +11.2% | +83.5% |
Bronze+: 3 custom FastMCP servers + official Memory reference server. Silver: 2 custom + MarkItDown Docker MCP + official Memory server. Protocol checks: 44/44 pass.
Localization (Track B)
Can the method identify which edge carries the critical mismatch? On real-MCP families, the structural method finds the worst edge 77–87% of the time at P@1.
Macro-averaged Precision@K across 7 families.
| Method | P@1 | P@3 | P@5 |
|---|---|---|---|
| structural_sheaf | 0.61 | 0.58 | 0.48 |
| cycle_weighted | 0.43 | 0.49 | 0.50 |
| edge_distance | 0.27 | 0.34 | 0.37 |
| cycle_plain | 0.00 | 0.00 | 0.00 |
On real-MCP families, the structural method identifies the worst edge at P@1 = 0.77–0.87, outperforming simple baselines by 43%.
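Precision@K here is computed in the usual way: of the top K edges in the method's severity ranking, the fraction that are ground-truth critical edges. A minimal scorer with hypothetical edge names:

```python
def precision_at_k(ranked_edges, critical_edges, k):
    # Fraction of the method's top-k ranked edges that are
    # actually in the ground-truth critical set.
    top = ranked_edges[:k]
    return sum(e in critical_edges for e in top) / k

ranked   = ["A->B", "C->D", "B->C", "D->A"]  # method's worst-first order
critical = {"A->B", "B->C"}                  # ground-truth worst edges

print(precision_at_k(ranked, critical, 1))  # 1.0
print(precision_at_k(ranked, critical, 3))  # 2/3
```

Macro-averaging then means computing P@K per family and averaging the per-family scores, so small families weigh the same as large ones.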
Budgeted Repair (Track C)
The repair advantage is strongest at low budgets, where selecting the wrong first repair is most costly. At K=1, structural prescriptions reduce failure 41–51% more than the best alternative. At K=8, methods converge. In practice, teams rarely have budget to rework every interface in a composition. First-repair quality is the practical test of any diagnostic.
| Track | Method | K=1 | K=3 | K=5 | K=8 |
|---|---|---|---|---|---|
| Bronze+ | structural_sheaf | +11.3% | +29.2% | +42.0% | +60.1% |
| Bronze+ | cycle_plain | +8.0% | +23.8% | +39.0% | +55.0% |
| Silver | structural_sheaf | +11.2% | +44.3% | +69.3% | +83.5% |
| Silver | cycle_plain | +7.7% | +41.9% | +67.6% | +82.4% |
Convergence at high budget is expected: when you can fix every interface, sequencing is irrelevant. The real operating point is K=1–3 — you ship a patch window, not a rewrite. There, first-repair quality determines whether the hotfix resolves the incident or merely reshuffles the failure. The structural diagnostic's 41–51% advantage at K=1 is a triage advantage at the only budget that matters in production.
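The frontier itself is simple to score. A toy sketch (edge names and contribution numbers are hypothetical): repair the top-K edges in the method's predicted order and measure the fraction of true failure mass removed.

```python
def repair_gain(predicted_rank, true_contrib, k):
    # Fraction of total failure mass removed by repairing the
    # top-k edges in the method's predicted order.
    fixed = predicted_rank[:k]
    total = sum(true_contrib.values())
    removed = sum(true_contrib[e] for e in fixed)
    return removed / total

# Hypothetical per-edge contributions to end-to-end failure.
true_contrib = {"fee_base": 60.0, "tz_anchor": 25.0,
                "day_count": 10.0, "rounding": 5.0}
good_order = ["fee_base", "tz_anchor", "day_count", "rounding"]
bad_order  = ["rounding", "day_count", "tz_anchor", "fee_base"]

print(repair_gain(good_order, true_contrib, 1))  # 0.6: right first fix
print(repair_gain(bad_order, true_contrib, 1))   # 0.05: wasted patch window
print(repair_gain(good_order, true_contrib, 4))  # 1.0: budgets converge
```

The K=4 line shows why high-budget convergence is mechanical: with the full budget, every ordering removes everything, and only the low-K prefix distinguishes methods.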
Reproduce and Submit
Run the Diagnostic Tool
The structural diagnostic that achieves 100% recall in Experiment 3 is available as a standalone tool. Single dependency (PyYAML), exact rational arithmetic, SARIF output for GitHub Code Scanning integration. Source: github.com/jkomkov/bulla
Start — Feel It, Then Score It
Run the telephone-game demo, inspect a real instance, then score JSONL predictions on the public split (Track A/B/C — same metrics as evaluate-method).
Reproduce the Benchmark in 60 Seconds
Runs in approximately 9 seconds on a modern laptop. The output is a deterministic JSON file you can diff against the canonical results in results_canonical/. No API keys, no Docker, no configuration.
For the real-MCP tracks:
Submit a Challenger Method
Implement the DiagnosticMethod interface:
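The authoritative interface definition lives in the repository; the sketch below only illustrates the general shape a challenger might take (every name except DiagnosticMethod is hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class Diagnosis:
    # Hypothetical result record spanning the three tracks.
    severity: float                                     # Track A
    ranked_edges: list = field(default_factory=list)    # Track B, worst-first
    repair_plan: list = field(default_factory=list)     # Track C, fix order

class DiagnosticMethod:
    def diagnose(self, instance: dict) -> Diagnosis:
        raise NotImplementedError

class EdgeCountBaseline(DiagnosticMethod):
    # Deliberately trivial challenger: severity proportional to edge
    # count, edges ranked and repaired in declaration order.
    def diagnose(self, instance: dict) -> Diagnosis:
        edges = instance.get("edges", [])
        return Diagnosis(
            severity=float(len(edges)),
            ranked_edges=list(edges),
            repair_plan=list(edges),
        )

result = EdgeCountBaseline().diagnose({"edges": ["A->B", "B->C", "C->A"]})
print(result.severity)  # 3.0
```

Note the rules below: the `instance` dict may expose graph topology, conventions, and field schemas, but never ground-truth severity, and the method must be deterministic and offline.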
This produces a schema-compliant record with Track A (R², Spearman ρ), Track B (P@1/3/5), and Track C (repair frontier at K=1,2,3,5,8). Submit your method code plus the public-split artifact for maintainer review and hidden-split evaluation.
The adversarial track is for methods or instances that beat the reference diagnostic, break it honestly, or show the benchmark advantage is an artifact. Accepted submissions ship in future benchmark versions; v0.1 leaderboard tables stay frozen for comparability.
Rules
1. Graph information only. Use any information from the composition graph — topology, conventions, field schemas. No ground truth access during diagnosis.
2. Full public split. Report on the complete dev or test split. No cherry-picked subsets.
3. All three tracks. Submit Track A, B, and C.
4. Deterministic. Reproducible results. Document wall-clock time per instance.
5. Offline hidden eval. No network, no APIs, no remote services during official evaluation.
6. Hidden split is the official ranking. Public splits are for development and reproducibility.
7. Frozen budgets. Repair levels K = 1, 2, 3, 5, 8. These cannot be redefined for v0.1.
Current Limits and Open Lanes
BABEL is intended as a serious benchmark, not a closed theory of interoperability.
Benchmark limitations
What BABEL does not yet capture.
1. Lossy Tier 3 abstractions. Real API and MCP compositions involve richer failure modes than the current six-dimensional convention projections capture. Idempotency semantics, pagination cursors, and callback signing conventions are still outside scope.
2. Structurally related ground truth. Parts of the benchmark evaluate predictive consistency against a ground truth that shares mathematical roots with the diagnostic. That circularity is acknowledged and is why the live measured-loss validation matters disproportionately as external evidence.
Evidence limitations
What has not been externally validated enough.
1. Measured-loss scope expanding. The live-pipeline validation now spans two workflow families (Invoice: 75 runs, 25 configs; Calendar: 192 runs, 64 configs) including a genuinely cyclic 4-server composition. Independent replication remains the decisive next step.
2. No external reruns yet. All results to date are first-party. The benchmark is designed for outside replication, but no independent group has yet run the full evaluation.
Open lanes
What others could productively build.
1. Stronger non-structural challengers. Graph neural methods, spectral approaches, combinatorial optimization, and tool-augmented LLM strategies are all plausible contenders that have not yet been tested.
2. Broader live validations. More workflow families, more MCP/API convention surfaces, and measured downstream harm beyond dollar loss (latency, misfires, compliance violations).
3. Extended convention coverage. Richer projections beyond the current six dimensions to capture idempotency, pagination, callback signing, and authentication convention mismatches.
How to engage
These are not disclaimers. They define the current frontier of the benchmark.