BABEL
Benchmark for Autonomous Bridge Assessment and Localization Evaluation
As agent compositions scale, a new failure mode emerges: every component passes local validation — schema checks, type checks, pairwise tests — but the end-to-end output is semantically wrong. Settlement dates shift by one business day. Escalations fire thirty minutes early. Fees are computed on the wrong base. The error is systematic, subtle, and invisible to current evaluation methods.
Start with live validation and the Oracle condition below. The remaining sections explain the benchmark surface, results, and how to reproduce or challenge it.
Live Validation Against Measured Loss
Live-pipeline validation: structural holonomy and convention distance both correlate with measured dollar loss at ρ = 0.423 (p < 1.5 × 10⁻⁴, N = 75). Holonomy explains more linear variance (R² = 0.119 vs 0.070). On a cyclic 4-server calendar pipeline (β₁ ≥ 3, N = 192), convention mismatches produce 7.5× higher pipeline inconsistency.
Invoice: 5 clusters, N = 75, p < 1.5 × 10⁻⁴. Convention distance achieves the same rank correlation. Calendar: 4 servers, 64 configs, N = 192. Distance is stronger on the composite (ρ = 0.39); holonomy captures temporal error that distance misses (ρ = 0.14 vs 0.02). Replication pack available on request.
The central empirical question for any benchmark of this kind is whether the severity signal tracks real downstream error in live compositions — not just internal benchmark structure.
In the live-pipeline validation, parameterized convention assignments were run through actual MCP server pipelines and the resulting dollar errors were measured directly. Matched conventions produced zero error; mismatched conventions produced errors from 0.99× to 9,999×. The structural diagnostic predicted the severity ordering across these configurations.
A second cyclic validation on a 4-server calendar composition (β₁ ≥ 3 independent cycles, N = 192) confirms convention mismatches produce 7.5× higher pipeline inconsistency in cyclic compositions. Holonomy captures temporal-error structure that simple convention distance misses (ρ = 0.14, p = 0.048 vs ρ = 0.02, p = 0.81).
This is the strongest current evidence that the benchmark’s severity signal connects to measured operational harm across both acyclic and cyclic compositions. Broader validation across more workflows and independent replication remain the decisive next step.
What the Oracle Condition Shows
When cycle structure and restriction matrices are provided, rank-order judgment improves substantially for some models. Calibrated severity prediction remains weak for all. The failure decomposes into structure recovery and magnitude computation.
Five frontier models across three providers were evaluated under standard prompting and all achieved negative R² — worse than predicting the mean. That headline is easy to misread as a generic claim that language models fail at this task.
The Oracle condition changes the interpretation. When pre-computed cycle structures and restriction matrices are supplied, several models show substantially improved ranking performance while magnitude prediction remains broken. This suggests the failure is not monolithic: part of the problem is recovering the relevant compositional structure from the workflow description, and part is reasoning accurately about the severity induced by that structure.
| Model | Standard ρ | Oracle ρ | Δρ | Oracle R² |
|---|---|---|---|---|
| Claude Sonnet 4 | 0.12 | 0.80 | +0.68 | −4.0 |
| GPT-4o | 0.05 | 0.39 | +0.34 | −5.3 |
| Opus 4 | −0.35 | 0.35 | +0.70 | −0.42 |
| Codex 5.2 | −0.13 | 0.26 | +0.39 | −35.7 |
| Gemini 3.1 Pro | −0.11 | 0.04 | +0.15 | −129 |
All models evaluated at N=50 under Oracle condition. Standard condition: N=18–50 depending on cost. Codex 5.2 failed to produce parseable output 64% of the time under Oracle prompting.
The asymmetry is the finding: structure helps ranking, but calibration still breaks. That distinction matters for any downstream system that needs not just ordering but magnitude.
What BABEL Measures
BABEL targets the failure mode that survives all local checks: schema validation passes, type checking passes, pairwise contract testing passes, but the composed output is wrong because latent convention mismatches accumulate around cycles in the composition graph. Settlement date interpreted as initiation versus arrival. Fee computed on gross versus net. Timezone anchored to UTC versus organizer-local. ACT/365 versus 30/360 day-count conventions. Each bilateral handoff looks correct; the error closes only around the full loop.
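The loop-closure failure is easy to state in code. The sketch below is illustrative, not part of the benchmark: two hypothetical services that each pass a local schema check but disagree on whether a fee is computed on the gross or the net base.

```python
# Illustrative only -- hypothetical services, not benchmark code.
# Each handoff passes its local check; the error appears only
# in the composed result.

def billing_service(amount_gross: float, tax: float) -> dict:
    # Emits a well-typed payload; "amount" here means NET of tax.
    return {"amount": amount_gross - tax, "currency": "USD"}

def fee_service(payload: dict) -> float:
    # Local schema check passes: "amount" is a float in USD.
    assert isinstance(payload["amount"], float)
    # But this service assumes "amount" is GROSS of tax.
    return payload["amount"] * 0.029  # 2.9% fee

gross, tax = 100.0, 8.0
fee = fee_service(billing_service(gross, tax))
expected = gross * 0.029  # the fee the workflow intended
print(fee, expected)      # fee on the wrong base: ~2.67 instead of ~2.90
```

No type checker or pairwise contract test flags this: both services are individually correct under their own convention. The mismatch is a property of the composition.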
The benchmark evaluates three capabilities that matter in practice: severity prediction (Track A), mismatch localization (Track B), and budgeted repair (Track C).
Instances span three provenance tiers: controlled synthetic families (Tier 1), workflow-shaped and real-MCP families with actual servers communicating via stdio (Tier 2), and API-derived convention surfaces grounded in Stripe, Twilio, GitHub, Shopify, Slack, Plaid, QuickBooks, and others (Tier 3). The benchmark is designed to be beaten.
Leaderboard and Results
Severity Prediction (Track A)
The severity table shows where topological complexity separates methods. Simpler baselines approach the structural diagnostic on uniform instances but collapse on heterogeneous families.
R² with ground-truth holonomy per family. Dev split (514 instances). Higher is better.
| # | Method | Type | Synth | Invoice | Calendar | Policy | External |
|---|---|---|---|---|---|---|---|
| 1 | structural_sheaf | Reference | 0.993 | 0.978 | 0.958 | 0.979 | 1.000 |
| 2 | cycle_frustration_plain | Sheaf-free | 0.965 | 0.342 | 0.603 | 0.419 | 0.621 |
| 3 | weighted_graph_incon. | Sheaf-free | 0.783 | 0.120 | 0.554 | 0.216 | 0.616 |
| 4 | bounded_depth_8 | Conventional | 0.173 | 0.050 | 0.438 | 0.126 | 0.610 |
| 5 | graph_topology | Conventional | 0.152 | 0.127 | 0.251 | 0.000 | 0.625 |
Simpler cycle-aware methods approach the structural diagnostic on uniform synthetic instances (0.965 vs 0.993) but collapse on heterogeneous families (0.006–0.734 vs 0.86+). The gap emerges specifically with topological complexity.
Frontier LLM Baselines (Standard Prompting)
Standard prompting produces negative R² across all five frontier models — worse than predicting the mean. The Oracle condition above decomposes why.
| Model | Provider | R² | Spearman ρ | N |
|---|---|---|---|---|
| Claude Sonnet 4 | Anthropic | −134 | 0.12 | 50 |
| GPT-4o | OpenAI | −130 | 0.05 | 50 |
| Codex 5.2 | OpenAI | −134 | −0.13 | 18 |
| Claude Opus 4 | Anthropic | −133 | −0.35 | 20 |
| Gemini 3.1 Pro | Google | −0.6 | −0.11 | 42 |
Live API calls via OpenRouter. All five models achieve negative R². Anthropic and OpenAI models overpredict (~0.7); Gemini underpredicts (0.0). See the Oracle condition above for the decomposition of this failure.
Real-MCP Results — Protocol-Green, Semantics-Red
Actual MCP servers communicating via stdio transport. All 44 protocol checks pass. Semantic failures persist. This is the benchmark’s operational core.
| Track | Servers | Sheaf R² | Best Conv. R² | K=1 Repair | K=8 Repair |
|---|---|---|---|---|---|
| Bronze+ (Calendar) | 3 custom + 1 official | 0.940 | 0.610 | +11.3% | +60.1% |
| Silver (Invoice) | 2 custom + 2 non-house | 0.861 | 0.734 | +11.2% | +83.5% |
Bronze+: 3 custom FastMCP servers + official Memory reference server. Silver: 2 custom + MarkItDown Docker MCP + official Memory server. Protocol checks: 44/44 pass.
Localization (Track B)
Can the method identify which edge carries the critical mismatch? On real-MCP families, the structural method finds the worst edge 77–87% of the time at P@1.
Macro-averaged Precision@K across 7 families.
| Method | P@1 | P@3 | P@5 |
|---|---|---|---|
| structural_sheaf | 0.61 | 0.58 | 0.48 |
| cycle_weighted | 0.43 | 0.49 | 0.50 |
| edge_distance | 0.27 | 0.34 | 0.37 |
| cycle_plain | 0.00 | 0.00 | 0.00 |
On real-MCP families, the structural method identifies the worst edge at P@1 = 0.77–0.87, outperforming simple baselines by 43%.
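For concreteness, Precision@K can be read as the fraction of a method's top-K ranked edges that are truly critical, macro-averaged over families. The sketch below follows that reading; the edge names are invented and the exact scoring details of Track B may differ.

```python
# Hedged sketch of a Precision@K computation (assumed semantics).

def precision_at_k(ranked_edges, critical_edges, k):
    """Fraction of the top-k ranked edges that are truly critical."""
    top = ranked_edges[:k]
    return sum(e in critical_edges for e in top) / k

# One family: a method ranks edges worst-first by predicted severity.
ranked = ["billing->ledger", "ledger->report", "cal->notify"]
critical = {"billing->ledger"}

p1 = precision_at_k(ranked, critical, 1)  # 1.0  (worst edge found first)
p3 = precision_at_k(ranked, critical, 3)  # ~0.33
print(p1, p3)
```

Macro-averaging then means computing these per family and averaging the per-family scores, so small families count as much as large ones.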
Budgeted Repair (Track C)
The repair advantage is strongest at low budgets, where selecting the wrong first repair is most costly. At K=1, structural prescriptions reduce failure 41–51% more than the best alternative. At K=8, methods converge. In practice, teams rarely have budget to rework every interface in a composition. First-repair quality is the practical test of any diagnostic.
| Track | Method | K=1 | K=3 | K=5 | K=8 |
|---|---|---|---|---|---|
| Bronze+ | structural_sheaf | +11.3% | +29.2% | +42.0% | +60.1% |
| | cycle_plain | +8.0% | +23.8% | +39.0% | +55.0% |
| Silver | structural_sheaf | +11.2% | +44.3% | +69.3% | +83.5% |
| | cycle_plain | +7.7% | +41.9% | +67.6% | +82.4% |
That convergence at high budget is presented honestly: the structural advantage is a triage advantage, not a universal superiority claim. It matters when budget is scarce.
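One way to read the repair frontier, sketched under assumed semantics (edge names and severities here are invented): repair the top-K edges a diagnostic ranks worst, then measure what fraction of total failure those repairs remove. A diagnostic with better first-repair quality climbs the frontier faster at small K.

```python
# Hedged sketch of a budgeted-repair frontier (assumed protocol).

def repair_frontier(predicted, true_severity, budgets):
    """Failure reduction (%) after repairing the top-K predicted edges."""
    order = sorted(predicted, key=predicted.get, reverse=True)
    total = sum(true_severity.values())
    frontier = {}
    for k in budgets:
        fixed = sum(true_severity[e] for e in order[:k])
        frontier[k] = round(100 * fixed / total, 1)
    return frontier

# Invented example: the diagnostic happens to rank the truly worst
# edge first, so the K=1 repair removes half of all failure.
pred      = {"e1": 0.9, "e2": 0.5, "e3": 0.4, "e4": 0.1}
truth_sev = {"e1": 5.0, "e2": 3.0, "e3": 1.0, "e4": 1.0}
print(repair_frontier(pred, truth_sev, [1, 3]))  # {1: 50.0, 3: 90.0}
```

At large K every ordering eventually repairs the same edges, which is why the methods converge at K=8 in the table above.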
Reproduce and Submit
Reproduce in 60 Seconds
Runs in approximately 9 seconds on a modern laptop. The output is a deterministic JSON file you can diff against the canonical results in results_canonical/. No API keys, no Docker, no configuration.
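Because the output is deterministic JSON, verification reduces to an equality check against the canonical artifact. The sketch below illustrates the idea with stand-in files; the actual paths and schema under results_canonical/ may differ.

```python
# Illustrative verification step with stand-in files (paths and keys
# are assumptions, not the repo's actual layout).
import json
import pathlib
import tempfile

tmp = pathlib.Path(tempfile.mkdtemp())

# Stand-ins for the run's output and the canonical artifact.
payload = json.dumps({"track_a": {"r2": 0.993}}, sort_keys=True)
(tmp / "run.json").write_text(payload)
(tmp / "canonical.json").write_text(payload)

run = json.loads((tmp / "run.json").read_text())
canon = json.loads((tmp / "canonical.json").read_text())
print("match" if run == canon else "drift")  # match
```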
For the real-MCP tracks:
Submit a Challenger Method
Implement the DiagnosticMethod interface:
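A minimal sketch of what an implementation could look like. Only the DiagnosticMethod name comes from the text; the Diagnosis fields, the method signature, and the trivial uniform-severity strategy are assumptions for illustration, not the actual interface.

```python
# Hypothetical sketch of a challenger method (names and signatures
# are assumptions; consult the actual DiagnosticMethod interface).
from dataclasses import dataclass

@dataclass
class Diagnosis:
    severity: float          # Track A: predicted holonomy severity
    ranked_edges: list       # Track B: worst-first edge ranking
    repairs: list            # Track C: edges to fix, in budget order

class UniformBaseline:
    """Trivial challenger: every edge is equally suspect."""

    def diagnose(self, graph: dict) -> Diagnosis:
        # Sort for determinism, per the reproducibility rule.
        edges = sorted(graph["edges"])
        return Diagnosis(severity=1.0, ranked_edges=edges, repairs=edges)

toy = {"edges": ["cal->notify", "billing->ledger"]}
d = UniformBaseline().diagnose(toy)
print(d.ranked_edges[0])  # billing->ledger
```

A real challenger would replace the uniform scores with something computed from the graph (topology, conventions, field schemas), which is exactly the information Rule 1 permits.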
This produces a schema-compliant record with Track A (R², Spearman ρ), Track B (P@1/3/5), and Track C (repair frontier at K=1,2,3,5,8). Submit your method code plus the public-split artifact for maintainer review and hidden-split evaluation.
Rules
1. Graph information only. Use any information from the composition graph — topology, conventions, field schemas. No ground truth access during diagnosis.
2. Full public split. Report on the complete dev or test split. No cherry-picked subsets.
3. All three tracks. Submit Track A, B, and C.
4. Deterministic. Reproducible results. Document wall-clock time per instance.
5. Offline hidden eval. No network, no APIs, no remote services during official evaluation.
6. Hidden split is the official ranking. Public splits are for development and reproducibility.
7. Frozen budgets. Repair levels K = 1, 2, 3, 5, 8. These cannot be redefined for v0.1.
Current Limits and Open Lanes
BABEL is intended as a serious benchmark, not a closed theory of interoperability.
Benchmark limitations
What BABEL does not yet capture.
1. Lossy Tier 3 abstractions. Real API and MCP compositions involve richer failure modes than the current six-dimensional convention projections capture. Idempotency semantics, pagination cursors, and callback signing conventions are still outside scope.
2. Structurally related ground truth. Parts of the benchmark evaluate predictive consistency against a ground truth that shares mathematical roots with the diagnostic. That circularity is acknowledged and is why the live measured-loss validation matters disproportionately as external evidence.
Evidence limitations
What has not been externally validated enough.
1. Measured-loss scope expanding. The live-pipeline validation now spans two workflow families (Invoice: 75 runs, 25 configs; Calendar: 192 runs, 64 configs) including a genuinely cyclic 4-server composition. Independent replication remains the decisive next step.
2. No external reruns yet. All results to date are first-party. The benchmark is designed for outside replication, but no independent group has yet run the full evaluation.
Open lanes
What others could productively build.
1. Stronger non-structural challengers. Graph neural methods, spectral approaches, combinatorial optimization, and tool-augmented LLM strategies are all plausible contenders that have not yet been tested.
2. Broader live validations. More workflow families, more MCP/API convention surfaces, and measured downstream harm beyond dollar loss (latency, misfires, compliance violations).
3. Extended convention coverage. Richer projections beyond the current six dimensions to capture idempotency, pagination, callback signing, and authentication convention mismatches.
How to engage
These are not disclaimers. They define the current frontier of the benchmark.