BABEL

Benchmark for Autonomous Bridge Assessment and Localization Evaluation

As agent compositions scale, a new failure mode emerges: every component passes local validation — schema checks, type checks, pairwise tests — but the end-to-end output is semantically wrong. Settlement dates shift by one business day. Escalations fire thirty minutes early. Fees are computed on the wrong base. The error is systematic, subtle, and invisible to current evaluation methods.

932 instances
7 workflow families
3 provenance tiers
3 evaluation tracks

Start with live validation and the Oracle condition below. The remaining sections explain the benchmark surface, results, and how to reproduce or challenge it.

Live Validation Against Measured Loss

Key result

Live-pipeline validation: structural holonomy and convention distance both correlate with measured dollar loss at ρ = 0.423 (p < 1.5 × 10⁻⁴, N = 75). Holonomy explains more linear variance (R² = 0.119 vs 0.070). On a cyclic 4-server calendar pipeline (β₁ ≥ 3, N = 192), convention mismatches produce 7.5× higher pipeline inconsistency.

Invoice: 5 clusters, N = 75, p < 1.5 × 10⁻⁴; convention distance achieves the same rank correlation. Calendar: 4 servers, 64 configs, N = 192; distance is stronger on the composite metric (ρ = 0.39), while holonomy captures temporal error that distance misses (ρ = 0.14 vs 0.02). Replication pack available on request.

The central empirical question for any benchmark of this kind is whether the severity signal tracks real downstream error in live compositions — not just internal benchmark structure.

In the live-pipeline validation, parameterized convention assignments were run through actual MCP server pipelines and the resulting dollar errors were measured directly. Matched conventions produced zero error; mismatched conventions produced errors from 0.99× to 9,999×. The structural diagnostic predicted the severity ordering across these configurations.
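
The mechanism behind those multiplicative errors can be made concrete with a toy sketch (assumed names and conventions, not the benchmark's harness): a producer that emits amounts in cents feeding a consumer that reads dollars yields a 99× error on the amount, while matched conventions yield exactly zero.

```python
# Hypothetical sketch of one handoff with a latent unit-convention mismatch.
# None of these names come from the benchmark; the point is the shape of the
# error: zero when conventions match, multiplicative when they do not.

CONVENTIONS = {"cents": 100, "dollars": 1}  # units per dollar

def emit_amount(dollars: float, convention: str) -> float:
    """Producer serializes an amount under its own unit convention."""
    return dollars * CONVENTIONS[convention]

def read_amount(raw: float, convention: str) -> float:
    """Consumer parses the raw number under *its* unit convention."""
    return raw / CONVENTIONS[convention]

def pipeline_error(true_dollars: float, producer: str, consumer: str) -> float:
    """Absolute dollar error after one handoff with possibly mismatched units."""
    seen = read_amount(emit_amount(true_dollars, producer), consumer)
    return abs(seen - true_dollars)

# Matched conventions: zero error. Mismatched: 99x the true amount.
assert pipeline_error(250.0, "cents", "cents") == 0.0
assert pipeline_error(250.0, "cents", "dollars") == 99 * 250.0
```

Both values pass schema and type checks in isolation; only the end-to-end comparison exposes the loss.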

A second cyclic validation on a 4-server calendar composition (β₁ ≥ 3 independent cycles, N = 192) confirms convention mismatches produce 7.5× higher pipeline inconsistency in cyclic compositions. Holonomy captures temporal-error structure that simple convention distance misses (ρ = 0.14, p = 0.048 vs ρ = 0.02, p = 0.81).

This is the strongest current evidence that the benchmark’s severity signal connects to measured operational harm across both acyclic and cyclic compositions. Broader validation across more workflows and independent replication remain the decisive next step.
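
The validation statistic itself is ordinary: Spearman rank correlation between the diagnostic's severity scores and the measured dollar loss. A dependency-free sketch on toy data (illustrative values, not the N = 75 runs):

```python
# Spearman rank correlation from scratch: Pearson correlation of rank vectors,
# with average ranks for ties. Toy data below is made up for illustration.

def ranks(xs):
    """1-based ranks, averaging ranks over tied values."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg_rank
        i = j + 1
    return r

def spearman(xs, ys):
    """Pearson correlation computed on the rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

severity = [0.1, 0.4, 0.2, 0.9, 0.7]     # a diagnostic's scores
loss = [1.0, 120.0, 5.0, 9999.0, 800.0]  # measured dollar loss
assert abs(spearman(severity, loss) - 1.0) < 1e-9  # perfectly monotone toy data
```

In practice `scipy.stats.spearmanr` does the same computation and also reports the p-value.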

What the Oracle Condition Shows

Key finding

When cycle structure and restriction matrices are provided, rank-order judgment improves substantially for some models. Calibrated severity prediction remains weak for all. The failure decomposes into structure recovery and magnitude computation.

Five frontier models across three providers were evaluated under standard prompting and all achieved negative R² — worse than predicting the mean. That headline is easy to misread as a generic claim that language models fail at this task.

The Oracle condition changes the interpretation. When pre-computed cycle structures and restriction matrices are supplied, several models show substantially improved ranking performance while magnitude prediction remains broken. This suggests the failure is not monolithic: part of the problem is recovering the relevant compositional structure from the workflow description, and part is reasoning accurately about the severity induced by that structure.

Model | Standard ρ | Oracle ρ | Δρ | Oracle R²
Claude Sonnet 4 | 0.12 | 0.80 | +0.68 | −4.0
GPT-4o | 0.05 | 0.39 | +0.34 | −5.3
Opus 4 | −0.35 | 0.35 | +0.70 | −0.42
Codex 5.2 | −0.13 | 0.26 | +0.39 | −35.7
Gemini 3.1 Pro | −0.11 | 0.04 | +0.15 | −129

All models evaluated at N=50 under Oracle condition. Standard condition: N=18–50 depending on cost. Codex 5.2 failed to produce parseable output 64% of the time under Oracle prompting.

The asymmetry is the finding: structure helps ranking, but calibration still breaks. That distinction matters for any downstream system that needs not just ordering but magnitude.

What BABEL Measures

BABEL targets the failure mode that survives all local checks: schema validation passes, type checking passes, pairwise contract testing passes, but the composed output is wrong because latent convention mismatches accumulate around cycles in the composition graph. Settlement date interpreted as initiation versus arrival. Fee computed on gross versus net. Timezone anchored to UTC versus organizer-local. ACT/365 versus 30/360 day-count conventions. Each bilateral handoff looks correct; the error closes only around the full loop.
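
A minimal illustration of how the error closes only around the loop, with hypothetical services and conversions (none of this is benchmark code): each edge map is locally sensible and each handoff validates, yet the maps composed around the cycle are not the identity.

```python
# Hypothetical 3-service cycle. Each edge applies the unit conversion its
# endpoints believe they agreed on. Every bilateral handoff looks correct,
# but composing the maps around the full loop multiplies amounts by 100:
# a nonzero holonomy.

edge_maps = {
    ("billing", "ledger"): 100,        # billing sends dollars, ledger stores cents
    ("ledger", "reporting"): 1 / 100,  # reporting converts cents back to dollars
    ("reporting", "billing"): 100,     # reporting wrongly thinks billing wants cents
}

def loop_holonomy(cycle):
    """Net multiplier after composing edge maps around a closed cycle."""
    m = 1.0
    for src, dst in zip(cycle, cycle[1:] + cycle[:1]):
        m *= edge_maps[(src, dst)]
    return m

# A coherent cycle would compose to the identity (1.0); this one does not.
assert abs(loop_holonomy(["billing", "ledger", "reporting"]) - 100.0) < 1e-9
```

No pairwise test between adjacent services can see this: each of the three conversions is individually defensible, and the defect is a property of the cycle.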

The benchmark evaluates three capabilities that matter in practice:

Track A: Prediction. Estimate the overall severity of semantic composition failure. Scored by Spearman ρ and R² with ground-truth holonomy.
Track B: Localization. Identify which edge carries the critical convention mismatch. Scored by Precision@K for the highest-frustration edges.
Track C: Repair. Under a fixed repair budget (K = 1, 2, 3, 5, 8), prescribe which edges to fix first. Scored by failure reduction.
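
Track B's metric can be sketched in a few lines (illustrative names and data, not the benchmark's evaluator): Precision@K is the fraction of a method's top-K suspected edges that are truly among the worst.

```python
# Precision@K for edge localization. The edge tuples and ground truth below
# are made up for illustration; the benchmark supplies its own.

def precision_at_k(ranked_edges, true_worst, k):
    """ranked_edges: method's edges, worst-first. true_worst: ground-truth set."""
    hits = sum(1 for e in ranked_edges[:k] if e in set(true_worst))
    return hits / k

predicted = [("a", "b"), ("b", "c"), ("c", "a")]  # method's ranking, worst-first
truth = [("a", "b"), ("c", "a")]                  # ground-truth worst edges
assert precision_at_k(predicted, truth, 1) == 1.0
assert abs(precision_at_k(predicted, truth, 3) - 2 / 3) < 1e-12
```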

Instances span three provenance tiers: controlled synthetic families (Tier 1), workflow-shaped and real-MCP families with actual servers communicating via stdio (Tier 2), and API-derived convention surfaces grounded in Stripe, Twilio, GitHub, Shopify, Slack, Plaid, QuickBooks, and others (Tier 3). The benchmark is designed to be beaten.

Leaderboard and Results

Severity Prediction (Track A)

The severity table shows where topological complexity separates methods. Simpler baselines approach the structural diagnostic on uniform instances but collapse on heterogeneous families.

R² with ground-truth holonomy per family. Dev split (514 instances). Higher is better.

# | Method | Type | Synth | Invoice | Calendar | Policy | External
1 | structural_sheaf | Reference | 0.993 | 0.978 | 0.958 | 0.979 | 1.000
2 | cycle_frustration_plain | Sheaf-free | 0.965 | 0.342 | 0.603 | 0.419 | 0.621
3 | weighted_graph_incon. | Sheaf-free | 0.783 | 0.120 | 0.554 | 0.216 | 0.616
4 | bounded_depth_8 | Conventional | 0.173 | 0.050 | 0.438 | 0.126 | 0.610
5 | graph_topology | Conventional | 0.152 | 0.127 | 0.251 | 0.000 | 0.625

Simpler cycle-aware methods approach the structural diagnostic on uniform synthetic instances (0.965 vs 0.993) but collapse on heterogeneous families (0.006–0.734 vs 0.86+). The gap emerges specifically with topological complexity.

Frontier LLM Baselines (Standard Prompting)

Standard prompting produces negative R² across all five frontier models — worse than predicting the mean. The Oracle condition above decomposes why.

Model | Provider | R² | Spearman ρ | N
Claude Sonnet 4 | Anthropic | −134 | 0.12 | 50
GPT-4o | OpenAI | −130 | 0.05 | 50
Codex 5.2 | OpenAI | −134 | −0.13 | 18
Claude Opus 4 | Anthropic | −133 | −0.35 | 20
Gemini 3.1 Pro | Google | −0.6 | −0.11 | 42

Live API calls via OpenRouter. All five models achieve negative R². Anthropic and OpenAI models systematically overpredict severity (≈ 0.7); Gemini underpredicts (≈ 0.0). See the Oracle condition above for the decomposition of this failure.

Real-MCP Results — Protocol-Green, Semantics-Red

Actual MCP servers communicating via stdio transport. All 44 protocol checks pass. Semantic failures persist. This is the benchmark’s operational core.

Track | Servers | Sheaf R² | Best Conv. R² | K=1 Repair | K=8 Repair
Bronze+ (Calendar) | 3 custom + 1 official | 0.940 | 0.610 | +11.3% | +60.1%
Silver (Invoice) | 2 custom + 2 non-house | 0.861 | 0.734 | +11.2% | +83.5%

Bronze+: 3 custom FastMCP servers + official Memory reference server. Silver: 2 custom + MarkItDown Docker MCP + official Memory server. Protocol checks: 44/44 pass.

Localization (Track B)

Can the method identify which edge carries the critical mismatch? On real-MCP families, the structural method finds the worst edge 77–87% of the time at P@1.

Macro-averaged Precision@K across 7 families.

Method | P@1 | P@3 | P@5
structural_sheaf | 0.61 | 0.58 | 0.48
cycle_weighted | 0.43 | 0.49 | 0.50
edge_distance | 0.27 | 0.34 | 0.37
cycle_plain | 0.00 | 0.00 | 0.00

On real-MCP families, the structural method identifies the worst edge at P@1 = 0.77–0.87, outperforming simple baselines by 43%.

Budgeted Repair (Track C)

Operational result

The repair advantage is strongest at low budgets, where selecting the wrong first repair is most costly. At K=1, structural prescriptions reduce failure 41–51% more than the best alternative. At K=8, methods converge. In practice, teams rarely have budget to rework every interface in a composition. First-repair quality is the practical test of any diagnostic.

Track | Method | K=1 | K=3 | K=5 | K=8
Bronze+ | structural_sheaf | +11.3% | +29.2% | +42.0% | +60.1%
Bronze+ | cycle_plain | +8.0% | +23.8% | +39.0% | +55.0%
Silver | structural_sheaf | +11.2% | +44.3% | +69.3% | +83.5%
Silver | cycle_plain | +7.7% | +41.9% | +67.6% | +82.4%

That convergence at high budget is presented honestly: the structural advantage is a triage advantage, not a universal superiority claim. It matters when budget is scarce.
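
That budget dynamic can be reproduced with a toy model (made-up edge failure masses, not benchmark data): a good ranking wins decisively at K = 1, and the advantage vanishes once the budget covers every edge.

```python
# Toy model of budgeted repair: fix the top-K edges a method ranks worst,
# then measure the fraction of total failure mass removed. All numbers here
# are invented to show the shape of the effect.

def repair_gain(ranking, edge_failure_mass, k):
    """Fraction of total failure mass removed by repairing the top-k edges."""
    total = sum(edge_failure_mass.values())
    removed = sum(edge_failure_mass[e] for e in ranking[:k])
    return removed / total

mass = {"e1": 50.0, "e2": 30.0, "e3": 15.0, "e4": 5.0}
good = ["e1", "e2", "e3", "e4"]  # triages the worst edge first
bad = ["e4", "e3", "e2", "e1"]   # spends the first repair on the mildest edge

assert repair_gain(good, mass, 1) == 0.50  # strong first repair
assert repair_gain(bad, mass, 1) == 0.05   # weak first repair
assert repair_gain(good, mass, 4) == repair_gain(bad, mass, 4)  # full budget converges
```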

Reproduce and Submit

Reproduce in 60 Seconds

# From the benchmark directory:
pip install -e .
pytest tests/test_smoke.py -v
python -m coherence_gym evaluate --split dev --output-dir results_check

Runs in approximately 9 seconds on a modern laptop. The output is a deterministic JSON file you can diff against the canonical results in results_canonical/. No API keys, no Docker, no configuration.
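
If you prefer a structural comparison to a textual diff, a few lines of Python compare the two JSON files as data, so key order and whitespace never register as mismatches (the paths below are illustrative; point them at your output and results_canonical/):

```python
# Structural JSON comparison for the diff step: parse both files and compare
# the resulting objects. Paths are examples, not fixed benchmark locations.
import json
from pathlib import Path

def results_match(ours: str, canonical: str) -> bool:
    """True when the two JSON files contain identical data."""
    return json.loads(Path(ours).read_text()) == json.loads(Path(canonical).read_text())

# Example usage (adjust paths to your layout):
# results_match("results_check/results.json", "results_canonical/results.json")
```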

For the real-MCP tracks:

pip install -e ".[real-mcp]"
python run_bronze_plus.py  # Calendar (Bronze+)
python run_silver.py       # Invoice (Silver)

Submit a Challenger Method

Implement the DiagnosticMethod interface:

from coherence_gym.protocol import DiagnosticMethod, DiagnosticOutput

class YourMethod(DiagnosticMethod):
    @property
    def name(self) -> str:
        return "your_method_name"

    def diagnose(self, graph, budget=None) -> DiagnosticOutput:
        score = ...    # predicted severity (higher = worse)
        targets = ...  # [(src_id, dst_id, field_name)] repair list
        return DiagnosticOutput(
            instance_id="",
            failure_score=score,
            repair_targets=targets,
        )
python -m coherence_gym evaluate-method --method your_method --split dev

This produces a schema-compliant record with Track A (R², Spearman ρ), Track B (P@1/3/5), and Track C (repair frontier at K=1,2,3,5,8). Submit your method code plus the public-split artifact for maintainer review and hidden-split evaluation.
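
For orientation, here is a deliberately naive challenger heuristic, written standalone so it runs without the benchmark installed (all names and data below are illustrative, and the graph representation is assumed): score an instance by total pairwise convention distance and rank edges worst-first. Wrapping it in the DiagnosticMethod interface above is mechanical.

```python
# A naive challenger heuristic, self-contained for illustration. It ignores
# cycle structure entirely, which is exactly why methods like this collapse
# on heterogeneous families in the leaderboard above.

def convention_distance(a: dict, b: dict) -> int:
    """Number of convention fields on which two endpoints disagree."""
    return sum(1 for k in set(a) | set(b) if a.get(k) != b.get(k))

def diagnose(nodes, edges):
    """nodes: id -> convention dict; edges: [(src, dst, field)] handoffs.

    Returns (Track A severity score, Track B/C repair ranking, worst-first).
    """
    ranking = sorted(
        edges,
        key=lambda e: convention_distance(nodes[e[0]], nodes[e[1]]),
        reverse=True,
    )
    failure_score = sum(
        convention_distance(nodes[s], nodes[d]) for s, d, _ in edges
    )
    return failure_score, ranking

# Tiny hypothetical instance: two edges, each with one mismatched field.
nodes = {
    "billing": {"units": "cents", "tz": "UTC"},
    "ledger": {"units": "dollars", "tz": "UTC"},
    "report": {"units": "dollars", "tz": "local"},
}
edges = [("billing", "ledger", "amount"), ("ledger", "report", "date")]
score, ranking = diagnose(nodes, edges)
assert score == 2 and ranking[0] == ("billing", "ledger", "amount")
```

Under the rules below this would qualify as a legal submission (graph information only, deterministic, all three tracks), which makes it a reasonable floor to beat.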

Rules

  1. Graph information only. Use any information from the composition graph — topology, conventions, field schemas. No ground-truth access during diagnosis.
  2. Full public split. Report on the complete dev or test split. No cherry-picked subsets.
  3. All three tracks. Submit Track A, B, and C.
  4. Deterministic. Reproducible results. Document wall-clock time per instance.
  5. Offline hidden eval. No network, no APIs, no remote services during official evaluation.
  6. Hidden split is the official ranking. Public splits are for development and reproducibility.
  7. Frozen budgets. Repair levels K = 1, 2, 3, 5, 8. These cannot be redefined for v0.1.

Current Limits and Open Lanes

BABEL is intended as a serious benchmark, not a closed theory of interoperability.

Benchmark limitations

What BABEL does not yet capture.

  1. Lossy Tier 3 abstractions. Real API and MCP compositions involve richer failure modes than the current six-dimensional convention projections capture. Idempotency semantics, pagination cursors, and callback signing conventions are still outside scope.
  2. Structurally related ground truth. Parts of the benchmark evaluate predictive consistency against a ground truth that shares mathematical roots with the diagnostic. That circularity is acknowledged and is why the live measured-loss validation matters disproportionately as external evidence.

Evidence limitations

What has not been externally validated enough.

  1. Measured-loss scope expanding. The live-pipeline validation now spans two workflow families (Invoice: 75 runs, 25 configs; Calendar: 192 runs, 64 configs), including a genuinely cyclic 4-server composition. Independent replication remains the decisive next step.
  2. No external reruns yet. All results to date are first-party. The benchmark is designed for outside replication, but no independent group has yet run the full evaluation.

Open lanes

What others could productively build.

  1. Stronger non-structural challengers. Graph neural methods, spectral approaches, combinatorial optimization, and tool-augmented LLM strategies are all plausible contenders that have not yet been tested.
  2. Broader live validations. More workflow families, more MCP/API convention surfaces, and measured downstream harm beyond dollar loss (latency, misfires, compliance violations).
  3. Extended convention coverage. Richer projections beyond the current six dimensions to capture idempotency, pagination, callback-signing, and authentication convention mismatches.

How to engage

These are not disclaimers. They define the current frontier of the benchmark.