Empirical

BABEL

Benchmark for Autonomous Bridge Evaluation and Localization

Frozen

Public artifact (v0.1.0, 2026-03-24): 932 instances across 7 families and 3 provenance tiers, with a frozen evaluation protocol. Two live validations: 5-cluster invoice (N=75) and 4-server cyclic calendar (β₁≥3, N=192).

Key result

Live-pipeline results: on invoice (N=75), holonomy and distance both reach ρ=0.423. On the cyclic calendar (N=192, β₁≥3), there is a 7.5× PI separation, but on composite prediction, convention distance (ρ=0.39) outpredicts holonomy (ρ=0.14); holonomy captures only the temporal-error sub-structure. The same construction-before-measurement caveat applies in both cases: topological awareness is the key ingredient; the algebraic apparatus adds at most narrow, dimension-specific value.
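
To make the β₁≥3 condition concrete, the following is a minimal sketch that assumes β₁ denotes the first Betti number (cycle rank) of an undirected graph whose nodes are servers and whose edges are bridges between them. The function name `first_betti_number` and the two example topologies are hypothetical illustrations, not part of the BABEL artifact, and the benchmark's actual calendar topology is not specified here.

```python
def first_betti_number(num_vertices, edges):
    """Cycle rank of an undirected graph: |E| - |V| + (connected components)."""
    parent = list(range(num_vertices))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = num_vertices
    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            components -= 1
    return len(edges) - num_vertices + components


# Hypothetical topologies for 4 servers (illustrative only).
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
fully_connected = [(i, j) for i in range(4) for j in range(i + 1, 4)]

ring_beta1 = first_betti_number(4, ring)              # one independent cycle
full_beta1 = first_betti_number(4, fully_connected)   # 6 - 4 + 1 = 3
```

Under this reading, a plain ring of 4 servers has only one independent cycle, so a 4-server instance with β₁≥3 needs at least 6 bridges in a connected graph.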

Falsification

A submission achieving positive R² on Track A without using topological features
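
As a rough illustration of what "positive R²" means in this criterion, the sketch below assumes the standard coefficient of determination, in which a predictor that always outputs the mean of the targets scores exactly 0 and anything worse scores below 0. The function name `r_squared` and the sample values are hypothetical; the exact Track A scoring protocol is defined by the benchmark, not by this example.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot


# Hypothetical targets and three hypothetical submissions.
targets = [1.0, 2.0, 3.0, 4.0]
mean_baseline = [2.5, 2.5, 2.5, 2.5]   # always predicts the mean
reversed_guess = [4.0, 3.0, 2.0, 1.0]  # systematically wrong
close_guess = [1.1, 1.9, 3.2, 3.8]     # small errors

baseline_score = r_squared(targets, mean_baseline)   # equals 0
reversed_score = r_squared(targets, reversed_guess)  # negative
close_score = r_squared(targets, close_guess)        # positive, below 1
```

A score above 0 therefore means the submission beats the constant-mean predictor, which is the bar the falsification criterion sets for methods that use no topological features.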

Abstract

Compositional semantic failure — locally correct systems composing into globally wrong outcomes — has caused billions in documented losses across six domains. BABEL is the first public benchmark for this failure mode: 932 instances, three evaluation tracks, five frontier LLMs that cannot solve it.