Technical Spine Omnibus

Reading Guide

Eight papers, one argument — from impossibility theorem to public benchmark

Papers

| Name | Role | Part |
|------|------|------|
| SCPI | Formal hinge: the impossibility theorem | I |
| Bridge | Empirical witness: the obstruction appears in concrete systems | II |
| Seam | Protocol consequence: what a serious response requires | III |
| SHEAF | Frontier extension: enriched cohomology and scaling theory | IV |
| Coherence Cliff | Scaling evidence: regime change as composition grows | IV |
| BABEL | Public benchmark: 932 instances, 7 families, 3 tracks | V |
| Interpretability Frontier | Representational boundary: edge-local interpretability is not enough | VI |


This volume can be read in three different ways.

The first is architectural: the logical structure of the argument.

  1. SCPI for the formal hinge: bilateral checks cannot see cycles.
  2. Bridge for the empirical witness: LLMs cannot see them either.
  3. Seam for the protocol consequence: and you can prove they didn't see them.
  4. SHEAF for the frontier extension: and here is what resolution costs.
  5. The Coherence Cliff and Communication Bottleneck: and the problem grows worse at scale.
  6. BABEL, Bronze+, and Silver: and here is how to measure all of it.
  7. The Interpretability Frontier: and even the best per-component tools cannot substitute for structural diagnosis.

The second is evidence-first: start with what you can run, then ask why it works.

  1. BABEL (Part V) for the benchmark: 932 instances, 7 families, 3 tracks. Five frontier LLMs fail. The oracle CoT experiment decomposes why.
  2. The Coherence Cliff (Part IV) for the scaling evidence: the regime change where bounded-depth testing collapses.
  3. The Interpretability Frontier (Part VI) for the representational boundary: mechanistic interpretability at its best still cannot diagnose cyclic compositional failure.
  4. SCPI (Part I) for the formal foundation: the obstruction is topological and the proof is machine-checked.
  5. Seam (Part III) for the protocol response: what to build once the obstruction is granted.

The third is evidentiary: inspect the primary objects directly.

  1. Read the editorial and part notes first.
  2. Read the included papers as facsimiles of the primary technical objects.
  3. Use the appendices to inspect proof status, experimental design, protocol artifacts, and frontier notes without losing the hierarchy of the main volume.

Two distinctions apply throughout:

  • Settled core vs frontier.
  • Paper material vs appendix burden.

The first three parts constitute the hard center of the present program. Part IV is included because it is the visible perimeter of the same argument, not because it is closed to the same degree. Within Part IV, the Linear Communication Bottleneck Theorem appears as a separate facsimile after the SHEAF paper: the first result harvested from the frontier that has passed an autonomy test and can be read independently of the surrounding program. The Coherence Cliff follows as a scaling experiment that provides the program's current large-scale empirical evidence for the necessity of sheaf-cohomological diagnostics: a regime change in which the predictive gap between the best sheaf diagnostic and the best conventional baseline nearly triples as compositions grow from 5 to 50 agents.

Part V carries the greatest external artifact weight in the omnibus. It presents BABEL, the public benchmark (932 instances across 7 workflow families and 3 provenance tiers) that operationalizes the central claim into a public instrument with three evaluation tracks: failure prediction, failure localization, and budgeted repair. Bronze+ is a mixed-provenance MCP composition in which an official reference server participates in a workflow that is protocol-green and semantics-red. Silver extends to invoice/settlement with two non-house servers (MarkItDown + Memory). The structural diagnostic achieves R² ≥ 0.86 across all seven families. Five frontier LLMs fail to rank compositions by severity (ρ near zero); an oracle CoT experiment across all five models isolates the bottleneck as arithmetic reasoning rather than information extraction: Claude Sonnet 4 achieves ρ = 0.80 with oracle matrices but R² = -4.0, and Opus 4 shows the largest information-extraction delta (Δρ = +0.70). The live-pipeline validation (BABEL Section 6.6) provides the first fully non-circular result: structural holonomy correlates with measured dollar error from the actual MCP server pipeline (ρ = 0.795, p < 7.4 × 10⁻⁷).
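The failure-prediction track above scores a diagnostic by how well it ranks compositions by severity, i.e. by Spearman rank correlation between predicted and measured error. As a minimal sketch of the metric only (the severities below are made-up illustrative values, not BABEL data), assuming no tied ranks:

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation for tie-free data:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical predicted holonomy scores vs. measured dollar error
# for five compositions (illustrative values only).
predicted = [0.1, 0.4, 0.2, 0.9, 0.7]
measured = [1.0, 3.5, 2.1, 9.8, 6.4]
print(spearman_rho(predicted, measured))  # ranks agree exactly -> 1.0
```

A diagnostic that ranks compositions in the same order as the measured error gets ρ = 1.0 regardless of calibration; "ρ near zero" means the model's ordering is no better than chance.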

Part VI is the capstone. The Interpretability Frontier paper tests whether mechanistic interpretability, the most powerful per-component diagnostic technology available, can substitute for structural diagnosis on cyclic compositional failure. Across 240+ compositions, three domains, four scales, two model architectures (GPT-2 Small and Gemma 2 2B), and six interpretability baseline families (including a cycle-oracle aggregator with explicit knowledge of graph topology), the answer is no. The structural diagnostic achieves ρ = 1.0 in every condition; the best interpretability baseline never exceeds ρ = 0.758. The decisive result: probing classifiers achieve 99.8% accuracy at every edge yet carry zero global signal, and cycle-oracle aggregation adds nothing over edge-local averaging. Cross-model replication on Gemma 2 2B (a 20× larger model from a different architecture family) shows the gap widens rather than narrows with model scale. The gap is representational, not aggregational. This completes the argument: the formal theory predicts local blindness, the benchmark measures it, and the interpretability paper shows that even the richest local tools inherit it, across architectures.
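The probing result has a simple parity intuition: if failure is determined by the holonomy (the product of signs) around a cycle, each edge's sign can be read off perfectly in isolation while carrying no information about the global product. A toy construction illustrating that intuition (an assumption-laden sketch, not the paper's actual experiment):

```python
import random

random.seed(0)
N_EDGES, N = 6, 20_000  # one 6-edge cycle, many sampled sign configurations

def holonomy_fails(edges):
    """Global failure label: the product of edge signs is negative,
    i.e. an odd number of edges flip sign around the cycle."""
    prod = 1
    for e in edges:
        prod *= e
    return prod < 0

# Condition the global label on a single edge's value. An edge-local
# probe can read that edge perfectly, yet the edge tells us nothing
# about the cycle: both conditional failure rates sit near 0.5.
by_edge0 = {1: [], -1: []}
for _ in range(N):
    edges = [random.choice([-1, 1]) for _ in range(N_EDGES)]
    by_edge0[edges[0]].append(holonomy_fails(edges))

p_fail_pos = sum(by_edge0[1]) / len(by_edge0[1])
p_fail_neg = sum(by_edge0[-1]) / len(by_edge0[-1])
print(p_fail_pos, p_fail_neg)  # both near 0.5: zero global signal per edge
```

The full set of edges determines the label exactly (the product is the holonomy), so the information exists only in the joint configuration. This is the shape of the claim that 99.8%-accurate edge probes can still carry zero global signal, and why averaging per-edge predictions cannot recover it.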