Research Program

Claim Ledger

Every major claim in the research program, with its current status, scope, and the conditions under which it would be wrong. A research program that does not subject itself to its own critique expects belief, not evaluation.

11 settled2 conditional5 empirical4 demonstrated1 conjectural23 claims total
Bulla Core

The coherence fee, witness Gram matrix, and the structural theorems that make the diagnostic instrument work.

Additive fee decomposition over semantic dimensions

Settled

The coherence fee decomposes additively over named semantic dimensions: fee(G) = Σ feeσ(G), exactly, under the Dimension-Field Disjointness hypothesis. The per-dimension fee profile is a vector-valued severity invariant whose ℓ¹ norm is the total fee.

Verified on 1,596 compositions (full pairwise closure of 57-manifest registry) and 703 real-schema compositions. One-line matroid-direct-sum proof under DFD.

What would falsify this

A counterexample composition in which two distinct dimension names share a (tool, field) reference and the additive decomposition fails.

Column-matroid backbone theorem

Settled

For every Bulla composition G, fee(G) = corank of the observable column set in the linear matroid on the columns of the full coboundary matrix. The fee is a matroid corank, not merely a rank difference.

Structural identity verified on 703 compositions. Seven load-bearing corollaries cover repair filtrations, leverage conservation, the rank-H¹ identity, and termination bounds.

What would falsify this

A kernel-level change to the coboundary construction that breaks the column-compression identity. Not possible under the current implementation.

Rank-H¹ cohomological gap identity

Settled

The fee equals the difference of cokernel dimensions: fee(G) = dim H¹(δ_obs) − dim H¹(δ_full). Proven both on paper (two-line rank-nullity argument) and in Lean 4 against Mathlib, with no sorry and standard axioms only.

Requires column-compression hypothesis and finite-dimensional ℚ-coefficients. Both hold for every Bulla composition under the current coboundary construction.

What would falsify this

A future Mathlib API change that invalidates the proof, or a refactor that breaks the shared-codomain hypothesis. Vanishingly unlikely given standard-axiom dependency.

Witness Gram rank identity and leverage conservation

Settled

rank K(G) = fee(G) on all compositions, and Σ ℓⱼ = fee(G) (leverage conservation). Verified in exact rational arithmetic on 703 compositions with zero violations.

Lean-formalized: witness_gram_rank_eq_fee and leverage_conservation theorems, both Aristotle-closed.

What would falsify this

Any corpus composition where rank K(G) ≠ fee(G) or leverage sums diverge from the fee.

Kron reduction under DFD

Conditional

Under Dimension-Field Disjointness, the witness Gram block factors as the Schur complement of the dimension-carrier-graph Laplacian with observable vertices grounded. The leverage prediction matches observed values exactly on 240 non-trivial compositions.

Empirically 240/240 under DFD. Lean formalization of the Kron-reduction theorem deferred.

What would falsify this

A corpus composition where the Schur-complement leverage prediction fails to match the computed leverage, under DFD.

Unilateral-bilateral fee decomposition

Empirical

The pairwise fee admits an exact decomposition: fee(A, B) = U(A) + U(B) + B(A, B), where U(X) is a per-server invariant computable from X’s schema alone. On 703 compositions, 96.4% have B = 0.

703-composition real-schema corpus. 7 of 38 servers are unilateral carriers. Algorithmic consequence: O(N) precomputation replaces O(N²) full sweep.

What would falsify this

Any fee(A, B) < U(A) + U(B), or bilateral corrections growing unboundedly, or carrier ratio outside [5%, 50%] on a second corpus.

SCPI (Sheaf Constraints on Predicate Invention)

The formal spine: predicate invention under sheaf constraints decomposes into distinct topological, conservativity, and definability gates.

Three-gate decomposition of predicate invention

Settled

Predicate invention under sheaf constraints decomposes into distinct topological, conservativity, and definability gates. The obstruction to gluing is borne by an H¹-class under the paper’s descent setup.

Finite-site / spine regime. Stated hypotheses still matter at the perimeter. Lean formalized in SCPI/Torsor.lean and SCPI/Conservativity.lean.

What would falsify this

A counterexample in the stated spine regime where the gates collapse or fail to separate diagnostically.

Schema discovery functorial universality

Conditional

Schema discovery admits the stronger functorial and universality surface sketched in the paper and Lean statements.

Depends on SchemaDiscovery.lean closure and stronger categorical proof burden.

What would falsify this

Failure to prove the stated adjunction, or a counterexample to the proposed universal behavior.

Bridge (Bilateral Validity ≠ Global Coherence)

The empirical foundation: bilateral validity can coexist with cycle failure in real compositions.

Bilateral-pass / cycle-fail coexistence

Empirical

Bilateral validity can coexist with cycle failure in real compositions. The obstruction dimension predicts the minimum typed bridge burden required for repair.

Strongest in the A05 minimality witness and nearby cases. Two live non-isomorphic β₁=2 families with selective repair.

What would falsify this

Expanded benchmarks show bilateral-pass / cycle-fail cases are rare, unstable, or reducible to ordinary bilateral errors.

Fee-aware procurement reduces regret

Empirical

Among functionally matched candidates that all pass local validation, fee-aware qualification reduces downstream procurement regret under externally anchored holdout cases, without using post-reveal outcomes as input to selection.

Finance consequence artifact: six holdout cases across four consequence types. Fee-aware rule reduces regret from 57.23 to 33.96, outperforming local baseline and naive checklist rival. Still synthetic in construction.

What would falsify this

The fee-aware rule fails to improve regret against either baseline under a materially different externally anchored holdout regime.

Lean Formalizations

Machine-verified theorems bridging the paper-math to Lean 4 against Mathlib.

Algorithm 1 completeness and soundness (TCP paper)

Settled

Spanning-tree BFS diagnostic for H¹ vanishing, formalized in Lean 4. The corrected theorem requires a spanning hypothesis (every edge’s endpoints are tree-connected). The pre-correction statement was falsified by an explicit counterexample.

About SpanningTreeDecomp (sub-class with h_span hypothesis), not TreeDecomp in general. Closed against Lean v4.28.0 + Mathlib.

What would falsify this

The corrected alg1_complete is shown to have a counterexample (vanishingly unlikely given 60 lines of Lean proof with explicit axiom dependency).

Tree-characterization theorem for H¹ vanishing

Settled

For a connected graph satisfying nV − 1 ≤ nE, the condition nE − (nV − 1) = 0 iff nE + 1 = nV. The v1 statement was correctly negated by the proof assistant with an explicit counterexample; the corrected statement adds the spanning hypothesis.

What would falsify this

A future Mathlib API change that breaks the omega tactic used in the proof.

BullaCorollary6 rank-H¹ identity in Lean

Settled

Three theorems formalizing the cohomological gap identity for Bulla compositions: fee_h1_additive (unconditional), fee_eq_h1_difference (under column compression), and range_obs_le_range_full (supporting lemma). All closed with no sorry and standard axioms only.

What would falsify this

The Lean solution file fails re-verification against a future Mathlib snapshot, or the standard-axiom dependency check fails (vanishingly unlikely).

Witness Gram rank and leverage conservation in Lean

Settled

The regime-independent rank identity rank K(G) = fee(G) and the trace-of-idempotent = rank theorem (foundation of leverage conservation), both formalized and closed.

What would falsify this

A Mathlib API change to the lemma names used in the proofs.

BABEL Benchmark

A public, reproducible benchmark demonstrating that frontier LLMs cannot rank compositions by failure severity.

Frontier LLMs fail to rank compositions by failure severity

Demonstrated

Five frontier LLMs (GPT-4o, Claude Sonnet 4, Opus 4, Codex 5.2, Gemini 3.1 Pro) achieve negative R² and near-zero Spearman ρ on composition-failure ranking under standard and guided-CoT prompting.

5 models, 3 providers, N=18–50 per model. 932 instances, 7 families, 3 provenance tiers.

What would falsify this

A frontier LLM achieves positive R² on BABEL under a reasonable prompting setup, or a tool-augmented LLM matches the structural diagnostic.

Oracle CoT decomposes failure into informational and arithmetic components

Demonstrated

Pre-computed restriction matrices with explicit arithmetic instructions at N=50 across all 5 models. Claude achieves ρ = 0.80, GPT-4o ρ = 0.39, while R² remains deeply negative for all.

What would falsify this

A model achieves both positive R² and high ρ under oracle conditions, closing the arithmetic gap.

Structural repair dominates sheaf-free alternatives at constrained budget

Demonstrated

At repair budget K=1, structural prescriptions achieve 11–29% improvement vs. 8–26% for alternatives. At K=3, structural leads on every family. Methods converge only at K=8.

932 instances, deterministic symbolic executor.

What would falsify this

A simpler method matches or exceeds structural repair at K=1–3 across all 7 families.

Local-Global Obstruction

The impossibility theorem: no bounded-depth audit can detect global incoherence above a critical composition size.

Local validity does not compose

Settled

For every depth bound d, there exist arbitrarily large compositions that are locally d-valid yet globally incoherent. The separation is spectral: above a critical composition size, the connection Laplacian’s spectral gap closes and bounded probes become indistinguishable from null.

Theorem A (graph separation), Theorem B (abstract version). Empirical spectral experiment on 500 graphs across 7 scales confirms the spectral indistinguishability finding.

What would falsify this

A bounded-depth testing procedure that maintains positive R² as composition size grows past n = 50.

Compositional Hiddenness

LLM access to the non-local hiddenness predicate is representation-conditioned, not a stable competence.

Hiddenness is a graph-relative predicate

Settled

Whether a hidden field is a blind spot depends on the composition partner, not the field alone. Constructive proof: the same file_reader tool has fee=1 (blind spot on path) when composed with data_loader, but fee=0 (no blind spot) when composed with event_logger. Bulla computes this exactly from schemas.

Theorems 1–2 with constructive proof on 8-composition synthetic ecology. Ground truth independently validated: 1.000 precision, 0.985 recall on 20-composition stratified sample (573 predictions, 0 false positives, 9 missed homonym collisions).

What would falsify this

A tool whose blind-spot status is determinable from its schema alone, without knowledge of the composition partner.

LLM access to hiddenness is representation-conditioned

Empirical

Frontier LLMs recover the non-local hiddenness predicate with 88% accuracy under structured relational prompts but collapse to 46% under flat representations, reverting to lexical heuristics with timestamp as the dominant false positive. Two independent factors gate the effect: relational framing (+33pp) and convention-specific task language (+33pp). Server names and tool descriptions contribute negligibly (−8pp).

Synthetic ecology: 8 compositions × 3 conditions × 3 repeats (Claude Sonnet 4). Context ablation: 4 conditions × 12 filesystem compositions. April 2026.

What would falsify this

A flat prompt format achieving comparable accuracy to the structured prompt, or a model maintaining exact computation across all prompt conditions without relational framing.

Cross-model divergence on identical compositions

Empirical

On the same synthetic ecology benchmark, Claude Sonnet 4 achieves higher precision (55% vs 17%) but loses recall under flat prompts (50%), while GPT-4o maintains 100% recall but produces 3× more false positives. Different training distributions produce different precision/robustness tradeoffs on the same non-local predicate. Neither model achieves exact computation.

Synthetic ecology cross-model replication: 8 compositions × 3 conditions × 3 repeats per model. April 2026.

What would falsify this

A frontier LLM achieving exact match scores comparable to Bulla (100%) across all prompt conditions and composition graphs.

Frontier

Open conjectures and design-stage research with nontrivial evidence but no closure.

H¹-style obstruction diagnostics for heterogeneous agent agreement

Conjectural

The settled core (coherence fee, matroid backbone, SCPI gates) extends to heterogeneous agent agreement via enriched and distributed claims. Enriched Laplacian-cohomology bridge and quantale-enriched H¹ are real frontier targets.

What would falsify this

The frontier examples fail to preserve the obstruction logic even in the implemented abelian or benchmarked surfaces.

Mechanistic interpretability is incomplete for cyclic failure prediction

Demonstrated

Edge-local mechanistic interpretability (SAE, probing, CKA, attention) cannot detect cyclic compositional failures that require global topological information. Structural diagnostic achieves ρ=1.0 in every condition; best interpretability baseline never exceeds ρ=0.758. Cross-model replication on GPT-2 Small and Gemma 2 2B.

240+ compositions, two open-weight models (GPT-2 Small, Gemma 2 2B). April 2026.

What would falsify this

Probing classifiers achieving compositional signal (not just edge-local accuracy), or the gap vanishing after giving pairwise methods mild graph aggregation.

About this ledger

Status classes follow the program's epistemic hierarchy: Settled means strongly supported by proof, formalization, or repeated artifact evidence. Conditional means supported under explicit assumptions that remain load-bearing. Empirical means supported by experiment or benchmark but not yet elevated to theorem-level generality. Demonstrated means shown in a public benchmark with reproducible protocol. Conjectural means a real research target with nontrivial evidence but no closure. The full internal ledger with SHA-256 artifact hashes, Lean proof references, and detailed audit trails is maintained in the repository.