Interpretability Frontier

Can Mechanistic Interpretability Substitute for Structural Diagnosis?

Demonstrated: 240+ compositions, with cross-model replication on GPT-2 Small and Gemma 2 2B. Structural ρ=1.0 is true by construction (the diagnostic computes the ground-truth holonomy); the result is the gap relative to interpretability baselines (best ρ=0.758), replicated on the 20× larger Gemma 2 2B.

Key result

Edge-local interpretability is provably incomplete for cyclic compositional failure. The gap: the best interpretability baseline never exceeds ρ=0.758, while the structural diagnostic scores ρ=1.0 by construction (it computes the ground-truth holonomy).
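
A toy sketch of the intuition behind the key result, not the diagnostic used in the experiments: it assumes each edge of a cycle is a 2D rotation and takes "holonomy" to mean the deviation from identity of the maps composed around the cycle. The names `edge_local_scores` and `holonomy`, the rotation model, and the two example cycles are all hypothetical. Two cycles are built whose edges look identical when each edge is scored in isolation, yet only one of them composes back to the identity.

```python
import math

IDENTITY = ((1.0, 0.0), (0.0, 1.0))

def rotation(theta):
    """2x2 rotation matrix (hypothetical stand-in for an edge map)."""
    c, s = math.cos(theta), math.sin(theta)
    return ((c, -s), (s, c))

def matmul(a, b):
    """Product of two 2x2 matrices stored as nested tuples."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

def deviation(m):
    """Frobenius distance between a 2x2 matrix and the identity."""
    return math.sqrt(
        sum((m[i][j] - IDENTITY[i][j]) ** 2 for i in range(2) for j in range(2))
    )

def edge_local_scores(angles):
    """Score every edge in isolation; the order around the cycle is ignored."""
    return sorted(round(deviation(rotation(t)), 12) for t in angles)

def holonomy(angles):
    """Compose the edge maps around the whole cycle and measure the mismatch."""
    total = IDENTITY
    for t in angles:
        total = matmul(rotation(t), total)
    return deviation(total)

# Assumed example cycles: same per-edge magnitudes, different composition.
consistent_cycle = [0.3, -0.3, 0.3, -0.3]  # rotations cancel around the loop
twisted_cycle = [0.3, 0.3, 0.3, 0.3]       # rotations accumulate around the loop

if __name__ == "__main__":
    print("edge-local scores equal:",
          edge_local_scores(consistent_cycle) == edge_local_scores(twisted_cycle))
    print("holonomy (consistent):", holonomy(consistent_cycle))
    print("holonomy (twisted):", holonomy(twisted_cycle))
```

In this toy, any score built only from the per-edge deviations cannot separate the two cycles, while the composed quantity does. This illustrates the shape of the argument only; it makes no claim about how the SAE, probing, or circuit-tracing baselines were actually scored.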

Falsification

The claim would be falsified by probing classifiers that achieve compositional signal, not just edge-local accuracy.

Formal anchors

A10 Witnessed Sameness; A13 The Sheaf Condition; A30 Scoped Equivalence; A32 The Third Mode

Abstract

This work tests whether mechanistic interpretability methods (sparse autoencoders, probing, circuit tracing) can substitute for structural diagnosis on cyclic compositional failure.

Resources