Interpretability Frontier
Can Mechanistic Interpretability Substitute for Structural Diagnosis?
Key result
Edge-local interpretability is provably incomplete for cyclic compositional failure. The best interpretability baseline never exceeds ρ=0.758, while the structural diagnostic scores ρ=1.0 by construction, because it computes the ground-truth holonomy.
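As a toy illustration of why per-edge checks can miss a cyclic failure, the sketch below assumes each edge of a 3-cycle carries a 2D rotation and treats holonomy as the composition of the edge maps around the cycle. The rotation model and the helper names `is_orthogonal` and `holonomy_defect` are assumptions made for illustration only; they are not the diagnostic or the baselines evaluated here.

```python
import math

IDENTITY = ((1.0, 0.0), (0.0, 1.0))

def rotation(theta):
    """Return a 2x2 rotation matrix (hypothetical edge map) as nested tuples."""
    c, s = math.cos(theta), math.sin(theta)
    return ((c, -s), (s, c))

def matmul(a, b):
    """Multiply two 2x2 matrices."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

def is_orthogonal(m, tol=1e-9):
    """Edge-local check: the map is a valid isometry (m^T m == I)."""
    product = matmul(tuple(zip(*m)), m)
    return all(
        abs(product[i][j] - IDENTITY[i][j]) < tol
        for i in range(2) for j in range(2)
    )

def holonomy_defect(edge_maps):
    """Compose the maps around the cycle; return Frobenius distance from I."""
    total = IDENTITY
    for m in edge_maps:
        total = matmul(m, total)
    return math.sqrt(sum(
        (total[i][j] - IDENTITY[i][j]) ** 2
        for i in range(2) for j in range(2)
    ))

# Angles sum to 2*pi: the cycle closes up.
consistent = [rotation(2 * math.pi / 3)] * 3
# Angles sum to 3*pi/2: every edge is still a valid rotation, but the cycle does not close.
inconsistent = [rotation(math.pi / 2)] * 3

for name, cycle in [("consistent", consistent), ("inconsistent", inconsistent)]:
    edges_ok = all(is_orthogonal(m) for m in cycle)
    print(name, "edges_ok:", edges_ok, "defect:", round(holonomy_defect(cycle), 6))
```

In this toy setting both cycles pass every edge-local check, and only the composed map around the loop separates them, which is the kind of signal a purely per-edge method cannot see.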
Falsification
The claim would be falsified by probing classifiers that achieve compositional signal, not just edge-local accuracy.
Abstract
Tests whether mechanistic interpretability methods (sparse autoencoders, probing classifiers, circuit tracing) can substitute for structural diagnosis of cyclic compositional failure.