Interpretability Frontier

Can Mechanistic Interpretability Substitute for Structural Diagnosis?

Demonstrated: 240+ compositions, with cross-model replication on GPT-2 Small and Gemma 2 2B. The structural ρ=1.0 is true by construction (the diagnostic computes the ground-truth holonomy); the result is the gap relative to the interpretability baselines (best ρ=0.758), which replicates on the 20× larger Gemma 2 2B.
Key result

Edge-local interpretability is provably incomplete for cyclic compositional failure. The result is the gap: the best interpretability baseline never exceeds ρ=0.758, while the structural diagnostic scores ρ=1.0 by construction, because it computes the ground-truth holonomy.
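To make "ρ=1.0 by construction" concrete, here is a minimal sketch, assuming ρ denotes Spearman rank correlation between a method's per-composition score and the ground-truth holonomy. All numbers are hypothetical placeholders, not the study's data, and the baseline value they produce is not meant to reproduce 0.758. The helper names `rankdata` and `spearman_rho` are illustrative only.

```python
import numpy as np

def rankdata(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        mask = values == v
        ranks[mask] = ranks[mask].mean()
    return ranks

def spearman_rho(a, b):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(rankdata(a), rankdata(b))[0, 1])

# Hypothetical ground-truth holonomy defects for six compositions.
ground_truth = np.array([0.02, 0.91, 0.35, 0.60, 0.13, 0.77])

# A diagnostic that computes the holonomy itself returns the same values,
# so its rank correlation with the ground truth is 1 up to rounding.
structural = ground_truth.copy()

# Hypothetical baseline scores that mis-rank some compositions.
baseline = np.array([0.10, 0.70, 0.65, 0.30, 0.05, 0.90])

rho_structural = spearman_rho(structural, ground_truth)
rho_baseline = spearman_rho(baseline, ground_truth)
print(rho_structural, rho_baseline)
```

Under these assumptions the structural score carries no independent evidence of its own; the informative quantity is how far the baseline falls below it.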

Falsification

The claim would be falsified by probing classifiers that achieve compositional signal, not just edge-local accuracy.

Abstract

This study tests whether mechanistic interpretability methods (sparse autoencoders, probing, circuit tracing) can substitute for structural diagnosis on cyclic compositional failure.
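The toy sketch below illustrates the kind of failure at issue. It is not the study's diagnostic. It assumes that each edge of a cycle carries a linear map, that the holonomy is the product of those maps around the cycle, and that an "edge-local" score inspects each edge in isolation. The names `holonomy_defect` and `edge_local_score` are hypothetical. Two cycles are built from the same four rotations in different orders, so every per-edge statistic matches while the holonomy does not.

```python
import numpy as np

def rot_x(t):
    """Rotation by angle t about the x-axis."""
    c, s = np.cos(t), np.sin(t)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_y(t):
    """Rotation by angle t about the y-axis."""
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def holonomy_defect(edge_maps):
    """Frobenius distance between the composed cycle map and the identity."""
    total = np.eye(3)
    for m in edge_maps:
        total = m @ total  # apply edges in traversal order
    return float(np.linalg.norm(total - np.eye(3)))

def edge_local_score(edge_maps):
    """Per-edge rotation angles, sorted: a summary that ignores edge order."""
    angles = [np.arccos(np.clip((np.trace(m) - 1.0) / 2.0, -1.0, 1.0))
              for m in edge_maps]
    return sorted(round(float(a), 9) for a in angles)

theta = 0.5
# Same four edge maps, different arrangement around the cycle.
cycle_a = [rot_x(theta), rot_y(theta), rot_x(-theta), rot_y(-theta)]
cycle_b = [rot_x(theta), rot_x(-theta), rot_y(theta), rot_y(-theta)]

print(edge_local_score(cycle_a) == edge_local_score(cycle_b))
print(holonomy_defect(cycle_a), holonomy_defect(cycle_b))
```

In `cycle_b` adjacent rotations cancel and the cycle closes; in `cycle_a` the rotations do not commute, so the composed map differs from the identity even though each edge looks the same in isolation.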