Interpretability Frontier

Can Mechanistic Interpretability Substitute for Structural Diagnosis?

Demonstrated
240+ compositions; cross-model replication on GPT-2 Small and Gemma 2 2B. Structural diagnostic ρ=1.0; best interpretability baseline ρ=0.758.
Key result

Edge-local interpretability is provably incomplete for cyclic compositional failure: the structural diagnostic achieves ρ=1.0 in every condition, while the best interpretability baseline never exceeds ρ=0.758.
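The intuition behind edge-local incompleteness can be shown with a toy sketch. This is not the study's diagnostic or any of its baselines: the function names `edge_local_check` and `has_cycle`, the graph encoding, and the notion of a "locally valid" edge are all assumptions made up for this example. It shows only that a check which inspects each edge in isolation can return the same answer for a chain and for a cycle, whereas a check over the whole composition tells them apart.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # hypothetical encoding: (source component, target component)


def edge_local_check(edges: List[Edge]) -> bool:
    """Inspect every edge in isolation (assumed rule: endpoints must differ)."""
    return all(src != dst for src, dst in edges)


def has_cycle(edges: List[Edge]) -> bool:
    """Structural check: depth-first search for a directed cycle."""
    graph: Dict[str, List[str]] = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    UNSEEN, ACTIVE, DONE = 0, 1, 2
    state = {node: UNSEEN for node in graph}

    def visit(node: str) -> bool:
        state[node] = ACTIVE
        for nxt in graph[node]:
            if state[nxt] == ACTIVE:  # back edge closes a cycle
                return True
            if state[nxt] == UNSEEN and visit(nxt):
                return True
        state[node] = DONE
        return False

    return any(state[node] == UNSEEN and visit(node) for node in graph)


chain = [("A", "B"), ("B", "C"), ("C", "D")]  # acyclic composition
cycle = [("A", "B"), ("B", "C"), ("C", "A")]  # cyclic composition

print(edge_local_check(chain), edge_local_check(cycle))
print(has_cycle(chain), has_cycle(cycle))
```

In this toy setting every edge passes the local check in both compositions, so no function of the per-edge verdicts alone can separate them; only the check that looks at how the edges compose does.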

Falsification

The result would be falsified by probing classifiers that achieve compositional signal, not just edge-local accuracy.

Abstract

This study tests whether mechanistic interpretability methods (SAEs, probing, circuit tracing) can substitute for structural diagnosis of cyclic compositional failure.