Interpretability Frontier

Can Mechanistic Interpretability Substitute for Structural Diagnosis?

Design-only. Paper design and hypotheses complete. Cross-model replication planned (GPT-2 Small, Gemma 2 2B). No empirical results.
Falsification

Falsified if probing classifiers recover a compositional signal, not just edge-local accuracy.

Abstract

This paper tests whether mechanistic interpretability methods (sparse autoencoders (SAEs), probing, circuit tracing) can substitute for structural diagnosis of cyclic compositional failure.