Mechanistic Interpretability (MI) is a subfield of AI alignment and safety research focused on reverse-engineering neural networks: uncovering the actual algorithms and circuits they learn in order to understand their internal computational mechanisms.