Mechanistic Interpretability (MI) is a subfield of AI alignment and safety research that aims to reverse-engineer neural networks: uncovering the actual algorithms and circuits a model has learned in order to understand its internal computational mechanisms.