Some personal experiments around routing tokens to different autoregressive attention, akin to mixture-of-experts
Stars: 118
Rank: 241166