Implementation of an attention layer where each head can attend to more than one token, using coordinate descent to pick the top-k
Stars: 46
Rank: 459477
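To make the description concrete, here is a minimal sketch of one common way to relax top-k selection with coordinate descent. It is an illustration under stated assumptions, not the repository's code: the function name `soft_topk_coordinate_descent`, the parameters `n_iters` and `eps`, and the specific alternating update are all hypothetical choices made for this example. The idea is to alternate between two variables: a shared offset `a`, chosen so the exponentiated scores sum to `k`, and a per-element correction `b`, chosen so that no weight exceeds 1.

```python
import math


def soft_topk_coordinate_descent(scores, k, n_iters=50, eps=0.1):
    """Return soft selection weights in [0, 1] for a list of scores.

    Illustrative sketch only (assumed formulation, not a library API).
    Weights are exp((s + a + b) / eps), where `a` is a shared offset and
    `b` is a per-element correction; the two are updated in alternation.
    """
    log_k = math.log(k)
    b = [-s for s in scores]  # initial correction cancels every score
    a = 0.0

    for _ in range(n_iters):
        # Update a: with b fixed, make the weights sum to k.
        z = [(s + bi) / eps for s, bi in zip(scores, b)]
        z_max = max(z)
        log_sum_exp = z_max + math.log(sum(math.exp(v - z_max) for v in z))
        a = eps * (log_k - log_sum_exp)

        # Update b: with a fixed, cap every weight at 1 (exponent <= 0).
        b = [-max(s + a, 0.0) for s in scores]

    return [math.exp((s + a + bi) / eps) for s, bi in zip(scores, b)]


# Hypothetical usage: softly select about 2 of 5 attention scores.
weights = soft_topk_coordinate_descent([2.0, -1.0, 0.5, 3.0, -2.0], k=2)
```

Each weight lies in [0, 1] and is non-decreasing in its score, so higher-scoring tokens are never weighted less than lower-scoring ones. Once the iteration has converged, the weights sum to approximately `k`; a smaller `eps` pushes them closer to a hard 0/1 selection, at the cost of sharper gradients.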