An unofficial implementation of σReparam from the paper "Stabilizing Transformer Training by Preventing Attention Entropy Collapse".
Stars: 4
Rank: 2284145
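
For readers unfamiliar with the technique, here is a minimal sketch of the idea behind σReparam: a weight matrix W is replaced by (γ / σ(W)) · W, where σ(W) is the spectral norm (largest singular value) of W and γ is a scalar that would be learned during training. This is not the repository's code. The function names `spectral_norm` and `sigma_reparam`, the plain-list matrix representation, and the fixed iteration count are all assumptions made for illustration, and the sketch assumes W is non-zero.

```python
import math
import random


def spectral_norm(W, n_iters=50, seed=0):
    """Estimate the largest singular value of W (a list of rows) by power iteration.

    Hypothetical helper for illustration; assumes W is a non-empty, non-zero matrix.
    """
    rng = random.Random(seed)
    rows, cols = len(W), len(W[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(cols)]
    u = [0.0] * rows
    for _ in range(n_iters):
        # u <- W v, normalized to unit length
        u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        norm_u = math.sqrt(sum(x * x for x in u)) or 1.0
        u = [x / norm_u for x in u]
        # v <- W^T u, normalized to unit length
        v = [sum(W[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm_v = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm_v for x in v]
    # sigma is approximately u^T W v once u and v have converged
    return sum(u[i] * W[i][j] * v[j] for i in range(rows) for j in range(cols))


def sigma_reparam(W, gamma=1.0, n_iters=50):
    """Return (gamma / sigma(W)) * W, so the result has spectral norm close to gamma."""
    sigma = spectral_norm(W, n_iters=n_iters)
    return [[gamma * w / sigma for w in row] for row in W]


if __name__ == "__main__":
    W = [[3.0, 0.0], [0.0, 1.0]]
    W_hat = sigma_reparam(W, gamma=2.0)
    print(spectral_norm(W), spectral_norm(W_hat))
```

In a real training setup, γ would be a trainable parameter and the power-iteration vectors would typically be kept between steps rather than recomputed from scratch; this sketch recomputes them on every call to stay self-contained.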