Gitstar Ranking
microsoft / MInference
Fetched on 2025/03/18 16:26
[NeurIPS'24 Spotlight, ICLR'25] To speed up inference for long-context LLMs, MInference computes attention with approximate, dynamic sparsity, reducing pre-filling inference latency by up to 10x on an A100 while maintaining accuracy.
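The idea of dynamic sparse attention can be illustrated with a minimal top-k sketch: each query attends only to its highest-scoring keys instead of the full sequence. This is an illustrative sketch only, not MInference's actual algorithm (which uses learned sparse patterns and optimized kernels); the function name and `keep` parameter are assumptions.

```python
import numpy as np

def topk_sparse_attention(q, k, v, keep=4):
    """Illustrative sketch: each query attends only to its `keep` highest-scoring keys."""
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (n_q, n_k) attention logits
    # Per-row threshold: the keep-th largest score for each query.
    kth = np.partition(scores, -keep, axis=-1)[:, -keep][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)  # drop all other keys
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the kept keys
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 16)) for _ in range(3))
out = topk_sparse_attention(q, k, v, keep=4)
print(out.shape)  # (8, 16)
```

With `keep` fixed, the softmax and weighted sum touch only `keep` keys per query, which is the source of the speedup when the key sequence is long.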
https://aka.ms/MInference
Stars: 943
Rank: 39186