A tiny pseudo-profiling Python script that estimates KV cache memory and a rough latency budget for sizing a deployment. Inputs: context length, target tokens, batch size, layers/heads/dim, and dtype.
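The estimate described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the names `Shape`, `kv_cache_bytes`, and `decode_latency_s` are hypothetical, "dim" is assumed to mean the per-head dimension, and the latency model assumes decoding is memory-bandwidth bound, so `weight_bytes` and `mem_bandwidth_bytes_per_s` are extra inputs that the description does not list.

```python
from dataclasses import dataclass

# Bytes per element for a few common dtypes (assumed set; extend as needed).
DTYPE_BYTES = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


@dataclass
class Shape:
    """Hypothetical container for the model-shape inputs."""
    layers: int
    kv_heads: int   # key/value heads; equals attention heads without GQA/MQA
    head_dim: int   # assumption: "dim" means the per-head dimension
    dtype: str = "fp16"


def kv_cache_bytes(shape, context_len, target_tokens, batch_size):
    """Peak KV cache size: one key and one value vector per layer, head, and token."""
    seq_len = context_len + target_tokens
    per_token = (2 * shape.layers * shape.kv_heads * shape.head_dim
                 * DTYPE_BYTES[shape.dtype])
    return per_token * seq_len * batch_size


def decode_latency_s(kv_bytes, weight_bytes, target_tokens,
                     mem_bandwidth_bytes_per_s):
    """Rough decode time, assuming every step streams all weights plus the
    full-size KV cache from memory once (a pessimistic, bandwidth-bound model)."""
    per_step = (weight_bytes + kv_bytes) / mem_bandwidth_bytes_per_s
    return per_step * target_tokens


# Example: 32 layers, 32 KV heads, head_dim 128, fp16, 3072 + 1024 tokens, batch 1.
shape = Shape(layers=32, kv_heads=32, head_dim=128, dtype="fp16")
kv = kv_cache_bytes(shape, context_len=3072, target_tokens=1024, batch_size=1)
print(f"KV cache: {kv / 2**30:.1f} GiB")  # prints "KV cache: 2.0 GiB"
```

The cache grows linearly with batch size and total sequence length, so doubling either doubles the estimate. The latency figure ignores prefill, compute limits, and kernel overheads, so it is best treated as a coarse sizing bound rather than a prediction.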