LLM Performance Tools
Model FLOPs Utilization (MFU) Calculator
Assumptions:
- Uses dense (non-sparse) TFLOPS specs only
- Training uses 6N FLOPs per token (2N forward + 4N backward)
- Inference uses 2N FLOPs per token (both sketched in the code after this list)
- Does not account for memory bandwidth limitations
- Does not account for communication overhead
- Does not account for framework inefficiencies
- Does not account for Sliding Window Attention (SWA)
- Does not account for Grouped Query Attention (GQA)
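The arithmetic behind the calculator is simple enough to write down. Below is a minimal sketch under the 6N/2N approximations above; the function and parameter names are illustrative, not the calculator's actual inputs.

```python
def mfu(
    num_params: float,      # model parameter count N
    tokens_per_sec: float,  # observed throughput, summed across all GPUs
    num_gpus: int,
    peak_tflops: float,     # dense (non-sparse) spec per GPU
    training: bool = True,
) -> float:
    """Model FLOPs Utilization: achieved FLOPS / peak hardware FLOPS."""
    # 6N FLOPs per token for training (2N forward + 4N backward), 2N for inference.
    flops_per_token = (6 if training else 2) * num_params
    achieved_flops = flops_per_token * tokens_per_sec
    peak_flops = num_gpus * peak_tflops * 1e12
    return achieved_flops / peak_flops
```

For example, a 7B-parameter model training at 60,000 tokens/s on 8 GPUs rated at 989 dense BF16 TFLOPS each (roughly an H100 SXM's spec) gives mfu(7e9, 60_000, 8, 989) ≈ 0.32.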
Found an issue? Please report it to finbarrtimbers at google's email service dot com or file an issue on GitHub. Include the inputs you used, expected vs. actual results, and any error messages.
KV Cache & Batch Size Calculator
Assumptions:
- KV cache stores both key and value tensors (factor of 2)
- Full attention layers cache the entire sequence
- Sliding window attention layers only cache the window size
- Model weights are sharded evenly across the GPUs in the tensor-parallel (TP) group
- KV cache is replicated on every GPU in the TP group (both modeled in the sketch after this list)
- Does not account for activation memory during forward pass
- Does not account for framework overhead or memory fragmentation
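A minimal sketch of the arithmetic these assumptions imply. All names (num_kv_heads, window, etc.) are illustrative rather than the calculator's actual inputs; for plain multi-head attention, num_kv_heads is simply the number of attention heads.

```python
def kv_cache_bytes_per_seq(
    seq_len: int,
    num_layers: int,
    num_swa_layers: int,      # layers using sliding-window attention
    window: int,              # sliding-window size in tokens
    num_kv_heads: int,
    head_dim: int,
    bytes_per_elem: int = 2,  # fp16/bf16
) -> int:
    """KV cache for one sequence, per the assumptions above."""
    # Factor of 2: both key and value tensors are cached.
    per_token_per_layer = 2 * num_kv_heads * head_dim * bytes_per_elem
    full_layers = num_layers - num_swa_layers
    full = full_layers * seq_len * per_token_per_layer            # full attention caches the entire sequence
    swa = num_swa_layers * min(seq_len, window) * per_token_per_layer  # SWA layers cache only the window
    return full + swa


def max_batch_size(
    gpu_mem_bytes: float,
    weight_bytes: float,  # total model weight bytes before sharding
    tp: int,              # tensor-parallel degree
    kv_per_seq: int,
) -> int:
    """Largest batch that fits, assuming weights are sharded across the TP
    group while the KV cache is replicated on every GPU (per the list above).
    Ignores activation memory, framework overhead, and fragmentation."""
    free = gpu_mem_bytes - weight_bytes / tp
    return max(0, int(free // kv_per_seq))
```

As a sanity check, a Llama-style 7B configuration (32 layers, no SWA, 32 KV heads, head_dim 128, fp16) at a 4,096-token sequence comes out to 2 × 32 × 128 × 2 × 32 × 4096 bytes = exactly 2 GiB of KV cache per sequence.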