# Qwen/Qwen2.5-72B-Instruct on A100-80G

How many A100-80G GPUs are needed to run Qwen/Qwen2.5-72B-Instruct.
## Architecture

| Field | Value |
|---|---|
| model_type | qwen2 |
| attention | GQA (heads=64, kv_heads=8, hd=128) |
| sliding_window | 131072 |
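
The fields above come straight from the model's Hugging Face config; a minimal sketch to re-derive them, assuming `transformers` is installed and the Hub is reachable:

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Qwen/Qwen2.5-72B-Instruct")

heads = cfg.num_attention_heads        # 64
kv_heads = cfg.num_key_value_heads     # 8 -> GQA, 8 query heads per KV head
head_dim = cfg.hidden_size // heads    # 8192 // 64 = 128
print(cfg.model_type, heads, kv_heads, head_dim, cfg.sliding_window)
```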
## Weights

| Field | Value | Label |
|---|---|---|
| safetensors bytes | 135.43 GB | [verified] |
| params | 72.7B | [estimated] |
| quantization | BF16 | [verified] |
## Quantization reconciliation

| Scheme | Predicted | Δ (actual − predicted) | Error |
|---|---|---|---|
| FP16 | 135.42 GB | 1.68 MB over | 0.0% |
| BF16 ✓ | 135.42 GB | 1.68 MB over | 0.0% |
| FP8 | 67.71 GB | 67.71 GB over | 100.0% |
| INT8 | 67.71 GB | 67.71 GB over | 100.0% |
| FP4_FP8_MIXED | 37.24 GB | 98.18 GB over | 263.6% |
Best: BF16 — the safetensors header reports all 23 weight tensors as BF16 (predicting 145,410,752,512 bytes, 0.0% error)
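
A minimal sketch of the reconciliation logic, assuming each scheme is a flat bytes-per-parameter cost (the 0.55 B/param for FP4_FP8_MIXED is back-solved from the table, not llm-cal's actual formula). The parameter count is derived here from the verified BF16 checkpoint size, so BF16 trivially lands at 0%; the table's 1.68 MB Δ comes from llm-cal's independently estimated 72.7B.

```python
ACTUAL_BYTES = 145_410_752_512          # verified safetensors total
params = ACTUAL_BYTES / 2               # BF16: 2 bytes/param -> ~72.7B

BYTES_PER_PARAM = {                     # assumed flat cost per scheme
    "FP16": 2.0, "BF16": 2.0, "FP8": 1.0, "INT8": 1.0,
    "FP4_FP8_MIXED": 0.55,              # back-solved from the 37.24 GB row
}
for scheme, bpp in BYTES_PER_PARAM.items():
    predicted = params * bpp
    err = (ACTUAL_BYTES - predicted) / predicted
    print(f"{scheme:14s} {predicted / 2**30:8.2f} GiB   error {err:7.1%}")
```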
## KV cache per request

| Context tokens | KV cache size |
|---|---|
| 4,096 | 1.25 GB |
| 32,768 | 10.00 GB |
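
These rows follow from the GQA geometry above; a sketch assuming 80 hidden layers (per the model config) and 2-byte BF16 KV entries:

```python
def kv_bytes(tokens, layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V each hold layers * kv_heads * head_dim values per token
    return tokens * layers * 2 * kv_heads * head_dim * dtype_bytes

for t in (4_096, 32_768, 131_072):
    print(f"{t:>7,} tokens -> {kv_bytes(t) / 2**30:6.2f} GiB")
# 4,096 -> 1.25 GiB; 32,768 -> 10.00 GiB; 131,072 -> 40.00 GiB (used below)
```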
## Recommended fleet

| Tier | GPUs | Weight/GPU | Headroom/GPU | Concurrent @ 128K |
|---|---|---|---|---|
| min | 4 | 33.86 GB | 33.20 GB | 3 |
| dev ★ | 8 | 16.93 GB | 50.13 GB | 10 |
| prod | 8 | 16.93 GB | 50.13 GB | 10 |
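
The table reduces to simple arithmetic; a sketch where the 67.06 GB usable-per-GPU figure is inferred from the rows themselves (16.93 + 50.13) and stands in for llm-cal's runtime reserve, which this report does not spell out:

```python
WEIGHTS_GB = 135.43        # verified checkpoint size
USABLE_GB = 67.06          # assumed usable per A100-80G after runtime reserve
KV_128K_GB = 40.0          # per-request KV cache at 131,072 tokens (see above)

for tier, gpus in (("min", 4), ("dev", 8), ("prod", 8)):
    weight_per_gpu = WEIGHTS_GB / gpus
    headroom = USABLE_GB - weight_per_gpu
    concurrent = int(gpus * headroom / KV_128K_GB)
    print(f"{tier:4} {gpus} GPUs  weight {weight_per_gpu:5.2f} GB/GPU  "
          f"headroom {headroom:5.2f} GB/GPU  concurrent@128K {concurrent}")
```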
- Prefill latency: 291 ms @ 2,000 input tokens [estimated]
- Cluster decode throughput: 384 tok/s [estimated]
- Max concurrent users: 12
- Bottleneck: memory_bandwidth (see the roofline sketch below)
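
A roofline sketch behind these estimates, under loud assumptions: ~312 TFLOPS dense BF16 and ~2,039 GB/s HBM per A100-80G, with 40% MFU on prefill (back-solved from the 291 ms figure). The 384 tok/s cluster figure additionally depends on llm-cal's batching model, which this single-stream bound does not reproduce.

```python
PARAMS = 72.7e9
GPUS, TFLOPS, HBM_GBPS, MFU = 8, 312e12, 2039, 0.40   # assumed A100 specs
WEIGHT_GB_PER_GPU = 135.43 / GPUS

# Prefill is compute-bound: ~2 FLOPs per parameter per input token.
prefill_s = (2 * PARAMS * 2000) / (GPUS * TFLOPS * MFU)
print(f"prefill @ 2,000 tokens: {prefill_s * 1e3:.0f} ms")        # ~291 ms

# Decode is memory-bandwidth-bound: each step re-reads the weight shard,
# hence the memory_bandwidth bottleneck label above.
print(f"single-stream decode bound: {HBM_GBPS / WEIGHT_GB_PER_GPU:.0f} tok/s")
```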
## Generated command

```bash
vllm serve Qwen/Qwen2.5-72B-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.9
```
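
Once that server is up it exposes an OpenAI-compatible API, by default on port 8000; a minimal client sketch (host, port, and prompt are assumptions):

```python
from openai import OpenAI

# vLLM ignores the API key by default; any placeholder works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",
    messages=[{"role": "user", "content": "Summarize GQA in one sentence."}],
)
print(resp.choices[0].message.content)
```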
Generated by: `llm-cal Qwen/Qwen2.5-72B-Instruct --gpu A100-80G --engine vllm --lang en`