# Qwen/Qwen2.5-72B-Instruct on A100-80G

How many A100-80G GPUs are needed to run Qwen/Qwen2.5-72B-Instruct.
## Architecture

| Field | Value |
|---|---|
| model_type | qwen2 |
| attention | GQA (heads=64, kv_heads=8, hd=128) |
| sliding_window | 131072 |
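
The fields above come straight from the model's Hugging Face config; a minimal sketch to re-derive them, assuming `transformers` is installed and the Hub is reachable:

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Qwen/Qwen2.5-72B-Instruct")

heads = cfg.num_attention_heads        # 64
kv_heads = cfg.num_key_value_heads     # 8 -> GQA, 8 query heads per KV head
head_dim = cfg.hidden_size // heads    # 8192 // 64 = 128
print(cfg.model_type, heads, kv_heads, head_dim, cfg.sliding_window)
```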
## Weights

| Field | Value | Label |
|---|---|---|
| safetensors bytes | 135.43 GB | [verified] |
| params | 72.7B | [estimated] |
| quantization | BF16 | [verified] |
## Quantization reconciliation

| Scheme | Predicted | Δ (actual − predicted) | Error |
|---|---|---|---|
| FP16 | 135.42 GB | 1.68 MB over | 0.0% |
| BF16 ✓ | 135.42 GB | 1.68 MB over | 0.0% |
| FP8 | 67.71 GB | 67.71 GB over | 100.0% |
| INT8 | 67.71 GB | 67.71 GB over | 100.0% |
| FP4_FP8_MIXED | 37.24 GB | 98.18 GB over | 263.6% |
Best: BF16 — the safetensors header reports all 23 weight tensors as BF16 (predicting 145,410,752,512 bytes, 0.0% error)
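
A minimal sketch of the reconciliation logic, assuming each scheme is a flat bytes-per-parameter cost (the 0.55 B/param for FP4_FP8_MIXED is back-solved from the table, not llm-cal's actual formula). The parameter count is derived here from the verified BF16 checkpoint size, so BF16 trivially lands at 0%; the table's 1.68 MB Δ comes from llm-cal's independently estimated 72.7B.

```python
ACTUAL_BYTES = 145_410_752_512          # verified safetensors total
params = ACTUAL_BYTES / 2               # BF16: 2 bytes/param -> ~72.7B

BYTES_PER_PARAM = {                     # assumed flat cost per scheme
    "FP16": 2.0, "BF16": 2.0, "FP8": 1.0, "INT8": 1.0,
    "FP4_FP8_MIXED": 0.55,              # back-solved from the 37.24 GB row
}
for scheme, bpp in BYTES_PER_PARAM.items():
    predicted = params * bpp
    err = (ACTUAL_BYTES - predicted) / predicted
    print(f"{scheme:14s} {predicted / 2**30:8.2f} GiB   error {err:7.1%}")
```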
## KV cache per request

| Context tokens | KV cache size |
|---|---|
| 4,096 | 1.25 GB |
| 32,768 | 10.00 GB |
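
These rows follow from the GQA geometry above; a sketch assuming 80 hidden layers (per the model config) and 2-byte BF16 KV entries:

```python
def kv_bytes(tokens, layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V each hold layers * kv_heads * head_dim values per token
    return tokens * layers * 2 * kv_heads * head_dim * dtype_bytes

for t in (4_096, 32_768, 131_072):
    print(f"{t:>7,} tokens -> {kv_bytes(t) / 2**30:6.2f} GiB")
# 4,096 -> 1.25 GiB; 32,768 -> 10.00 GiB; 131,072 -> 40.00 GiB (used below)
```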
## Recommended fleet

| Tier | GPUs | Weight/GPU | Headroom/GPU | Concurrent @ 128K |
|---|---|---|---|---|
| min | 4 | 33.86 GB | 33.20 GB | 3 |
| dev ★ | 8 | 16.93 GB | 50.13 GB | 10 |
| prod | 8 | 16.93 GB | 50.13 GB | 10 |
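
The table reduces to simple arithmetic; a sketch where the 67.06 GB usable-per-GPU figure is inferred from the rows themselves (16.93 + 50.13) and stands in for llm-cal's runtime reserve, which this report does not spell out:

```python
WEIGHTS_GB = 135.43        # verified checkpoint size
USABLE_GB = 67.06          # assumed usable per A100-80G after runtime reserve
KV_128K_GB = 40.0          # per-request KV cache at 131,072 tokens (see above)

for tier, gpus in (("min", 4), ("dev", 8), ("prod", 8)):
    weight_per_gpu = WEIGHTS_GB / gpus
    headroom = USABLE_GB - weight_per_gpu
    concurrent = int(gpus * headroom / KV_128K_GB)
    print(f"{tier:4} {gpus} GPUs  weight {weight_per_gpu:5.2f} GB/GPU  "
          f"headroom {headroom:5.2f} GB/GPU  concurrent@128K {concurrent}")
```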
- Prefill latency: 291 ms @ 2,000 input tokens [estimated]
- Cluster decode throughput: 384 tok/s [estimated]
- Max concurrent users: 12
- Bottleneck: memory_bandwidth (see the roofline sketch below)
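
A roofline sketch behind these estimates, under loud assumptions: ~312 TFLOPS dense BF16 and ~2,039 GB/s HBM per A100-80G, with 40% MFU on prefill (back-solved from the 291 ms figure). The 384 tok/s cluster figure additionally depends on llm-cal's batching model, which this single-stream bound does not reproduce.

```python
PARAMS = 72.7e9
GPUS, TFLOPS, HBM_GBPS, MFU = 8, 312e12, 2039, 0.40   # assumed A100 specs
WEIGHT_GB_PER_GPU = 135.43 / GPUS

# Prefill is compute-bound: ~2 FLOPs per parameter per input token.
prefill_s = (2 * PARAMS * 2000) / (GPUS * TFLOPS * MFU)
print(f"prefill @ 2,000 tokens: {prefill_s * 1e3:.0f} ms")        # ~291 ms

# Decode is memory-bandwidth-bound: each step re-reads the weight shard,
# hence the memory_bandwidth bottleneck label above.
print(f"single-stream decode bound: {HBM_GBPS / WEIGHT_GB_PER_GPU:.0f} tok/s")
```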
## Generated command

```bash
vllm serve Qwen/Qwen2.5-72B-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.9
```
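
Once that server is up it exposes an OpenAI-compatible API, by default on port 8000; a minimal client sketch (host, port, and prompt are assumptions):

```python
from openai import OpenAI

# vLLM ignores the API key by default; any placeholder works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",
    messages=[{"role": "user", "content": "Summarize GQA in one sentence."}],
)
print(resp.choices[0].message.content)
```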
Generated by: `llm-cal Qwen/Qwen2.5-72B-Instruct --gpu A100-80G --engine vllm --lang en`