
Qwen/Qwen2.5-72B-Instruct on A100-80G

How many A100-80G GPUs are needed to run Qwen/Qwen2.5-72B-Instruct.

Architecture

| Field | Value |
| --- | --- |
| model_type | qwen2 |
| attention | GQA (heads=64, kv_heads=8, hd=128) |
| sliding_window | 131072 |

Weights

| Field | Value | Label |
| --- | --- | --- |
| safetensors bytes | 135.43 GiB | [verified] |
| params | 72.7B | [estimated] |
| quantization | BF16 | [verified] |

Quantization reconciliation

| Scheme | Predicted | Δ | Error |
| --- | --- | --- | --- |
| FP16 | 135.42 GiB | 1.68 MB over | 0.0% |
| BF16 ✓ | 135.42 GiB | 1.68 MB over | 0.0% |
| FP8 | 67.71 GiB | 67.71 GiB over | 100.0% |
| INT8 | 67.71 GiB | 67.71 GiB over | 100.0% |
| FP4_FP8_MIXED | 37.24 GiB | 98.18 GiB over | 263.6% |

Best match: BF16. The safetensors header lists all 23 weight tensors as BF16, predicting 145,410,752,512 bytes (0.0% error).
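
The reconciliation can be reproduced with a few lines of arithmetic. A minimal sketch, with assumed scheme widths: the FP4_FP8_MIXED value of 0.55 B/param (about 4.4 bits/weight) is back-solved from the 37.24 GiB prediction, not a documented llm-cal constant. The parameter count here is derived from the BF16 byte total, so the BF16 row reconciles exactly; llm-cal estimates params independently, which is where its small 1.68 MB residual comes from.

```python
# Sketch: reproduce the quantization-reconciliation table above.

GIB = 1024 ** 3
ACTUAL_BYTES = 145_410_752_512        # verified safetensors total
PARAMS = ACTUAL_BYTES // 2            # header: every weight tensor is BF16 (2 B each)

BYTES_PER_PARAM = {                   # assumed widths per scheme
    "FP16": 2.0,
    "BF16": 2.0,
    "FP8": 1.0,
    "INT8": 1.0,
    "FP4_FP8_MIXED": 0.55,            # back-solved from 37.24 GiB, ~4.4 bits/weight
}

for scheme, bpp in BYTES_PER_PARAM.items():
    predicted = PARAMS * bpp
    delta = ACTUAL_BYTES - predicted  # positive -> actual file is larger ("over")
    error = abs(delta) / predicted
    print(f"{scheme:14s} {predicted / GIB:7.2f} GiB  "
          f"Δ {delta / GIB:6.2f} GiB  error {error:6.1%}")
```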

KV cache per request

| Context tokens | KV cache |
| --- | --- |
| 4,096 | 1.25 GiB |
| 32,768 | 10.00 GiB |
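
These figures follow directly from the Architecture fields. A minimal sketch, assuming num_hidden_layers = 80 (taken from the model's config.json; the layer count is not listed in the tables here) and a BF16 cache:

```python
# Sketch: per-request KV-cache size from the GQA geometry above.

GIB = 1024 ** 3
NUM_LAYERS = 80     # assumed from config.json, not shown above
KV_HEADS = 8        # Architecture: kv_heads
HEAD_DIM = 128      # Architecture: hd
DTYPE_BYTES = 2     # BF16

# K and V, per layer, per KV head, per head dim -> 320 KiB per token
kv_per_token = 2 * NUM_LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES

for ctx in (4_096, 32_768, 131_072):
    print(f"{ctx:>7,} tokens -> {ctx * kv_per_token / GIB:5.2f} GiB")
# 4,096 -> 1.25 GiB, 32,768 -> 10.00 GiB, 131,072 -> 40.00 GiB
```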

Tiers

| Tier | GPUs | Weight/GPU | Headroom/GPU | Concurrent @ 128K |
| --- | --- | --- | --- | --- |
| min | 4 | 33.86 GiB | 33.20 GiB | 3 |
| dev ★ | 8 | 16.93 GiB | 50.13 GiB | 10 |
| prod | 8 | 16.93 GiB | 50.13 GiB | 10 |
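
A sketch of how the tier rows can be reproduced, assuming weights and KV cache shard evenly across tensor-parallel ranks and that 90% of the A100's 80 GB (decimal) is usable, mirroring --gpu-memory-utilization 0.9:

```python
# Sketch: reproduce the tier table above.

GIB = 1024 ** 3
WEIGHT_BYTES = 145_410_752_512     # verified safetensors total
GPU_BYTES = 80e9                   # A100-80G, decimal gigabytes
UTIL = 0.9                         # matches --gpu-memory-utilization 0.9
KV_128K = 131_072 * 327_680        # 40 GiB per 128K request (see sketch above)

for tier, gpus in (("min", 4), ("dev", 8), ("prod", 8)):
    weight_per_gpu = WEIGHT_BYTES / gpus
    headroom = GPU_BYTES * UTIL - weight_per_gpu
    concurrent = int(headroom // (KV_128K / gpus))
    print(f"{tier:4s} {gpus} GPUs  weight/GPU {weight_per_gpu / GIB:5.2f} GiB  "
          f"headroom/GPU {headroom / GIB:5.2f} GiB  @128K: {concurrent}")
```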

Performance

  • Prefill latency: 291 ms @ 2,000 input tokens [estimated]
  • Cluster decode throughput: 384 tok/s [estimated]
  • Max concurrent users: 12
  • Bottleneck: memory_bandwidth (see the roofline sketch below)
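
A rough roofline reconstruction of these estimates. The A100 peaks (312 TFLOP/s BF16, 2,039 GB/s HBM) are published figures, but the 40% prefill MFU is back-solved from the 291 ms number, and the batching and efficiency assumptions behind the 384 tok/s cluster figure are not shown in the report:

```python
# Sketch: roofline estimates behind the Performance numbers above.

PARAMS = 72.7e9
GPUS = 8
PEAK_FLOPS = 312e12      # per-GPU BF16 dense peak
PEAK_BW = 2.039e12       # per-GPU HBM bandwidth, bytes/s
MFU = 0.40               # assumed prefill efficiency (back-solved)

# Prefill is compute-bound: ~2 FLOPs per parameter per prompt token.
prompt = 2_000
prefill_s = 2 * PARAMS * prompt / (GPUS * PEAK_FLOPS * MFU)
print(f"prefill @ {prompt} tokens: {prefill_s * 1e3:.0f} ms")    # ~291 ms

# Decode streams all weights every step, hence the memory_bandwidth
# bottleneck. Each step emits one token per concurrent request, so
# batch size multiplies effective throughput toward the 384 tok/s figure.
steps = GPUS * PEAK_BW / (PARAMS * 2)
print(f"decode upper bound: {steps:.0f} steps/s")                # ~112
```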

Generated command

vllm serve Qwen/Qwen2.5-72B-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.9
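
The flags mirror the sizing above: --tensor-parallel-size 8 is the dev/prod tier, --gpu-memory-utilization 0.9 is the 90% usable-memory assumption behind the headroom column, and --max-model-len 32768 caps per-request KV cache at 10 GiB. Raising the context limit toward 128K trades headroom for concurrency, as the tier table's "Concurrent @ 128K" column shows.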

Generated by:

llm-cal Qwen/Qwen2.5-72B-Instruct --gpu A100-80G --engine vllm --lang en