Qwen/Qwen3-30B-A3B on H100

How many H100 GPUs are needed to run Qwen/Qwen3-30B-A3B.

Architecture

Field       Value
model_type  qwen3_moe
attention   GQA (heads=32, kv_heads=4, head_dim=128)
moe         128 routed experts + 0 shared, top-8 routing
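
These fields come straight from the model's config.json, so they are easy to spot-check. A minimal sketch using transformers (the attribute names follow the Qwen3-MoE config class; treat them as assumptions on older transformers releases):

from transformers import AutoConfig

# Pull the architecture fields from the Hub; requires network access.
cfg = AutoConfig.from_pretrained("Qwen/Qwen3-30B-A3B")
print(cfg.model_type)            # qwen3_moe
print(cfg.num_attention_heads)   # 32
print(cfg.num_key_value_heads)   # 4  (GQA)
print(cfg.head_dim)              # 128
print(cfg.num_experts)           # 128 routed experts
print(cfg.num_experts_per_tok)   # 8  (top-8 routing)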

Weights

Field              Value      Label
safetensors bytes  56.87 GiB  [verified]
params             30.5B      [estimated]
quantization       BF16       [verified]

Quantization reconciliation

Scheme         Predicted  Δ (actual − predicted)  Error
FP16           56.87 GiB  2.25 MiB over           0.0%
BF16 ✓         56.87 GiB  2.25 MiB over           0.0%
FP8            28.44 GiB  28.44 GiB over          100.0%
INT8           28.44 GiB  28.44 GiB over          100.0%
FP4_FP8_MIXED  15.64 GiB  41.23 GiB over          263.7%

Best match: BF16. The safetensors header shows all 1,262 weight tensors are BF16, which predicts 61,064,216,576 bytes (0.0% error against the actual file size).
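
The reconciliation is plain arithmetic: predict the checkpoint size as parameter count × bytes per parameter for each scheme, then compare with the actual safetensors total. A minimal sketch (the parameter count is derived here from the verified byte total, so the BF16 Δ comes out at exactly zero rather than the tool's 2.25 MiB residual):

ACTUAL = 61_064_216_576                 # verified safetensors bytes
PARAMS = ACTUAL // 2                    # ~30.5B params, assuming 2 B/param (BF16)

BYTES_PER_PARAM = {"FP16": 2, "BF16": 2, "FP8": 1, "INT8": 1}

for scheme, width in BYTES_PER_PARAM.items():
    predicted = PARAMS * width
    delta = ACTUAL - predicted          # positive = actual file is larger
    error = abs(delta) / predicted * 100
    print(f"{scheme:5s} predicted {predicted / 2**30:6.2f} GiB  "
          f"Δ {delta / 2**30:6.2f} GiB  error {error:5.1f}%")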

KV cache per request

Context tokens  KV cache size
4,096           384.00 MiB
32,768          3.00 GiB
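
These figures are reproducible from the GQA shape above: per token, the cache stores K and V for every layer, i.e. 2 × kv_heads × head_dim × dtype bytes × layers. The layer count is not listed in the report; 48 is assumed below because it reproduces the table exactly. A sketch:

# KV bytes per token = 2 (K and V) * kv_heads * head_dim * dtype bytes * layers.
# The layer count (48) is an assumption; it matches the table above exactly.
KV_HEADS, HEAD_DIM, DTYPE_BYTES, LAYERS = 4, 128, 2, 48

per_token = 2 * KV_HEADS * HEAD_DIM * DTYPE_BYTES * LAYERS
print(per_token)                        # 98304 bytes = 96 KiB/token
print(4_096 * per_token / 2**20)        # 384.0 MiB
print(32_768 * per_token / 2**30)       # 3.0 GiB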

Tiers

Tier   GPUs  Weight/GPU  Headroom/GPU  Concurrent @ 128K
min    2     28.44 GiB   38.62 GiB     6
dev ★  4     14.22 GiB   52.84 GiB     17
prod   4     14.22 GiB   52.84 GiB     17
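
The Concurrent @ 128K column follows from the same KV math: a 128K-token request pins 131,072 × 96 KiB = 12 GiB of cache, and each tier's aggregate headroom is divided by that. A sketch using the table's headroom figures:

PER_TOKEN = 98_304                        # KV bytes/token, from the sketch above
PER_REQUEST = 131_072 * PER_TOKEN         # 12 GiB of KV per 128K-token request

for tier, gpus, headroom_gib in [("min", 2, 38.62), ("dev", 4, 52.84), ("prod", 4, 52.84)]:
    total = gpus * headroom_gib * 2**30   # aggregate headroom across the tier
    print(tier, int(total // PER_REQUEST))  # min 6, dev 17, prod 17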

Performance

  • Prefill latency: 77 ms @ 2,000 input tokens [estimated]
  • Cluster decode throughput: 395 tok/s [estimated]
  • Max concurrent users: 13
  • Bottleneck: memory bandwidth
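
The tool's latency model isn't shown, but the memory-bandwidth verdict is easy to sanity-check with a first-order decode roofline: each decode step must stream the touched expert weights plus every active request's KV cache through HBM. A rough sketch in which everything except the report's batch size is an assumption (≈3.35 TB/s HBM3 per H100, 32K context per user):

# First-order decode roofline; a lower bound on step time, not the tool's model.
HBM_BW = 3.35e12 * 4            # assumed aggregate HBM bytes/s across 4 H100s
WEIGHTS = 61.06e9               # worst case: every expert is touched each step
USERS, CTX = 13, 32_768         # batch size from the report; context assumed
KV_READ = USERS * CTX * 98_304  # KV bytes streamed per decode step

step_time = (WEIGHTS + KV_READ) / HBM_BW
print(USERS / step_time)        # ~1,700 tok/s bandwidth ceiling; the report's
                                # 395 tok/s sits well under it (routing,
                                # all-to-all, and kernel overheads not modeled)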

Generated command

vllm serve Qwen/Qwen3-30B-A3B \
  --tensor-parallel-size 4 \
  --max-model-len 40960 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --enable-expert-parallel
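
vLLM exposes an OpenAI-compatible endpoint (port 8000 by default), so a quick smoke test against the server above might look like this:

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",          # must match the served model name
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)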

Generated by:

llm-cal Qwen/Qwen3-30B-A3B --gpu H100 --engine vllm --lang en