# Qwen/Qwen3-30B-A3B on H100

How many H100 GPUs it takes to run Qwen/Qwen3-30B-A3B.
## Architecture

| Field | Value |
|---|---|
| model_type | qwen3_moe |
| attention | GQA (heads=32, kv_heads=4, hd=128) |
| moe | 128 routed + 0 shared, top-8 |
## Weights

| Field | Value | Label |
|---|---|---|
| safetensors bytes | 56.87 GB | [verified] |
| params | 30.5B | [estimated] |
| quantization | BF16 | [verified] |
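The [estimated] parameter count follows directly from the payload size; a minimal sketch, assuming pure BF16 storage (2 bytes per parameter), which the reconciliation below confirms:

```python
# Estimate parameter count from the safetensors payload size.
SAFETENSORS_BYTES = 61_064_216_576  # 56.87 GiB, from the safetensors header
BYTES_PER_PARAM_BF16 = 2            # BF16 stores one parameter in 2 bytes

params = SAFETENSORS_BYTES / BYTES_PER_PARAM_BF16
print(f"{params / 1e9:.1f}B params")  # → 30.5B params
```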
## Quantization reconciliation

| Scheme | Predicted | Δ | Error |
|---|---|---|---|
| FP16 | 56.87 GB | 2.25 MB over | 0.0% |
| BF16 ✓ | 56.87 GB | 2.25 MB over | 0.0% |
| FP8 | 28.44 GB | 28.44 GB over | 100.0% |
| INT8 | 28.44 GB | 28.44 GB over | 100.0% |
| FP4_FP8_MIXED | 15.64 GB | 41.23 GB over | 263.7% |

Best match: BF16. The safetensors header shows all 1,262 weight tensors are BF16, predicting 61,064,216,576 bytes (0.0% error).
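The reconciliation can be reproduced by predicting a size per scheme from bytes-per-parameter and comparing against the actual payload. FP16/BF16/FP8/INT8 widths are standard; the mixed-scheme factor below is an assumption back-derived from the table, so its row may differ in the last rounding digit:

```python
# Predict on-disk size for each quantization scheme from the parameter count
# and report how far the actual 56.87 GiB payload deviates from each prediction.
ACTUAL_BYTES = 61_064_216_576
PARAMS = ACTUAL_BYTES / 2       # 30.53B params, since the header shows pure BF16
GiB = 2**30

schemes = {                      # effective bytes per parameter
    "FP16": 2.0,
    "BF16": 2.0,
    "FP8": 1.0,
    "INT8": 1.0,
    "FP4_FP8_MIXED": 0.55,       # assumption: mixed recipes vary; ~0.55 fits the table
}
for name, bpp in schemes.items():
    predicted = PARAMS * bpp
    error = (ACTUAL_BYTES - predicted) / predicted * 100
    print(f"{name:14s} {predicted / GiB:6.2f} GB  error {error:6.1f}%")
```

BF16 and FP16 are indistinguishable by size alone (both 2 bytes/param); the tool breaks the tie by reading the dtype recorded in the safetensors header.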
## KV cache per request

| Context tokens | KV bytes |
|---|---|
| 4,096 | 384.00 MB |
| 32,768 | 3.00 GB |
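These figures follow from the GQA geometry above; a sketch assuming 48 transformer layers, which is not listed in the architecture table but is the only layer count consistent with 384.00 MB at 4,096 tokens given kv_heads=4, hd=128, BF16:

```python
# Per-request KV-cache size for GQA in BF16:
# 2 (K and V) × layers × kv_heads × head_dim × dtype_bytes, per token.
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 48, 4, 128, 2  # 48 layers: assumption

bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES  # 98,304 B = 96 KiB
for ctx in (4_096, 32_768):
    kv = ctx * bytes_per_token
    print(f"{ctx:>6} tokens -> {kv / 2**20:.2f} MiB")  # 384.00 MiB, 3072.00 MiB
```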
## Recommended fleet

| Tier | GPUs | Weight/GPU | Headroom/GPU | Concurrent @ 128K |
|---|---|---|---|---|
| min | 2 | 28.44 GB | 38.62 GB | 6 |
| dev ★ | 4 | 14.22 GB | 52.84 GB | 17 |
| prod | 4 | 14.22 GB | 52.84 GB | 17 |
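The fleet rows can be reconstructed from the weight and KV figures. Two assumptions in the sketch below: usable memory is ~67.06 GiB per H100 (the weight + headroom sum implied by the table, i.e. 80 GB at 0.9 utilization minus runtime overhead), and the KV cache is sharded evenly across GPUs under tensor parallelism:

```python
# Derive weight/GPU, headroom/GPU, and concurrent 128K requests per tier.
GiB = 2**30
WEIGHTS_GIB = 56.87
USABLE_PER_GPU_GIB = 67.06             # assumption: implied by the table rows
KV_128K_GIB = 131_072 * 98_304 / GiB   # 12 GiB of KV per request at 128K context

for gpus in (2, 4):
    weight_per_gpu = WEIGHTS_GIB / gpus
    headroom = USABLE_PER_GPU_GIB - weight_per_gpu
    # With tensor parallelism, each GPU holds 1/gpus of every request's KV cache.
    concurrent = int(headroom // (KV_128K_GIB / gpus))
    print(f"{gpus} GPUs: {weight_per_gpu:.2f} GB weights/GPU, "
          f"{headroom:.2f} GB headroom/GPU, {concurrent} concurrent @128K")
```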
- Prefill latency: 77 ms @ 2,000 input tokens [estimated]
- Cluster decode throughput: 395 tok/s [estimated]
- Max concurrent users: 13
- Bottleneck: memory_bandwidth
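A first-order roofline sketch of why decode is memory-bandwidth-bound: each decode step must stream the active weights from HBM. All numbers below are assumptions (~3.3B active params per the "A3B" naming, ~3.35 TB/s for H100 SXM); the resulting ceiling is per GPU and well above the estimated 395 tok/s, which additionally accounts for attention, KV reads, and communication overheads:

```python
# Roofline decode ceiling: time per token >= active weight bytes / HBM bandwidth.
ACTIVE_PARAMS = 3.3e9   # assumption: top-8 of 128 experts -> ~3.3B active params
BYTES_PER_PARAM = 2     # BF16
HBM_BW = 3.35e12        # bytes/s, assumption: H100 SXM-class memory bandwidth

step_time = ACTIVE_PARAMS * BYTES_PER_PARAM / HBM_BW
print(f"roofline decode ceiling ~ {1 / step_time:.0f} tok/s per sequence")
```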
## Generated command

    vllm serve Qwen/Qwen3-30B-A3B \
      --tensor-parallel-size 4 \
      --max-model-len 40960 \
      --trust-remote-code \
      --gpu-memory-utilization 0.9 \
      --enable-expert-parallel
Generated by:

    llm-cal Qwen/Qwen3-30B-A3B --gpu H100 --engine vllm --lang en