# Qwen/Qwen3-30B-A3B on H100

How many H100 GPUs it takes to run Qwen/Qwen3-30B-A3B.
## Architecture

| Field | Value |
|---|---|
| model_type | qwen3_moe |
| attention | GQA (heads=32, kv_heads=4, hd=128) |
| moe | 128 routed + 0 shared, top-8 |
## Weights

| Field | Value | Label |
|---|---|---|
| safetensors bytes | 56.87 GB | [verified] |
| params | 30.5B | [estimated] |
| quantization | BF16 | [verified] |
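The [estimated] parameter count follows directly from the payload size; a minimal sketch, assuming pure BF16 storage (2 bytes per parameter), which the reconciliation below confirms:

```python
# Estimate parameter count from the safetensors payload size.
SAFETENSORS_BYTES = 61_064_216_576  # 56.87 GiB, from the safetensors header
BYTES_PER_PARAM_BF16 = 2            # BF16 stores one parameter in 2 bytes

params = SAFETENSORS_BYTES / BYTES_PER_PARAM_BF16
print(f"{params / 1e9:.1f}B params")  # → 30.5B params
```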
## Quantization reconciliation

| Scheme | Predicted | Δ | Error |
|---|---|---|---|
| FP16 | 56.87 GB | 2.25 MB over | 0.0% |
| BF16 ✓ | 56.87 GB | 2.25 MB over | 0.0% |
| FP8 | 28.44 GB | 28.44 GB over | 100.0% |
| INT8 | 28.44 GB | 28.44 GB over | 100.0% |
| FP4_FP8_MIXED | 15.64 GB | 41.23 GB over | 263.7% |

Best match: BF16. The safetensors header shows all 1,262 weight tensors are BF16, predicting 61,064,216,576 bytes (0.0% error).
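The reconciliation can be reproduced by predicting a size per scheme from bytes-per-parameter and comparing against the actual payload. FP16/BF16/FP8/INT8 widths are standard; the mixed-scheme factor below is an assumption back-derived from the table, so its row may differ in the last rounding digit:

```python
# Predict on-disk size for each quantization scheme from the parameter count
# and report how far the actual 56.87 GiB payload deviates from each prediction.
ACTUAL_BYTES = 61_064_216_576
PARAMS = ACTUAL_BYTES / 2       # 30.53B params, since the header shows pure BF16
GiB = 2**30

schemes = {                      # effective bytes per parameter
    "FP16": 2.0,
    "BF16": 2.0,
    "FP8": 1.0,
    "INT8": 1.0,
    "FP4_FP8_MIXED": 0.55,       # assumption: mixed recipes vary; ~0.55 fits the table
}
for name, bpp in schemes.items():
    predicted = PARAMS * bpp
    error = (ACTUAL_BYTES - predicted) / predicted * 100
    print(f"{name:14s} {predicted / GiB:6.2f} GB  error {error:6.1f}%")
```

BF16 and FP16 are indistinguishable by size alone (both 2 bytes/param); the tool breaks the tie by reading the dtype recorded in the safetensors header.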
## KV cache per request

| Context tokens | KV bytes |
|---|---|
| 4,096 | 384.00 MB |
| 32,768 | 3.00 GB |
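These figures follow from the GQA geometry above; a sketch assuming 48 transformer layers, which is not listed in the architecture table but is the only layer count consistent with 384.00 MB at 4,096 tokens given kv_heads=4, hd=128, BF16:

```python
# Per-request KV-cache size for GQA in BF16:
# 2 (K and V) × layers × kv_heads × head_dim × dtype_bytes, per token.
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 48, 4, 128, 2  # 48 layers: assumption

bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES  # 98,304 B = 96 KiB
for ctx in (4_096, 32_768):
    kv = ctx * bytes_per_token
    print(f"{ctx:>6} tokens -> {kv / 2**20:.2f} MiB")  # 384.00 MiB, 3072.00 MiB
```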
## Recommended fleet

| Tier | GPUs | Weight/GPU | Headroom/GPU | Concurrent @ 128K |
|---|---|---|---|---|
| min | 2 | 28.44 GB | 38.62 GB | 6 |
| dev ★ | 4 | 14.22 GB | 52.84 GB | 17 |
| prod | 4 | 14.22 GB | 52.84 GB | 17 |
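The fleet rows can be reconstructed from the weight and KV figures. Two assumptions in the sketch below: usable memory is ~67.06 GiB per H100 (the weight + headroom sum implied by the table, i.e. 80 GB at 0.9 utilization minus runtime overhead), and the KV cache is sharded evenly across GPUs under tensor parallelism:

```python
# Derive weight/GPU, headroom/GPU, and concurrent 128K requests per tier.
GiB = 2**30
WEIGHTS_GIB = 56.87
USABLE_PER_GPU_GIB = 67.06             # assumption: implied by the table rows
KV_128K_GIB = 131_072 * 98_304 / GiB   # 12 GiB of KV per request at 128K context

for gpus in (2, 4):
    weight_per_gpu = WEIGHTS_GIB / gpus
    headroom = USABLE_PER_GPU_GIB - weight_per_gpu
    # With tensor parallelism, each GPU holds 1/gpus of every request's KV cache.
    concurrent = int(headroom // (KV_128K_GIB / gpus))
    print(f"{gpus} GPUs: {weight_per_gpu:.2f} GB weights/GPU, "
          f"{headroom:.2f} GB headroom/GPU, {concurrent} concurrent @128K")
```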
- Prefill latency: 77 ms @ 2,000 input tokens [estimated]
- Cluster decode throughput: 395 tok/s [estimated]
- Max concurrent users: 13
- Bottleneck: memory_bandwidth
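A first-order roofline sketch of why decode is memory-bandwidth-bound: each decode step must stream the active weights from HBM. All numbers below are assumptions (~3.3B active params per the "A3B" naming, ~3.35 TB/s for H100 SXM); the resulting ceiling is per GPU and well above the estimated 395 tok/s, which additionally accounts for attention, KV reads, and communication overheads:

```python
# Roofline decode ceiling: time per token >= active weight bytes / HBM bandwidth.
ACTIVE_PARAMS = 3.3e9   # assumption: top-8 of 128 experts -> ~3.3B active params
BYTES_PER_PARAM = 2     # BF16
HBM_BW = 3.35e12        # bytes/s, assumption: H100 SXM-class memory bandwidth

step_time = ACTIVE_PARAMS * BYTES_PER_PARAM / HBM_BW
print(f"roofline decode ceiling ~ {1 / step_time:.0f} tok/s per sequence")
```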
## Generated command

    vllm serve Qwen/Qwen3-30B-A3B \
      --tensor-parallel-size 4 \
      --max-model-len 40960 \
      --trust-remote-code \
      --gpu-memory-utilization 0.9 \
      --enable-expert-parallel
Generated by:

    llm-cal Qwen/Qwen3-30B-A3B --gpu H100 --engine vllm --lang en