deepseek-ai/DeepSeek-V4-Flash on H100
How many H100 GPUs it takes to run deepseek-ai/DeepSeek-V4-Flash.
Architecture
| Field | Value |
|---|---|
| model_type | deepseek_v4 |
| attention | CSA_HCA (heads=64, kv_heads=1, hd=512) |
| moe | 256 routed + 1 shared, top-6 |
| sliding_window | 128 |
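These fields can be cross-checked against the repo's config.json directly. A minimal sketch, assuming the usual Hugging Face config layout; the attention/MoE key names below are guesses at common DeepSeek-style fields, not confirmed for this repo:

```python
import json
from huggingface_hub import hf_hub_download

# Download only config.json from the Hub (a few KB, no weights).
path = hf_hub_download("deepseek-ai/DeepSeek-V4-Flash", "config.json")
with open(path) as f:
    cfg = json.load(f)

# Key names are assumptions based on common HF/DeepSeek configs;
# dump the whole dict if they don't match the actual repo.
for key in ("model_type", "num_attention_heads", "num_key_value_heads",
            "n_routed_experts", "num_experts_per_tok", "sliding_window"):
    print(key, "=", cfg.get(key))
```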
Weights
| Field | Value | Label |
|---|---|---|
| safetensors bytes | 148.66 GB | [verified] |
| params | 290.9B | [estimated] |
| quantization | FP4_FP8_MIXED | [verified] |
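The [estimated] parameter count is consistent with the file size: at roughly 4 bits per weight plus per-block scales, 290.9B parameters land near the observed bytes. A quick back-of-the-envelope check (pure arithmetic, no downloads; the block size of 32 is an assumption typical of MX formats):

```python
# Sanity-check params vs file size for a mostly-4-bit checkpoint.
total_bytes = 160_014_306_918   # predicted size from the reconciliation below
params = 290.9e9                # estimated parameter count

bytes_per_param = total_bytes / params
print(f"{bytes_per_param:.3f} bytes/param")   # ~0.550

# ~0.5 bytes/param is pure FP4; the remaining ~0.05 bytes/param is roughly
# what E8M0 block scales add at an assumed block size of 32
# (1 scale byte per 32 weights ~= 0.031 bytes/param), plus the few
# full-FP8 tensors.
```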
Quantization reconciliation
| Scheme | Predicted | Δ (actual vs predicted) | Error |
|---|---|---|---|
| FP4_FP8_MIXED ✓ | 149.02 GB | 378.76 MB under | 0.2% |
| GPTQ_INT4 | 149.02 GB | 378.76 MB under | 0.2% |
| AWQ_INT4 | 149.02 GB | 378.76 MB under | 0.2% |
| INT4 | 135.48 GB | 13.18 GB over | 9.7% |
| FP8 | 270.95 GB | 122.30 GB under | 45.1% |
Best match: FP4_FP8_MIXED. The safetensors header shows F8_E8M0 scale tensors, 768 packed-I8 (FP4) weight tensors, and 9 FP8 weight tensors, i.e. an MX block-scaled mixed pack. It predicts 160,014,306,918 bytes, i.e. 149.02 GiB (the "GB" figures in these tables are binary, base-1024), for 0.2% error against the actual file size.
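The per-dtype tensor census above can be reproduced from the safetensors headers, which are plain JSON at the start of each shard. A minimal sketch over a locally downloaded copy (the glob path is a placeholder):

```python
import glob
import json
import struct
from collections import Counter

counts, bytes_by_dtype = Counter(), Counter()

# Each .safetensors file starts with a little-endian u64 header length,
# then a JSON header mapping tensor name -> {dtype, shape, data_offsets}.
for shard in glob.glob("DeepSeek-V4-Flash/*.safetensors"):
    with open(shard, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    for name, meta in header.items():
        if name == "__metadata__":
            continue
        start, end = meta["data_offsets"]
        counts[meta["dtype"]] += 1
        bytes_by_dtype[meta["dtype"]] += end - start

for dtype in counts:
    print(f"{dtype}: {counts[dtype]} tensors, {bytes_by_dtype[dtype]:,} bytes")
```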
KV cache per request
| Context tokens | KV bytes |
|---|---|
| 4,096 | 65.72 MB |
| 32,768 | 525.77 MB |
| 131,072 | 2.05 GB |
| 1,048,576 | 16.43 GB |
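KV usage scales linearly with context here (per the 4,096-token row, about 16.43 KiB per token), so other context lengths follow by scaling. A minimal helper anchored on that row:

```python
# Per-token KV footprint derived from the 4,096-token row above.
KV_BYTES_PER_TOKEN = 65.72 * 1024**2 / 4096   # ~= 16.43 KiB/token

def kv_bytes(context_tokens: int) -> float:
    """KV cache bytes for one request at the given context length."""
    return context_tokens * KV_BYTES_PER_TOKEN

for ctx in (4_096, 32_768, 131_072, 1_048_576):
    print(f"{ctx:>9} tokens -> {kv_bytes(ctx) / 1024**3:.2f} GiB")
```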
Recommended fleet
| Tier | GPUs | Weight/GPU | Headroom/GPU | Concurrent @ 128K |
|---|---|---|---|---|
| min | 4 | 37.16 GB | 29.89 GB | 14 |
| dev ★ | 4 | 37.16 GB | 29.89 GB | 14 |
| prod | 8 | 18.58 GB | 48.47 GB | 23 |
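The tier numbers reconcile cleanly: weights split evenly across GPUs under tensor parallelism, headroom is the 0.9-utilization budget minus the weight shard, and concurrency is per-GPU headroom divided by the 128K KV footprint. A sketch that reproduces the table, treating the 80 GB card size and the per-GPU concurrency rule as assumptions about how llm-cal computes this:

```python
H100_BYTES = 80e9   # 80 GB card, ~= 74.51 GiB
GIB = 1024**3

weights_gib = 148.66   # total weight size (binary GB), from the Weights table
kv_128k_gib = 2.05     # KV per request at 128K context, from the table above

for tier, gpus in (("min", 4), ("dev", 4), ("prod", 8)):
    usable = 0.9 * H100_BYTES / GIB           # --gpu-memory-utilization 0.9
    weight_per_gpu = weights_gib / gpus       # even tensor-parallel shard
    headroom = usable - weight_per_gpu
    concurrent = int(headroom / kv_128k_gib)  # assumed per-GPU rule
    print(f"{tier:>4}: {weight_per_gpu:.2f} GiB/GPU weights, "
          f"{headroom:.2f} GiB headroom, {concurrent} @ 128K")
```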
- Prefill latency: 735 ms @ 2,000 input tokens [estimated]
- Cluster decode throughput: 151 tok/s [estimated]
- Max concurrent users: 5
- Bottleneck: memory_bandwidth (see the roofline sketch below)
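A memory-bandwidth bottleneck means each decoded token must stream the active weights plus KV from HBM, so a roofline bound is effective bandwidth divided by bytes touched per token. A rough sketch of that reasoning; the bandwidth efficiency and active-byte figures are illustrative assumptions, not the exact inputs llm-cal used, so the output will not match 151 tok/s exactly:

```python
# Roofline-style decode bound: tokens/s <= effective bandwidth / bytes per token.
HBM_BW = 3.35e12     # H100 SXM HBM3 bandwidth, bytes/s
GPUS = 4
EFFICIENCY = 0.6     # assumed achievable fraction of peak bandwidth

# Assumed bytes read per decode step: the router activates only top-6 of
# 256 routed experts (+1 shared), so far fewer bytes than the full
# ~160 GB checkpoint move per token. This figure is an illustrative guess.
active_bytes_per_token = 30e9

bound = GPUS * HBM_BW * EFFICIENCY / active_bytes_per_token
print(f"decode bound ~= {bound:.0f} tok/s")
```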
Generated command
```
vllm serve deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 4 \
  --max-model-len 1048576 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --attention-backend auto
```
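Once the server is up, it exposes vLLM's OpenAI-compatible API (on port 8000 by default for `vllm serve`). A minimal smoke test:

```python
import requests

# Query vLLM's OpenAI-compatible completions endpoint.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "deepseek-ai/DeepSeek-V4-Flash",
        "prompt": "The capital of France is",
        "max_tokens": 16,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["text"])
```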
Generated by: `llm-cal deepseek-ai/DeepSeek-V4-Flash --gpu H100 --engine vllm --lang en`