deepseek-ai/DeepSeek-V4-Flash on H100
How many H100 GPUs it takes to run deepseek-ai/DeepSeek-V4-Flash.
Architecture
| Field | Value |
|---|---|
| model_type | deepseek_v4 |
| attention | CSA_HCA (heads=64, kv_heads=1, hd=512) |
| moe | 256 routed + 1 shared, top-6 |
| sliding_window | 128 |
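These fields can be cross-checked against the repo's config.json directly. A minimal sketch, assuming the usual Hugging Face config layout; the attention/MoE key names below are guesses at common DeepSeek-style fields, not confirmed for this repo:

```python
import json
from huggingface_hub import hf_hub_download

# Download only config.json from the Hub (a few KB, no weights).
path = hf_hub_download("deepseek-ai/DeepSeek-V4-Flash", "config.json")
with open(path) as f:
    cfg = json.load(f)

# Key names are assumptions based on common HF/DeepSeek configs;
# dump the whole dict if they don't match the actual repo.
for key in ("model_type", "num_attention_heads", "num_key_value_heads",
            "n_routed_experts", "num_experts_per_tok", "sliding_window"):
    print(key, "=", cfg.get(key))
```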
Weights
| Field | Value | Label |
|---|---|---|
| safetensors bytes | 148.66 GB | [verified] |
| params | 290.9B | [estimated] |
| quantization | FP4_FP8_MIXED | [verified] |
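The [estimated] parameter count is consistent with the file size: at roughly 4 bits per weight plus per-block scales, 290.9B parameters land near the observed bytes. A quick back-of-the-envelope check (pure arithmetic, no downloads; the block size of 32 is an assumption typical of MX formats):

```python
# Sanity-check params vs file size for a mostly-4-bit checkpoint.
total_bytes = 160_014_306_918   # predicted size from the reconciliation below
params = 290.9e9                # estimated parameter count

bytes_per_param = total_bytes / params
print(f"{bytes_per_param:.3f} bytes/param")   # ~0.550

# ~0.5 bytes/param is pure FP4; the remaining ~0.05 bytes/param is roughly
# what E8M0 block scales add at an assumed block size of 32
# (1 scale byte per 32 weights ~= 0.031 bytes/param), plus the few
# full-FP8 tensors.
```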
Quantization reconciliation
| Scheme | Predicted | Δ (actual vs predicted) | Error |
|---|---|---|---|
| FP4_FP8_MIXED ✓ | 149.02 GB | 378.76 MB under | 0.2% |
| GPTQ_INT4 | 149.02 GB | 378.76 MB under | 0.2% |
| AWQ_INT4 | 149.02 GB | 378.76 MB under | 0.2% |
| INT4 | 135.48 GB | 13.18 GB over | 9.7% |
| FP8 | 270.95 GB | 122.30 GB under | 45.1% |
Best match: FP4_FP8_MIXED. The safetensors header shows F8_E8M0 scale tensors, 768 packed-I8 (FP4) weight tensors, and 9 FP8 weight tensors, i.e. an MX block-scaled mixed pack. It predicts 160,014,306,918 bytes, i.e. 149.02 GiB (the "GB" figures in these tables are binary, base-1024), for 0.2% error against the actual file size.
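The per-dtype tensor census above can be reproduced from the safetensors headers, which are plain JSON at the start of each shard. A minimal sketch over a locally downloaded copy (the glob path is a placeholder):

```python
import glob
import json
import struct
from collections import Counter

counts, bytes_by_dtype = Counter(), Counter()

# Each .safetensors file starts with a little-endian u64 header length,
# then a JSON header mapping tensor name -> {dtype, shape, data_offsets}.
for shard in glob.glob("DeepSeek-V4-Flash/*.safetensors"):
    with open(shard, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    for name, meta in header.items():
        if name == "__metadata__":
            continue
        start, end = meta["data_offsets"]
        counts[meta["dtype"]] += 1
        bytes_by_dtype[meta["dtype"]] += end - start

for dtype in counts:
    print(f"{dtype}: {counts[dtype]} tensors, {bytes_by_dtype[dtype]:,} bytes")
```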
KV cache per request
| Context tokens | KV bytes |
|---|---|
| 4,096 | 65.72 MB |
| 32,768 | 525.77 MB |
| 131,072 | 2.05 GB |
| 1,048,576 | 16.43 GB |
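KV usage scales linearly with context here (per the 4,096-token row, about 16.43 KiB per token), so other context lengths follow by scaling. A minimal helper anchored on that row:

```python
# Per-token KV footprint derived from the 4,096-token row above.
KV_BYTES_PER_TOKEN = 65.72 * 1024**2 / 4096   # ~= 16.43 KiB/token

def kv_bytes(context_tokens: int) -> float:
    """KV cache bytes for one request at the given context length."""
    return context_tokens * KV_BYTES_PER_TOKEN

for ctx in (4_096, 32_768, 131_072, 1_048_576):
    print(f"{ctx:>9} tokens -> {kv_bytes(ctx) / 1024**3:.2f} GiB")
```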
Recommended fleet
| Tier | GPUs | Weight/GPU | Headroom/GPU | Concurrent @ 128K |
|---|---|---|---|---|
| min | 4 | 37.16 GB | 29.89 GB | 14 |
| dev ★ | 4 | 37.16 GB | 29.89 GB | 14 |
| prod | 8 | 18.58 GB | 48.47 GB | 23 |
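The tier numbers reconcile cleanly: weights split evenly across GPUs under tensor parallelism, headroom is the 0.9-utilization budget minus the weight shard, and concurrency is per-GPU headroom divided by the 128K KV footprint. A sketch that reproduces the table, treating the 80 GB card size and the per-GPU concurrency rule as assumptions about how llm-cal computes this:

```python
H100_BYTES = 80e9   # 80 GB card, ~= 74.51 GiB
GIB = 1024**3

weights_gib = 148.66   # total weight size (binary GB), from the Weights table
kv_128k_gib = 2.05     # KV per request at 128K context, from the table above

for tier, gpus in (("min", 4), ("dev", 4), ("prod", 8)):
    usable = 0.9 * H100_BYTES / GIB           # --gpu-memory-utilization 0.9
    weight_per_gpu = weights_gib / gpus       # even tensor-parallel shard
    headroom = usable - weight_per_gpu
    concurrent = int(headroom / kv_128k_gib)  # assumed per-GPU rule
    print(f"{tier:>4}: {weight_per_gpu:.2f} GiB/GPU weights, "
          f"{headroom:.2f} GiB headroom, {concurrent} @ 128K")
```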
- Prefill latency: 735 ms @ 2,000 input tokens [estimated]
- Cluster decode throughput: 151 tok/s [estimated]
- Max concurrent users: 5
- Bottleneck: memory_bandwidth (see the roofline sketch below)
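A memory-bandwidth bottleneck means each decoded token must stream the active weights plus KV from HBM, so a roofline bound is effective bandwidth divided by bytes touched per token. A rough sketch of that reasoning; the bandwidth efficiency and active-byte figures are illustrative assumptions, not the exact inputs llm-cal used, so the output will not match 151 tok/s exactly:

```python
# Roofline-style decode bound: tokens/s <= effective bandwidth / bytes per token.
HBM_BW = 3.35e12     # H100 SXM HBM3 bandwidth, bytes/s
GPUS = 4
EFFICIENCY = 0.6     # assumed achievable fraction of peak bandwidth

# Assumed bytes read per decode step: the router activates only top-6 of
# 256 routed experts (+1 shared), so far fewer bytes than the full
# ~160 GB checkpoint move per token. This figure is an illustrative guess.
active_bytes_per_token = 30e9

bound = GPUS * HBM_BW * EFFICIENCY / active_bytes_per_token
print(f"decode bound ~= {bound:.0f} tok/s")
```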
Generated command
```
vllm serve deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 4 \
  --max-model-len 1048576 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --attention-backend auto
```
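Once the server is up, it exposes vLLM's OpenAI-compatible API (on port 8000 by default for `vllm serve`). A minimal smoke test:

```python
import requests

# Query vLLM's OpenAI-compatible completions endpoint.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "deepseek-ai/DeepSeek-V4-Flash",
        "prompt": "The capital of France is",
        "max_tokens": 16,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["text"])
```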
Generated by: `llm-cal deepseek-ai/DeepSeek-V4-Flash --gpu H100 --engine vllm --lang en`