deepseek-ai/DeepSeek-V4-Flash on B200

How many B200 GPUs are needed to run deepseek-ai/DeepSeek-V4-Flash.

Architecture

| Field | Value |
| --- | --- |
| model_type | deepseek_v4 |
| attention | CSA_HCA (heads=64, kv_heads=1, head_dim=512) |
| moe | 256 routed + 1 shared experts, top-6 routing |
| sliding_window | 128 |
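
For reference, a minimal sketch of pulling these fields from the model config with Hugging Face transformers. `AutoConfig.from_pretrained` with `trust_remote_code` is the standard API; the expert-count key names are assumptions modeled on earlier DeepSeek releases and may differ in DeepSeek-V4-Flash.

```python
# Sketch: reading the architecture fields above from the model config.
# Expert-count key names are assumptions based on prior DeepSeek configs.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "deepseek-ai/DeepSeek-V4-Flash",
    trust_remote_code=True,  # custom model_type "deepseek_v4"
)

print(cfg.model_type)                             # expected: deepseek_v4
print(getattr(cfg, "num_attention_heads", None))  # 64
print(getattr(cfg, "n_routed_experts", None))     # 256 (assumed key name)
print(getattr(cfg, "num_experts_per_tok", None))  # 6   (assumed key name)
```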

Weights

| Field | Value | Label |
| --- | --- | --- |
| safetensors bytes | 148.66 GB | [verified] |
| params | 290.9B | [estimated] |
| quantization | FP4_FP8_MIXED | [verified] |

Quantization reconciliation

| Scheme | Predicted | Δ | Error |
| --- | --- | --- | --- |
| FP4_FP8_MIXED ✓ | 149.02 GB | 378.76 MB under | 0.2% |
| GPTQ_INT4 | 149.02 GB | 378.76 MB under | 0.2% |
| AWQ_INT4 | 149.02 GB | 378.76 MB under | 0.2% |
| INT4 | 135.48 GB | 13.18 GB over | 9.7% |
| FP8 | 270.95 GB | 122.30 GB under | 45.1% |

Best: FP4_FP8_MIXED. The safetensors header shows F8_E8M0 scale tensors, 768 packed-I8 (FP4) weight tensors, and 9 FP8 weight tensors, consistent with an MX block-scaled mixed pack (predicted 160,014,306,918 bytes, 0.2% error).
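
The reconciliation is straightforward to reproduce in outline: predict on-disk bytes for each scheme from the estimated parameter count and an effective bits-per-parameter figure, then score each scheme by its deviation from the verified safetensors size. A minimal sketch follows; the bits-per-parameter values are illustrative assumptions (4-bit block-scaled packs carry scale overhead above 4.0 bits), not llm-cal's exact accounting, and the report's "GB" figures appear to be binary gigabytes (160,014,306,918 bytes ÷ 2³⁰ = 149.02).

```python
# Score candidate quantization schemes against the verified on-disk size.
# bits/param values are illustrative assumptions, not llm-cal's accounting.
ACTUAL_BYTES = 148.66 * 2**30  # verified safetensors size (GiB-style "GB")
PARAMS = 290.9e9               # estimated parameter count

BITS_PER_PARAM = {
    "FP4_FP8_MIXED": 4.40,  # FP4 weights + FP8 tensors + E8M0 block scales
    "INT4": 4.00,           # raw 4-bit packing, no scale overhead
    "FP8": 8.00,
}

for scheme, bits in BITS_PER_PARAM.items():
    predicted = PARAMS * bits / 8
    # The table's Error column is relative to the predicted size.
    error = abs(predicted - ACTUAL_BYTES) / predicted
    print(f"{scheme:>14}: {predicted / 2**30:7.2f} GB  error {error:5.1%}")
```

Run as written, this reproduces the table's ordering: roughly 0.2% for the mixed FP4/FP8 pack, 9.7% for raw INT4, and 45% for FP8.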

KV cache per request

| Context tokens | KV bytes |
| --- | --- |
| 4,096 | 65.72 MB |
| 32,768 | 525.77 MB |
| 131,072 | 2.05 GB |
| 1,048,576 | 16.43 GB |

Deployment tiers

| Tier | GPUs | Weight/GPU | Headroom/GPU | Concurrent @ 128K |
| --- | --- | --- | --- | --- |
| min | 1 | 148.66 GB | 12.28 GB | 5 |
| dev ★ | 2 | 74.33 GB | 86.61 GB | 42 |
| prod | 2 | 74.33 GB | 86.61 GB | 42 |
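
Two relationships in the tables above are worth making explicit: per-request KV grows linearly with context, and each tier's concurrency matches floor(headroom per GPU ÷ KV per request). The sketch below backs the per-token KV cost out of the 1M-token row rather than deriving it, because the true figure depends on layer count and on how the sliding-window layers cap their cache, which llm-cal accounts for and this sketch does not.

```python
# Reproduce the KV-cache and concurrency rows from a single per-token cost.
KV_PER_TOKEN = 16.43 * 2**30 / 1_048_576  # ≈ 16.8 KB/token, from the 1M row

def kv_bytes(context_tokens: int) -> float:
    """Per-request KV cache size; linear in context length."""
    return KV_PER_TOKEN * context_tokens

def concurrent_at(context_tokens: int, headroom_per_gpu_gb: float) -> int:
    """Requests that fit: floor(per-GPU headroom / per-request KV)."""
    return int(headroom_per_gpu_gb * 2**30 // kv_bytes(context_tokens))

print(f"{kv_bytes(4_096) / 2**20:.2f} MB")  # ≈ 65.72 MB, matches the table
print(concurrent_at(131_072, 12.28))        # min tier      -> 5
print(concurrent_at(131_072, 86.61))        # dev/prod tier -> 42
```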

Performance

  • Prefill latency: 647 ms @ 2,000 input tokens [estimated]
  • Cluster decode throughput: 90 tok/s [estimated]
  • Max concurrent users: 3
  • Bottleneck: memory bandwidth
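
The bandwidth bottleneck admits a simple roofline reading: each decoded token must stream the active weights (top-6 of 256 routed experts plus the shared expert, a small fraction of the 290.9B total) and the KV cache from HBM, so per-GPU throughput is bounded by bandwidth over bytes moved. One plausible reading of the 3-user figure is the 90 tok/s cluster estimate split across users at interactive speeds, distinct from the 42-request memory-capacity ceiling above. The sketch below uses assumed values for B200 bandwidth and active-weight bytes; llm-cal's 90 tok/s estimate includes latency and efficiency factors this ignores.

```python
# Roofline bound for memory-bandwidth-limited decode. HBM_BW and
# ACTIVE_BYTES are illustrative assumptions, not measured values.
HBM_BW = 8.0e12          # B200 HBM3e bandwidth, ~8 TB/s (assumed)
ACTIVE_BYTES = 20e9      # active MoE weights/token at ~4.4 bits/param (assumed)
KV_BYTES = 2.05 * 2**30  # KV cache read per token at 128K context (from table)

ceiling = HBM_BW / (ACTIVE_BYTES + KV_BYTES)  # upper bound only
print(f"roofline ceiling: ~{ceiling:.0f} tok/s per GPU")  # real decode is far lower
```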

Generated command

```bash
vllm serve deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 2 \
  --max-model-len 1048576 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --attention-backend auto
```
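
Once the server is up, vLLM exposes an OpenAI-compatible API, by default on port 8000. A minimal client check (assuming the default host and port; adjust `base_url` if you pass `--port`):

```python
# Query the vLLM server launched by the command above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```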

Generated by:

```bash
llm-cal deepseek-ai/DeepSeek-V4-Flash --gpu B200 --engine vllm --lang en
```