deepseek-ai/DeepSeek-V3 on B200

How many B200 GPUs to run deepseek-ai/DeepSeek-V3.

Architecture

| Field | Value |
|---|---|
| model_type | deepseek_v3 |
| attention | MLA (heads=128, kv_heads=128, hd=56) |
| moe | 256 routed + 1 shared, top-8 |
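As a hedged sketch, the fields above would typically be read from the model's config.json; the key names below follow the Hugging Face DeepSeek-V3 config conventions and are assumptions, since this report does not show the raw file.

```python
# Assumed config.json keys behind the architecture table above
# (names follow Hugging Face DeepSeek-V3 config conventions).
import json

config = {
    "model_type": "deepseek_v3",
    "num_attention_heads": 128,   # attention heads
    "num_key_value_heads": 128,   # KV heads
    "n_routed_experts": 256,      # routed MoE experts
    "n_shared_experts": 1,        # shared expert
    "num_experts_per_tok": 8,     # top-8 routing
}
print(json.dumps(config, indent=2))
```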

Weights

| Field | Value | Label |
|---|---|---|
| safetensors bytes | 641.30 GB | [verified] |
| params | 695.7B | [estimated] |
| quantization | FP8 | [verified] |

Quantization reconciliation

| Scheme | Predicted | Δ | Error |
|---|---|---|---|
| FP8 ✓ | 647.96 GB | 6.66 GB under | 1.0% |
| INT8 | 647.96 GB | 6.66 GB under | 1.0% |
| FP16 | 1.27 TB | 654.62 GB under | 50.5% |
| BF16 | 1.27 TB | 654.62 GB under | 50.5% |
| FP4_FP8_MIXED | 356.38 GB | 284.92 GB over | 79.9% |

Best: FP8 — config.json quantization_config.quant_method=fp8 (predicts 695,742,322,688 bytes, 1.0% error)
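The reconciliation above is consistent with a simple size model: predicted checkpoint bytes = parameter count × bytes per parameter, compared against the actual safetensors size. A minimal sketch using the numbers from this report (the exact formula the tool uses is an assumption):

```python
# Quantization reconciliation sketch, using figures from this report.
# Predicted size = params x bytes/param; error is relative to the prediction.
ACTUAL_BYTES = 641.30 * 1024**3     # safetensors size, 641.30 GiB
PARAMS = 695_742_322_688            # ~695.7B parameters

def predicted_bytes(bytes_per_param: float) -> float:
    """Predicted checkpoint size for a uniform quantization scheme."""
    return PARAMS * bytes_per_param

for scheme, bpp in (("FP8", 1.0), ("BF16", 2.0)):
    pred = predicted_bytes(bpp)
    err = abs(pred - ACTUAL_BYTES) / pred
    print(f"{scheme}: predicted {pred / 1024**3:.2f} GiB, error {err:.1%}")
```

With one byte per parameter, FP8 predicts ~647.96 GiB against the observed 641.30 GiB, matching the ~1.0% error in the table; BF16 at two bytes overshoots by ~50.5%.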

KV cache per request

| Context tokens | KV bytes |
|---|---|
| 4,096 | 244.00 MB |
| 32,768 | 1.91 GB |
| 131,072 | 7.62 GB |
| 163,840 | 9.53 GB |
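The per-request figures above are consistent with an MLA compressed KV cache of 512 latent dims per token per layer, over 61 layers, at 2 bytes per element. These constants are inferred from the numbers, not confirmed by the tool's source:

```python
# Hedged sketch of the KV-cache math above. The constants are assumptions
# inferred from the table: MLA compressed latent (kv_lora_rank) of 512,
# 61 decoder layers, BF16 (2 bytes) cache elements.
KV_LORA_RANK = 512
NUM_LAYERS = 61
BYTES_PER_ELEM = 2

def kv_bytes(context_tokens: int) -> int:
    """KV-cache bytes for one request at a given context length."""
    return context_tokens * KV_LORA_RANK * NUM_LAYERS * BYTES_PER_ELEM

for n in (4_096, 32_768, 131_072, 163_840):
    print(f"{n:>7} tokens -> {kv_bytes(n) / 1024**2:,.2f} MiB")
```

At 4,096 tokens this gives exactly 244.00 MiB, and 7.62 GiB at 131,072, matching the table.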

Deployment tiers

| Tier | GPUs | Weight/GPU | Headroom/GPU | Concurrent @ 128K |
|---|---|---|---|---|
| min | 8 | 80.16 GB | 80.77 GB | 84 |
| dev ★ | 8 | 80.16 GB | 80.77 GB | 84 |
| prod | 8 | 80.16 GB | 80.77 GB | 84 |
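The tier math follows directly from the tables in this report: weights are sharded across 8 GPUs (tensor parallel), and concurrency at 128K context is total headroom divided by the 7.62 GiB per-request KV cost. A minimal sketch:

```python
# Tier arithmetic sketch, using values read from this report's tables.
NUM_GPUS = 8
WEIGHTS_GIB = 641.30          # total FP8 checkpoint size
HEADROOM_PER_GPU_GIB = 80.77  # free HBM after weights (tier table)
KV_PER_REQ_GIB = 7.62         # per-request KV cache at 131,072 tokens

weight_per_gpu = WEIGHTS_GIB / NUM_GPUS
concurrent = int(NUM_GPUS * HEADROOM_PER_GPU_GIB // KV_PER_REQ_GIB)
print(f"weight/GPU = {weight_per_gpu:.2f} GiB, concurrent @ 128K = {concurrent}")
```

This reproduces the 80.16 GiB weight shard and the 84 concurrent 128K requests shown above.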

Performance

  • Prefill latency: 387 ms @ 2,000 input tokens [estimated]
  • Cluster decode throughput: 335 tok/s [estimated]
  • Max concurrent users: 11
  • Bottleneck: memory_bandwidth
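As an illustration of the memory_bandwidth bottleneck, at batch size 1 decode speed is capped by how fast each GPU can stream its weight shard from HBM. The ~8 TB/s B200 bandwidth below is an assumption, and this sketch ignores batching and MoE routing, both of which the report's 335 tok/s cluster estimate presumably folds in:

```python
# Roofline sketch of the decode bottleneck: one token per full read of the
# per-GPU weight shard. HBM bandwidth is an assumed ~8 TB/s for B200; the
# shard size comes from the tier table above.
HBM_BW_BYTES_PER_S = 8.0e12       # assumed B200 HBM bandwidth
SHARD_BYTES = 80.16 * 1024**3     # FP8 weight shard per GPU

tok_per_s_ceiling = HBM_BW_BYTES_PER_S / SHARD_BYTES
print(f"batch-1 decode ceiling: {tok_per_s_ceiling:.0f} tok/s per replica")
```

This batch-1 ceiling is much lower than the 335 tok/s cluster estimate, which is why serving throughput depends on batching many concurrent requests.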

Generated command

vllm serve deepseek-ai/DeepSeek-V3 \
  --tensor-parallel-size 8 \
  --max-model-len 163840 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9

Generated by:

llm-cal deepseek-ai/DeepSeek-V3 --gpu B200 --engine vllm --lang en