deepseek-ai/DeepSeek-V3 on B200

How many B200 GPUs to run deepseek-ai/DeepSeek-V3.

Architecture

| Field | Value |
|---|---|
| model_type | deepseek_v3 |
| attention | MLA (heads=128, kv_heads=128, hd=56) |
| moe | 256 routed + 1 shared, top-8 |
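As a hedged sketch, the fields above would typically be read from the model's config.json; the key names below follow the Hugging Face DeepSeek-V3 config conventions and are assumptions, since this report does not show the raw file.

```python
# Assumed config.json keys behind the architecture table above
# (names follow Hugging Face DeepSeek-V3 config conventions).
import json

config = {
    "model_type": "deepseek_v3",
    "num_attention_heads": 128,   # attention heads
    "num_key_value_heads": 128,   # KV heads
    "n_routed_experts": 256,      # routed MoE experts
    "n_shared_experts": 1,        # shared expert
    "num_experts_per_tok": 8,     # top-8 routing
}
print(json.dumps(config, indent=2))
```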

Weights

| Field | Value | Label |
|---|---|---|
| safetensors bytes | 641.30 GB | [verified] |
| params | 695.7B | [estimated] |
| quantization | FP8 | [verified] |

Quantization reconciliation

| Scheme | Predicted | Δ | Error |
|---|---|---|---|
| FP8 ✓ | 647.96 GB | 6.66 GB under | 1.0% |
| INT8 | 647.96 GB | 6.66 GB under | 1.0% |
| FP16 | 1.27 TB | 654.62 GB under | 50.5% |
| BF16 | 1.27 TB | 654.62 GB under | 50.5% |
| FP4_FP8_MIXED | 356.38 GB | 284.92 GB over | 79.9% |

Best: FP8 — config.json quantization_config.quant_method=fp8 (predicts 695,742,322,688 bytes, 1.0% error)
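The reconciliation above is consistent with a simple size model: predicted checkpoint bytes = parameter count × bytes per parameter, compared against the actual safetensors size. A minimal sketch using the numbers from this report (the exact formula the tool uses is an assumption):

```python
# Quantization reconciliation sketch, using figures from this report.
# Predicted size = params x bytes/param; error is relative to the prediction.
ACTUAL_BYTES = 641.30 * 1024**3     # safetensors size, 641.30 GiB
PARAMS = 695_742_322_688            # ~695.7B parameters

def predicted_bytes(bytes_per_param: float) -> float:
    """Predicted checkpoint size for a uniform quantization scheme."""
    return PARAMS * bytes_per_param

for scheme, bpp in (("FP8", 1.0), ("BF16", 2.0)):
    pred = predicted_bytes(bpp)
    err = abs(pred - ACTUAL_BYTES) / pred
    print(f"{scheme}: predicted {pred / 1024**3:.2f} GiB, error {err:.1%}")
```

With one byte per parameter, FP8 predicts ~647.96 GiB against the observed 641.30 GiB, matching the ~1.0% error in the table; BF16 at two bytes overshoots by ~50.5%.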

KV cache per request

| Context tokens | KV bytes |
|---|---|
| 4,096 | 244.00 MB |
| 32,768 | 1.91 GB |
| 131,072 | 7.62 GB |
| 163,840 | 9.53 GB |
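The per-request figures above are consistent with an MLA compressed KV cache of 512 latent dims per token per layer, over 61 layers, at 2 bytes per element. These constants are inferred from the numbers, not confirmed by the tool's source:

```python
# Hedged sketch of the KV-cache math above. The constants are assumptions
# inferred from the table: MLA compressed latent (kv_lora_rank) of 512,
# 61 decoder layers, BF16 (2 bytes) cache elements.
KV_LORA_RANK = 512
NUM_LAYERS = 61
BYTES_PER_ELEM = 2

def kv_bytes(context_tokens: int) -> int:
    """KV-cache bytes for one request at a given context length."""
    return context_tokens * KV_LORA_RANK * NUM_LAYERS * BYTES_PER_ELEM

for n in (4_096, 32_768, 131_072, 163_840):
    print(f"{n:>7} tokens -> {kv_bytes(n) / 1024**2:,.2f} MiB")
```

At 4,096 tokens this gives exactly 244.00 MiB, and 7.62 GiB at 131,072, matching the table.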

Deployment tiers

| Tier | GPUs | Weight/GPU | Headroom/GPU | Concurrent @ 128K |
|---|---|---|---|---|
| min | 8 | 80.16 GB | 80.77 GB | 84 |
| dev ★ | 8 | 80.16 GB | 80.77 GB | 84 |
| prod | 8 | 80.16 GB | 80.77 GB | 84 |
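The tier math follows directly from the tables in this report: weights are sharded across 8 GPUs (tensor parallel), and concurrency at 128K context is total headroom divided by the 7.62 GiB per-request KV cost. A minimal sketch:

```python
# Tier arithmetic sketch, using values read from this report's tables.
NUM_GPUS = 8
WEIGHTS_GIB = 641.30          # total FP8 checkpoint size
HEADROOM_PER_GPU_GIB = 80.77  # free HBM after weights (tier table)
KV_PER_REQ_GIB = 7.62         # per-request KV cache at 131,072 tokens

weight_per_gpu = WEIGHTS_GIB / NUM_GPUS
concurrent = int(NUM_GPUS * HEADROOM_PER_GPU_GIB // KV_PER_REQ_GIB)
print(f"weight/GPU = {weight_per_gpu:.2f} GiB, concurrent @ 128K = {concurrent}")
```

This reproduces the 80.16 GiB weight shard and the 84 concurrent 128K requests shown above.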

Performance

  • Prefill latency: 387 ms @ 2,000 input tokens [estimated]
  • Cluster decode throughput: 335 tok/s [estimated]
  • Max concurrent users: 11
  • Bottleneck: memory_bandwidth
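As an illustration of the memory_bandwidth bottleneck, at batch size 1 decode speed is capped by how fast each GPU can stream its weight shard from HBM. The ~8 TB/s B200 bandwidth below is an assumption, and this sketch ignores batching and MoE routing, both of which the report's 335 tok/s cluster estimate presumably folds in:

```python
# Roofline sketch of the decode bottleneck: one token per full read of the
# per-GPU weight shard. HBM bandwidth is an assumed ~8 TB/s for B200; the
# shard size comes from the tier table above.
HBM_BW_BYTES_PER_S = 8.0e12       # assumed B200 HBM bandwidth
SHARD_BYTES = 80.16 * 1024**3     # FP8 weight shard per GPU

tok_per_s_ceiling = HBM_BW_BYTES_PER_S / SHARD_BYTES
print(f"batch-1 decode ceiling: {tok_per_s_ceiling:.0f} tok/s per replica")
```

This batch-1 ceiling is much lower than the 335 tok/s cluster estimate, which is why serving throughput depends on batching many concurrent requests.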

Generated command

vllm serve deepseek-ai/DeepSeek-V3 \
  --tensor-parallel-size 8 \
  --max-model-len 163840 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9

Generated by:

llm-cal deepseek-ai/DeepSeek-V3 --gpu B200 --engine vllm --lang en