# deepseek-ai/DeepSeek-V3 on H100

How many H100 GPUs it takes to serve deepseek-ai/DeepSeek-V3.
## Architecture

| Field | Value |
|---|---|
| model_type | deepseek_v3 |
| attention | MLA (heads=128, kv_heads=128, hd=56) |
| moe | 256 routed + 1 shared, top-8 |
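These fields come straight from the model's `config.json` on the Hugging Face Hub. Below is a minimal sketch for pulling them yourself, assuming `huggingface_hub` is installed; the key names are the ones DeepSeek-V3's config uses, and other model families may name them differently.

```python
import json

from huggingface_hub import hf_hub_download

# Fetch config.json from the Hub (cached locally after the first call).
path = hf_hub_download(repo_id="deepseek-ai/DeepSeek-V3", filename="config.json")
with open(path) as f:
    cfg = json.load(f)

# Fields the Architecture and Weights tables are read from.
print(cfg["model_type"])                                       # deepseek_v3
print(cfg["num_attention_heads"], cfg["num_key_value_heads"])  # 128, 128
print(cfg["n_routed_experts"], cfg["n_shared_experts"])        # 256 routed + 1 shared
print(cfg["num_experts_per_tok"])                              # top-8 routing
print(cfg["quantization_config"]["quant_method"])              # fp8
```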
## Weights

| Field | Value | Label |
|---|---|---|
| safetensors bytes | 641.30 GiB | [verified] |
| params | 695.7B | [estimated] |
| quantization | FP8 | [verified] |
## Quantization reconciliation

| Scheme | Predicted | Δ (actual vs predicted) | Error |
|---|---|---|---|
| FP8 ✓ | 647.96 GiB | 6.66 GiB under | 1.0% |
| INT8 | 647.96 GiB | 6.66 GiB under | 1.0% |
| FP16 | 1.27 TiB | 654.62 GiB under | 50.5% |
| BF16 | 1.27 TiB | 654.62 GiB under | 50.5% |
| FP4_FP8_MIXED | 356.38 GiB | 284.92 GiB over | 79.9% |

Best match: FP8, per config.json `quantization_config.quant_method = "fp8"` (predicts 695,742,322,688 bytes, 1.0% error).
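The reconciliation itself is simple arithmetic: multiply the estimated parameter count by a candidate bytes-per-parameter figure, then compare against the verified safetensors total. A sketch of that loop follows; the 0.55 bytes/param for FP4_FP8_MIXED is the blend implied by the table above, not a verified figure. Note that FP8 and INT8 both cost one byte per parameter, which is why they tie at 1.0% and the config's `quant_method` field is needed to break the tie.

```python
GIB = 1024**3
PARAMS = 695_742_322_688          # estimated parameter count (~695.7B)
ACTUAL = 641.30 * GIB             # verified safetensors bytes

# Assumed storage cost per parameter for each candidate scheme.
BYTES_PER_PARAM = {
    "FP8": 1.0,
    "INT8": 1.0,
    "FP16": 2.0,
    "BF16": 2.0,
    "FP4_FP8_MIXED": 0.55,        # assumed blend: mostly 4-bit, some FP8 tensors
}

for scheme, bpp in BYTES_PER_PARAM.items():
    predicted = PARAMS * bpp
    delta = ACTUAL - predicted                  # actual vs predicted
    error = abs(delta) / predicted              # relative to the prediction
    print(f"{scheme:<14} {predicted / GIB:8.2f} GiB  "
          f"{delta / GIB:+8.2f} GiB  {error:.1%}")
```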
## KV cache per request

| Context tokens | KV bytes |
|---|---|
| 4,096 | 244.00 MiB |
| 32,768 | 1.91 GiB |
| 131,072 | 7.62 GiB |
| 163,840 | 9.53 GiB |
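The per-request figures grow linearly with context, at about 62,464 bytes per token. That is consistent with caching only the MLA latent KV in BF16: `num_hidden_layers (61) × kv_lora_rank (512) × 2 bytes`, with the small decoupled RoPE key apparently not counted. The layer count and rank are DeepSeek-V3 config values; treat the sketch below as a reconstruction of the table, not the tool's actual formula.

```python
# Reconstruction of the KV-cache table above. Assumptions: the MLA latent KV
# (kv_lora_rank = 512, from config.json) is cached in BF16 (2 bytes) for all
# 61 layers, and the decoupled RoPE key component is not counted.
NUM_LAYERS = 61
KV_LORA_RANK = 512
BYTES_PER_ELEM = 2  # BF16

per_token = NUM_LAYERS * KV_LORA_RANK * BYTES_PER_ELEM  # 62,464 bytes/token

for ctx in (4_096, 32_768, 131_072, 163_840):
    total = ctx * per_token
    print(f"{ctx:>7,} tokens -> {total / 1024**2:9,.2f} MiB "
          f"({total / 1024**3:.2f} GiB)")
```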
## Recommended fleet

| Tier | GPUs | Weight/GPU | Headroom/GPU | Concurrent @ 128K |
|---|---|---|---|---|
| min | 8 | 80.16 GiB | 0 B | 0 |
| dev | 8 | 80.16 GiB | 0 B | 0 |
| prod ★ | 8 | 80.16 GiB | 0 B | 0 |
- Prefill latency: 879 ms @ 2,000 input tokens [estimated]
- Cluster decode throughput: 140 tok/s [estimated]
- Max concurrent users: 0
- Bottleneck: memory_capacity (see the sketch below)
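The zeros follow directly from numbers already on the page: 641.30 GiB of weights sharded across 8 GPUs is 80.16 GiB per card, which already exceeds the nominal 80 per H100 the tool compares against, so headroom clamps to zero and no 128K request (7.62 GiB of KV each) can be admitted; hence the memory_capacity bottleneck. A sketch of that arithmetic, assuming even tensor-parallel sharding:

```python
import math

WEIGHTS_GIB = 641.30   # verified safetensors total
NUM_GPUS = 8           # one H100 node, TP=8
GPU_MEM = 80.0         # nominal per-card capacity the tool appears to use
KV_128K_GIB = 7.62     # per-request KV at 131,072 tokens (table above)

weight_per_gpu = WEIGHTS_GIB / NUM_GPUS                 # 80.16 GiB
headroom = max(0.0, GPU_MEM - weight_per_gpu)           # clamps to 0
concurrent = math.floor(NUM_GPUS * headroom / KV_128K_GIB)

print(f"{weight_per_gpu:.2f} GiB/GPU, {headroom:.2f} GiB headroom, "
      f"{concurrent} concurrent @ 128K")                # 80.16, 0.00, 0
```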
## Generated command

```bash
vllm serve deepseek-ai/DeepSeek-V3 \
  --tensor-parallel-size 8 \
  --max-model-len 163840 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9
```
Generated by: `llm-cal deepseek-ai/DeepSeek-V3 --gpu H100 --engine vllm --lang en`