deepseek-ai/DeepSeek-V3 on H100

How many H100 GPUs are needed to run deepseek-ai/DeepSeek-V3.

Architecture

Field        Value
model_type   deepseek_v3
attention    MLA (heads=128, kv_heads=128, hd=56)
moe          256 routed + 1 shared, top-8
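
All of the fields above come straight from the repository's config.json. A minimal sketch of pulling them with huggingface_hub (the keys shown are the standard DeepSeek-V3 config keys; hd is derived here as hidden_size divided by the head count):

import json
from huggingface_hub import hf_hub_download

# Fetch only config.json (a few KB), not the multi-hundred-GB weights.
path = hf_hub_download("deepseek-ai/DeepSeek-V3", "config.json")
with open(path) as f:
    cfg = json.load(f)

print(cfg["model_type"])                                  # deepseek_v3
print(cfg["num_attention_heads"],                         # 128 attention heads
      cfg["num_key_value_heads"],                         # 128 KV heads
      cfg["hidden_size"] // cfg["num_attention_heads"])   # 56 = "hd" above
print(cfg["n_routed_experts"],                            # 256 routed experts
      cfg["n_shared_experts"],                            # 1 shared expert
      cfg["num_experts_per_tok"])                         # top-8 routing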

Weights

Field              Value       Label
safetensors bytes  641.30 GB   [verified]
params             695.7B      [estimated]
quantization       FP8         [verified]

Quantization reconciliation

Scheme          Predicted   Δ (actual vs. predicted)   Error
FP8 ✓           647.96 GB   6.66 GB under              1.0%
INT8            647.96 GB   6.66 GB under              1.0%
FP16            1.27 TB     654.62 GB under            50.5%
BF16            1.27 TB     654.62 GB under            50.5%
FP4_FP8_MIXED   356.38 GB   284.92 GB over             79.9%

Best: FP8 — config.json quantization_config.quant_method=fp8 (predicts 695,742,322,688 bytes, 1.0% error)
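
The reconciliation is simple arithmetic: each scheme predicts a checkpoint size as parameter count times bytes per parameter, and the scheme whose prediction lands closest to the measured safetensors total wins. The sketch below reproduces the FP8/INT8 and FP16/BF16 rows (it ignores per-tensor scale overhead and mixed-precision layers, so treat it as illustrative):

# Reconciliation sketch: predicted bytes = params * bytes/param,
# compared against the measured safetensors total reported above.
GiB = 2**30                        # the report's "GB" figures are binary (GiB)

params = 695_742_322_688           # ~695.7B parameters
actual_bytes = 641.30 * GiB        # safetensors total

schemes = {"FP8": 1, "INT8": 1, "FP16": 2, "BF16": 2}
for name, bytes_per_param in schemes.items():
    predicted = params * bytes_per_param
    error = abs(actual_bytes - predicted) / predicted
    print(f"{name:4s}  predicted {predicted / GiB:8.2f} GiB  error {error:5.1%}")
# FP8/INT8 land within ~1% of the actual size, matching the declared
# quantization_config.quant_method = "fp8"; FP16/BF16 are ~50% off.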

KV cache per request

Context tokens   KV bytes
4,096            244.00 MB
32,768           1.91 GB
131,072          7.62 GB
163,840          9.53 GB
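
The per-request figures match an MLA compressed-KV accounting: one latent vector of kv_lora_rank = 512 elements per token per layer, at 2 bytes per element, across 61 layers. The formula below is an assumption that happens to reproduce the table exactly; kv_lora_rank and num_hidden_layers are the DeepSeek-V3 config values, and caching the decoupled RoPE key (qk_rope_head_dim = 64) separately would add a little on top:

# MLA compressed-KV sketch (assumed formula; reproduces the table above).
KV_LORA_RANK   = 512      # config.json: kv_lora_rank
NUM_LAYERS     = 61       # config.json: num_hidden_layers
BYTES_PER_ELEM = 2        # assume 16-bit KV cache entries

def kv_bytes(context_tokens: int) -> int:
    # One compressed latent vector per token per layer.
    return context_tokens * NUM_LAYERS * KV_LORA_RANK * BYTES_PER_ELEM

for ctx in (4_096, 32_768, 131_072, 163_840):
    print(f"{ctx:>7,} tokens -> {kv_bytes(ctx) / 2**30:5.2f} GiB")
# 4,096   -> 0.24 GiB (244 MB)    131,072 -> 7.62 GiB
# 32,768  -> 1.91 GiB             163,840 -> 9.53 GiB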

GPU sizing

Tier     GPUs   Weight/GPU   Headroom/GPU   Concurrent @ 128K
min      8      80.16 GB     0 B            0
dev      8      80.16 GB     0 B            0
prod ★   8      80.16 GB     0 B            0
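
The zeros follow directly from per-GPU arithmetic: 641.30 GB of FP8 weights split eight ways is 80.16 GB per GPU, which already exceeds an H100's 80 GB of HBM, so nothing is left for KV cache at any context length. A minimal sketch of that accounting (the 0.9 factor mirrors --gpu-memory-utilization; the exact bookkeeping the tool uses is an assumption):

# Per-GPU memory accounting sketch for the 8x H100 tiers (assumed model).
GiB = 2**30

weights_total   = 641.30 * GiB     # safetensors bytes from the Weights table
num_gpus        = 8                # tensor-parallel degree
hbm_per_gpu     = 80 * GiB         # H100 SXM: 80 GB HBM3
utilization     = 0.9              # matches --gpu-memory-utilization 0.9
kv_per_req_128k = 7.62 * GiB       # KV cache @ 131,072 tokens (table above)

weight_per_gpu = weights_total / num_gpus                     # 80.16 GiB
usable_per_gpu = hbm_per_gpu * utilization                    # 72.00 GiB
headroom       = max(0.0, usable_per_gpu - weight_per_gpu)    # 0 B
concurrent     = int(headroom * num_gpus // kv_per_req_128k)  # 0 requests @ 128K

print(f"weight/GPU {weight_per_gpu / GiB:.2f} GiB, "
      f"headroom/GPU {headroom / GiB:.2f} GiB, "
      f"concurrent @ 128K: {concurrent}")

This is the memory_capacity bottleneck reported under Performance: the weights alone overflow 8 x 80 GB, independent of the KV-cache budget.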

Performance

  • Prefill latency: 879 ms @ 2,000 input tokens [estimated]
  • Cluster decode throughput: 140 tok/s [estimated]
  • Max concurrent users: 0
  • Bottleneck: memory_capacity

Generated command

vllm serve deepseek-ai/DeepSeek-V3 \
  --tensor-parallel-size 8 \
  --max-model-len 163840 \
  --gpu-memory-utilization 0.9 \
  --trust-remote-code
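
Once the server starts, vLLM exposes an OpenAI-compatible API on port 8000 by default. A minimal smoke test with the openai Python client (the base_url and the "EMPTY" api_key are the usual vLLM conventions; adjust if you pass --port or an API key):

# Quick check against the OpenAI-compatible endpoint served by vLLM.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)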

Generated by:

llm-cal deepseek-ai/DeepSeek-V3 --gpu H100 --engine vllm --lang en