deepseek-ai/DeepSeek-V4-Flash on B200

How many B200 GPUs are needed to run deepseek-ai/DeepSeek-V4-Flash.

Architecture

| Field | Value |
| --- | --- |
| model_type | deepseek_v4 |
| attention | CSA_HCA (heads=64, kv_heads=1, head_dim=512) |
| moe | 256 routed + 1 shared experts, top-6 routing |
| sliding_window | 128 |
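
For reference, a minimal sketch of pulling these fields from the model config with Hugging Face transformers. `AutoConfig.from_pretrained` with `trust_remote_code` is the standard API; the expert-count key names are assumptions modeled on earlier DeepSeek releases and may differ in DeepSeek-V4-Flash.

```python
# Sketch: reading the architecture fields above from the model config.
# Expert-count key names are assumptions based on prior DeepSeek configs.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "deepseek-ai/DeepSeek-V4-Flash",
    trust_remote_code=True,  # custom model_type "deepseek_v4"
)

print(cfg.model_type)                             # expected: deepseek_v4
print(getattr(cfg, "num_attention_heads", None))  # 64
print(getattr(cfg, "n_routed_experts", None))     # 256 (assumed key name)
print(getattr(cfg, "num_experts_per_tok", None))  # 6   (assumed key name)
```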

Weights

| Field | Value | Label |
| --- | --- | --- |
| safetensors bytes | 148.66 GB | [verified] |
| params | 290.9B | [estimated] |
| quantization | FP4_FP8_MIXED | [verified] |

Quantization reconciliation

| Scheme | Predicted | Δ | Error |
| --- | --- | --- | --- |
| FP4_FP8_MIXED ✓ | 149.02 GB | 378.76 MB under | 0.2% |
| GPTQ_INT4 | 149.02 GB | 378.76 MB under | 0.2% |
| AWQ_INT4 | 149.02 GB | 378.76 MB under | 0.2% |
| INT4 | 135.48 GB | 13.18 GB over | 9.7% |
| FP8 | 270.95 GB | 122.30 GB under | 45.1% |

Best: FP4_FP8_MIXED. The safetensors header shows F8_E8M0 scale tensors, 768 packed-I8 (FP4) weight tensors, and 9 FP8 weight tensors, consistent with an MX block-scaled mixed pack (predicted 160,014,306,918 bytes, 0.2% error).
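
The reconciliation is straightforward to reproduce in outline: predict on-disk bytes for each scheme from the estimated parameter count and an effective bits-per-parameter figure, then score each scheme by its deviation from the verified safetensors size. A minimal sketch follows; the bits-per-parameter values are illustrative assumptions (4-bit block-scaled packs carry scale overhead above 4.0 bits), not llm-cal's exact accounting, and the report's "GB" figures appear to be binary gigabytes (160,014,306,918 bytes ÷ 2³⁰ = 149.02).

```python
# Score candidate quantization schemes against the verified on-disk size.
# bits/param values are illustrative assumptions, not llm-cal's accounting.
ACTUAL_BYTES = 148.66 * 2**30  # verified safetensors size (GiB-style "GB")
PARAMS = 290.9e9               # estimated parameter count

BITS_PER_PARAM = {
    "FP4_FP8_MIXED": 4.40,  # FP4 weights + FP8 tensors + E8M0 block scales
    "INT4": 4.00,           # raw 4-bit packing, no scale overhead
    "FP8": 8.00,
}

for scheme, bits in BITS_PER_PARAM.items():
    predicted = PARAMS * bits / 8
    # The table's Error column is relative to the predicted size.
    error = abs(predicted - ACTUAL_BYTES) / predicted
    print(f"{scheme:>14}: {predicted / 2**30:7.2f} GB  error {error:5.1%}")
```

Run as written, this reproduces the table's ordering: roughly 0.2% for the mixed FP4/FP8 pack, 9.7% for raw INT4, and 45% for FP8.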

KV cache per request

| Context tokens | KV bytes |
| --- | --- |
| 4,096 | 65.72 MB |
| 32,768 | 525.77 MB |
| 131,072 | 2.05 GB |
| 1,048,576 | 16.43 GB |

Deployment tiers

| Tier | GPUs | Weight/GPU | Headroom/GPU | Concurrent @ 128K |
| --- | --- | --- | --- | --- |
| min | 1 | 148.66 GB | 12.28 GB | 5 |
| dev ★ | 2 | 74.33 GB | 86.61 GB | 42 |
| prod | 2 | 74.33 GB | 86.61 GB | 42 |
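
Two relationships in the tables above are worth making explicit: per-request KV grows linearly with context, and each tier's concurrency matches floor(headroom per GPU ÷ KV per request). The sketch below backs the per-token KV cost out of the 1M-token row rather than deriving it, because the true figure depends on layer count and on how the sliding-window layers cap their cache, which llm-cal accounts for and this sketch does not.

```python
# Reproduce the KV-cache and concurrency rows from a single per-token cost.
KV_PER_TOKEN = 16.43 * 2**30 / 1_048_576  # ≈ 16.8 KB/token, from the 1M row

def kv_bytes(context_tokens: int) -> float:
    """Per-request KV cache size; linear in context length."""
    return KV_PER_TOKEN * context_tokens

def concurrent_at(context_tokens: int, headroom_per_gpu_gb: float) -> int:
    """Requests that fit: floor(per-GPU headroom / per-request KV)."""
    return int(headroom_per_gpu_gb * 2**30 // kv_bytes(context_tokens))

print(f"{kv_bytes(4_096) / 2**20:.2f} MB")  # ≈ 65.72 MB, matches the table
print(concurrent_at(131_072, 12.28))        # min tier      -> 5
print(concurrent_at(131_072, 86.61))        # dev/prod tier -> 42
```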

Performance

  • Prefill latency: 647 ms @ 2,000 input tokens [estimated]
  • Cluster decode throughput: 90 tok/s [estimated]
  • Max concurrent users: 3
  • Bottleneck: memory bandwidth
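
The bandwidth bottleneck admits a simple roofline reading: each decoded token must stream the active weights (top-6 of 256 routed experts plus the shared expert, a small fraction of the 290.9B total) and the KV cache from HBM, so per-GPU throughput is bounded by bandwidth over bytes moved. One plausible reading of the 3-user figure is the 90 tok/s cluster estimate split across users at interactive speeds, distinct from the 42-request memory-capacity ceiling above. The sketch below uses assumed values for B200 bandwidth and active-weight bytes; llm-cal's 90 tok/s estimate includes latency and efficiency factors this ignores.

```python
# Roofline bound for memory-bandwidth-limited decode. HBM_BW and
# ACTIVE_BYTES are illustrative assumptions, not measured values.
HBM_BW = 8.0e12          # B200 HBM3e bandwidth, ~8 TB/s (assumed)
ACTIVE_BYTES = 20e9      # active MoE weights/token at ~4.4 bits/param (assumed)
KV_BYTES = 2.05 * 2**30  # KV cache read per token at 128K context (from table)

ceiling = HBM_BW / (ACTIVE_BYTES + KV_BYTES)  # upper bound only
print(f"roofline ceiling: ~{ceiling:.0f} tok/s per GPU")  # real decode is far lower
```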

Generated command

```bash
vllm serve deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 2 \
  --max-model-len 1048576 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --attention-backend auto
```
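
Once the server is up, vLLM exposes an OpenAI-compatible API, by default on port 8000. A minimal client check (assuming the default host and port; adjust `base_url` if you pass `--port`):

```python
# Query the vLLM server launched by the command above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```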

Generated by:

```bash
llm-cal deepseek-ai/DeepSeek-V4-Flash --gpu B200 --engine vllm --lang en
```