deepseek-ai/DeepSeek-V4-Flash on 910B4

How many GPUs does deepseek-ai/DeepSeek-V4-Flash need on a 910B4 cluster?
Architecture

| Field | Value |
|---|---|
| model_type | deepseek_v4 |
| attention | CSA_HCA (heads=64, kv_heads=1, hd=512) |
| moe | 256 routed + 1 shared, top-6 |
| sliding_window | 128 |
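The "256 routed + 1 shared, top-6" MoE line means each token is scored against 256 routed experts and dispatched to the 6 highest-scoring ones (the shared expert always fires). A minimal sketch of that top-k gating step, using NumPy and plain softmax normalization — illustrative only, not DeepSeek's actual router, which also applies bias terms and load-balancing logic:

```python
import numpy as np

def route_topk(logits: np.ndarray, k: int = 6) -> tuple[np.ndarray, np.ndarray]:
    """Pick the top-k routed experts per token and softmax-normalize their gates.

    logits: (tokens, n_experts) router scores. Simplified assumption: plain
    softmax over the selected logits, no bias or capacity constraints.
    """
    topk_idx = np.argsort(logits, axis=-1)[:, -k:]              # (tokens, k)
    topk_logits = np.take_along_axis(logits, topk_idx, axis=-1)
    gates = np.exp(topk_logits - topk_logits.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)                       # gates sum to 1
    return topk_idx, gates

rng = np.random.default_rng(0)
idx, gates = route_topk(rng.normal(size=(4, 256)), k=6)
assert idx.shape == (4, 6)
assert np.allclose(gates.sum(axis=-1), 1.0)
```

With 6 of 256 experts active per token, only a small fraction of the 290.9B parameters participates in any one forward pass, which is why the throughput numbers below are decode- and memory-bound rather than compute-bound.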
Weights

| Field | Value | Label |
|---|---|---|
| safetensors bytes | 148.66 GB | [verified] |
| parameter count | 290.9B | [estimated] |
| quantization scheme | FP4_FP8_MIXED | [verified] |
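A quick consistency check on the [estimated] parameter count: invert the verified checkpoint size under the detected FP4_FP8_MIXED pack. The effective bytes/param used here is a simplified assumption (4-bit payload plus one FP8 scale per 32-element MX block); the handful of full-FP8 tensors is ignored, so this lands slightly above the tool's 290.9B figure:

```python
# Simplified inversion: checkpoint bytes -> parameter count.
# Assumption: every weight costs 0.5 B (FP4 nibble) + 1/32 B (one FP8 scale
# per 32-element MX block). Full-FP8 tensors are not modeled.
GIB = 1024**3

def params_from_bytes(ckpt_bytes: float, bytes_per_param: float = 0.5 + 1 / 32) -> float:
    return ckpt_bytes / bytes_per_param

est = params_from_bytes(148.66 * GIB)
print(f"~{est / 1e9:.0f}B parameters")   # ~300B, in the ballpark of the 290.9B estimate
```

The small gap between ~300B and 290.9B is consistent with some tensors being stored at the heavier FP8 width rather than FP4.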
Quantization inversion

| Scheme | Predicted | Δ vs actual | Error |
|---|---|---|---|
| FP4_FP8_MIXED ✓ | 149.02 GB | 378.76 MB over | 0.2% |
| GPTQ_INT4 | 149.02 GB | 378.76 MB over | 0.2% |
| AWQ_INT4 | 149.02 GB | 378.76 MB over | 0.2% |
| INT4 | 135.48 GB | 13.18 GB under | 9.7% |
| FP8 | 270.95 GB | 122.30 GB over | 45.1% |
Best: FP4_FP8_MIXED — the safetensors header shows F8_E8M0 scale tensors + 768 packed-I8 (FP4) weights + 9 FP8 weights, i.e. an MX block-scaled mixed pack (predicted 160,014,306,918 bytes, 0.2% error).
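The inversion step itself is simple: predict the checkpoint size under each candidate scheme and keep the one with the smallest relative error. A minimal sketch with assumed per-parameter byte costs — not llm-cal's exact model, which also accounts for the 9 full-FP8 tensors (and which is why GPTQ/AWQ tie with the MX pack at this granularity):

```python
# Predict bytes under each scheme, compare to the actual file size, pick the
# closest. Byte costs are simplified assumptions, not llm-cal's internals.
PARAMS = 290.9e9
ACTUAL = 148.66 * 1024**3                 # reported safetensors bytes

SCHEMES = {
    "FP4_FP8_MIXED": 0.5 + 1 / 32,        # 4-bit payload + FP8 scale per 32-block
    "INT4":          0.5,                 # packed nibbles, no stored scales
    "FP8":           1.0,                 # one byte per weight
}

def invert(params: float, actual: float) -> str:
    errors = {name: abs(params * bpp - actual) / actual
              for name, bpp in SCHEMES.items()}
    return min(errors, key=errors.get)

print(invert(PARAMS, ACTUAL))             # FP4_FP8_MIXED
```

Even this crude model separates FP4-class packs (a few percent off) from plain INT4 and FP8 (~9% and ~45% off respectively, matching the table's ordering).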
KV cache (per request)

| Context tokens | KV bytes |
|---|---|
| 4,096 | 65.72 MB |
| 32,768 | 525.77 MB |
| 131,072 | 2.05 GB |
| 1,048,576 | 16.43 GB |
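The KV-cache table is linear in context length: a single per-token cost (~16.4 KiB, implied by the 4,096-token row) reproduces every row to within rounding. Sizes are per request, before any parallelism:

```python
# Verify the table's linear scaling from one implied per-token cost.
# Per-token cost is derived from the 4,096-token row, not stated by the tool.
MIB = 1024**2

def kv_bytes(tokens: int, per_token: float = 65.72 * MIB / 4096) -> float:
    return tokens * per_token

for tokens, table_mib in [(4096, 65.72), (32768, 525.77),
                          (131072, 2.05 * 1024), (1048576, 16.43 * 1024)]:
    assert abs(kv_bytes(tokens) - table_mib * MIB) < 0.005 * table_mib * MIB
```

Note that with kv_heads=1 and hd=512 the per-token cost is small for a 290.9B model — this is the compressed-KV design doing its job — yet a 1M-token request still needs 16.43 GB, which drives the cluster sizing below.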
Recommended clusters

| Tier | GPUs | Weights/GPU | Headroom/GPU | Concurrent @ 128K |
|---|---|---|---|---|
| min ★ | 8 | 18.58 GB | 8.24 GB | 4 |
| dev | 8 | 18.58 GB | 8.24 GB | 4 |
| prod | 8 | 18.58 GB | 8.24 GB | 4 |
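Why only 4 concurrent 128K requests when the cluster has ~66 GB of total headroom: with a single KV head (kv_heads=1), the KV cache cannot be sharded across tensor-parallel ranks, so each GPU plausibly holds a full copy and the bound is per-GPU headroom, not the cluster total. That replication model is an assumption here, but it reproduces the table exactly:

```python
import math

# Assumed model: KV cache replicated on every TP rank (kv_heads=1 cannot be
# split), so concurrency = floor(per-GPU headroom / per-request KV at 128K).
headroom_per_gpu_gb = 8.24        # from the cluster table
kv_per_request_128k_gb = 2.05     # from the KV-cache table

concurrent = math.floor(headroom_per_gpu_gb / kv_per_request_128k_gb)
print(concurrent)                 # 4, matching every tier in the table
```

This is also why all three tiers collapse to the same 8-GPU configuration: adding GPUs shrinks the weight slice per device but does not raise the per-GPU KV headroom enough to admit a fifth 128K request.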
Performance

- Prefill latency: 1299 ms @ 2000 input tokens [estimated]
- Cluster decode throughput: 265 tok/s [estimated]
- Max concurrent users: 4
- Bottleneck: memory_capacity
Generated command

```shell
vllm serve deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 8 \
  --max-model-len 1048576 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --attention-backend auto
```
Generated with:

```shell
llm-cal deepseek-ai/DeepSeek-V4-Flash --gpu 910B4 --engine vllm --lang zh
```