# microsoft/Phi-4 on L40S
How many L40S GPUs does it take to run microsoft/Phi-4?
## Architecture

| Field | Value |
|---|---|
| model_type | phi3 |
| attention | GQA (heads=40, kv_heads=10, hd=128) |
## Weights

| Field | Value | Label |
|---|---|---|
| safetensors bytes | 27.31 GB | [verified] |
| params | 14.7B | [estimated] |
| quantization | BF16 | [verified] |
## Quantization reconciliation

| Scheme | Predicted | Δ | Error |
|---|---|---|---|
| FP16 | 27.31 GB | 37.92 KB over | 0.0% |
| BF16 ✓ | 27.31 GB | 37.92 KB over | 0.0% |
| FP8 | 13.65 GB | 13.65 GB over | 100.0% |
| INT8 | 13.65 GB | 13.65 GB over | 100.0% |
| FP4_FP8_MIXED | 7.51 GB | 19.80 GB over | 263.6% |
**Best: BF16** — the safetensors header reports all 42 weight tensors as BF16 (predicts 29,319,004,160 bytes, 0.0% error).
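The reconciliation above is just parameter count × bytes-per-element, compared against the actual safetensors payload. A minimal sketch in Python — the per-scheme byte widths and the error-relative-to-prediction convention are assumptions chosen to match the table:

```python
# Reconcile candidate quantization schemes against the actual
# checkpoint size. Error is measured relative to the prediction
# (assumed convention; it reproduces the table's 100% for FP8).
ACTUAL_BYTES = 29_319_004_160   # safetensors payload [verified]
PARAMS = 14.7e9                 # parameter count [estimated]

# Assumed bytes-per-parameter for each candidate scheme.
WIDTHS = {"FP16": 2.0, "BF16": 2.0, "FP8": 1.0, "INT8": 1.0}

def error_pct(width: float) -> float:
    predicted = PARAMS * width
    return abs(predicted - ACTUAL_BYTES) / predicted * 100

errors = {name: error_pct(w) for name, w in WIDTHS.items()}
# The 2-byte schemes land near 0% error, the 1-byte schemes near
# 100% — so only the tensor dtypes in the safetensors header are
# needed to break the FP16/BF16 tie.
```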
## KV cache per request

| Context tokens | KV bytes |
|---|---|
| 4,096 | 800.00 MB |
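The per-request figure follows directly from the GQA geometry in the architecture table: two tensors (K and V) per layer, sized by kv_heads × head_dim, at 2 bytes per BF16 element. A sketch — the layer count of 40 is an assumption, but it is the only value consistent with the 800 MB @ 4,096-token figure, and "MB" here is taken to mean MiB:

```python
# KV cache size per request for Phi-4 (GQA: kv_heads=10, hd=128).
# LAYERS=40 is an assumption, back-derived from the table's 800 MB.
LAYERS, KV_HEADS, HEAD_DIM = 40, 10, 128
DTYPE_BYTES = 2  # BF16

def kv_bytes(context_tokens: int) -> int:
    """K and V tensors for every layer, GQA-reduced to kv_heads."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES
    return context_tokens * per_token

print(kv_bytes(4096) / 2**20)  # -> 800.0 (MiB; the report labels it MB)
```

Note the GQA saving: with full multi-head attention (40 KV heads instead of 10) the same context would need 3.2 GB.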
## Recommended fleet

| Tier | GPUs | Weight/GPU | Headroom/GPU | Concurrent @ 128K |
|---|---|---|---|---|
| min | 2 | 13.65 GB | 26.58 GB | 2 |
| dev ★ | 8 | 3.41 GB | 36.82 GB | 11 |
| prod | 8 | 3.41 GB | 36.82 GB | 11 |
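Two columns of this table can be rederived from earlier numbers. A sketch under two assumptions: weights and KV cache are sharded evenly across the tensor-parallel group (so per-GPU headroom pools cluster-wide), and "GB" means 2^30 bytes, consistent with the report's other figures:

```python
import math

# Reproduce the fleet table: per-GPU weight shard and the
# "Concurrent @ 128K" column (assumptions noted in the lead-in).
WEIGHTS_GB = 27.31                            # total BF16 weights
KV_PER_TOKEN = 2 * 40 * 10 * 128 * 2          # bytes (BF16, GQA)
KV_128K_GB = 131_072 * KV_PER_TOKEN / 2**30   # 25.0 GB per request

def weight_per_gpu(gpus: int) -> float:
    """Even tensor-parallel shard of the weights."""
    return WEIGHTS_GB / gpus

def concurrent_at_128k(gpus: int, headroom_per_gpu_gb: float) -> int:
    """Full-context requests that fit in the pooled KV headroom."""
    return math.floor(gpus * headroom_per_gpu_gb / KV_128K_GB)

print(round(weight_per_gpu(8), 2))   # -> 3.41
print(concurrent_at_128k(2, 26.58))  # -> 2   (min tier)
print(concurrent_at_128k(8, 36.82))  # -> 11  (dev / prod tiers)
```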
- Prefill latency: 51 ms @ 2,000 input tokens [estimated]
- Cluster decode throughput: 679 tok/s [estimated]
- Max concurrent users: 22
- Bottleneck: memory_bandwidth
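Why memory bandwidth is the bottleneck: every decoded token streams each GPU's full weight shard out of VRAM. A rough roofline sketch — the 864 GB/s bandwidth is the L40S spec-sheet figure (an assumption here), and the tool's 679 tok/s estimate additionally folds in batching and efficiency factors this bound ignores:

```python
# Roofline bound for decode on one L40S at TP=8: one full read of
# the local weight shard per generated token.
L40S_BW_GBPS = 864       # spec-sheet memory bandwidth (assumption)
WEIGHT_SHARD_GB = 3.41   # weights per GPU at tensor-parallel size 8

# Upper bound on single-stream decode speed, tokens/second.
single_stream_bound = L40S_BW_GBPS / WEIGHT_SHARD_GB  # ~253 tok/s

# Batching amortizes the weight reads across concurrent requests,
# which is how aggregate cluster throughput (679 tok/s across many
# users) can exceed this single-stream figure.
```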
## Generated command

```shell
vllm serve microsoft/Phi-4 \
  --tensor-parallel-size 8 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.9
```
Generated by:

```shell
llm-cal microsoft/Phi-4 --gpu L40S --engine vllm --lang en
```