microsoft/Phi-4 on L40S

How many NVIDIA L40S GPUs are needed to serve microsoft/Phi-4.

Architecture

Field       Value
model_type  phi3
attention   GQA (heads=40, kv_heads=10, head_dim=128)
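The practical effect of GQA is that the KV cache scales with the number of KV heads, not the number of query heads. A minimal sketch of that saving, assuming per-token KV bytes scale linearly with KV head count:

```python
# KV-cache saving from GQA, using the head counts from the table above.
heads, kv_heads, head_dim = 40, 10, 128

# Per token, per layer: one K and one V vector, 2 bytes each in BF16.
kv_bytes_mha = heads * head_dim * 2 * 2      # full multi-head attention
kv_bytes_gqa = kv_heads * head_dim * 2 * 2   # grouped-query attention

print(kv_bytes_gqa)                  # 5120 bytes per token per layer
print(kv_bytes_mha // kv_bytes_gqa)  # 4x smaller KV cache than MHA
```

With 10 KV heads shared across 40 query heads, the cache is 4× smaller than a full-MHA equivalent, which is what makes the concurrency figures later in this report feasible.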

Weights

Field              Value     Label
safetensors bytes  27.31 GB  [verified]
params             14.7B     [estimated]
quantization       BF16      [verified]

Quantization reconciliation

Scheme         Predicted  Δ              Error
FP16           27.31 GB   37.92 KB over  0.0%
BF16 ✓         27.31 GB   37.92 KB over  0.0%
FP8            13.65 GB   13.65 GB over  100.0%
INT8           13.65 GB   13.65 GB over  100.0%
FP4_FP8_MIXED  7.51 GB    19.80 GB over  263.6%

Best: BF16 — safetensors header: all 42 weight tensors are BF16 (predicts 29,319,004,160 bytes, 0.0% error)
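The reconciliation above can be reproduced by predicting file size as params × bytes-per-param and comparing against the verified safetensors size. A sketch, deriving the param count from the exact byte figure quoted above (2 bytes/param in BF16; the mixed FP4/FP8 scheme is omitted since its split ratio isn't given here):

```python
# Predict weight-file size per quantization scheme and measure the error
# against the verified safetensors byte count.
actual_bytes = 29_319_004_160
params = actual_bytes // 2          # 14,659,502,080 (~14.7B), from BF16 size

bytes_per_param = {"FP16": 2, "BF16": 2, "FP8": 1, "INT8": 1}
for scheme, bpp in bytes_per_param.items():
    predicted = params * bpp
    error = (actual_bytes - predicted) / predicted
    print(f"{scheme}: {predicted / 2**30:.2f} GB, error {error:.1%}")
```

This reproduces the table: the 2-byte schemes land on 27.31 GB with ~0% error, while the 1-byte schemes predict 13.65 GB, i.e. the actual file is 100% over, ruling them out.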

KV cache per request

Context tokens  KV bytes
4,096           800.00 MB

Deployment tiers

Tier   GPUs  Weight/GPU  Headroom/GPU  Concurrent @ 128K
min    2     13.65 GB    26.58 GB      2
dev ★  8     3.41 GB     36.82 GB      11
prod   8     3.41 GB     36.82 GB      11
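Both the 800 MB KV-cache figure and the "Concurrent @ 128K" column follow from the architecture numbers. A sketch, assuming 40 transformer layers (not stated in the report, but consistent with the 800 MB figure) and BF16 (2-byte) cache entries:

```python
# Reproduce the KV-cache row and the 128K-context concurrency estimate.
layers, kv_heads, head_dim, dtype_bytes = 40, 10, 128, 2  # layers assumed

def kv_bytes(tokens: int) -> int:
    # K and V per token, per layer, across all KV heads
    return tokens * layers * kv_heads * head_dim * 2 * dtype_bytes

print(kv_bytes(4096) / 2**20)   # 800.0 MB per 4,096-token request

# Concurrency at 128K context on the 8-GPU tier (36.82 GB headroom per GPU)
headroom = 8 * 36.82 * 2**30
print(int(headroom // kv_bytes(131_072)))  # 11 concurrent requests
```

At full 128K context each request needs 32× the 4K figure (~25 GB of KV cache), so even eight GPUs' pooled headroom only fits 11 concurrent requests.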

Performance

  • Prefill latency: 51 ms @ 2,000 input tokens [estimated]
  • Cluster decode throughput: 679 tok/s [estimated]
  • Max concurrent users: 22
  • Bottleneck: memory_bandwidth
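A rough sanity check on the prefill estimate: prefill is roughly compute-bound at ~2 × params FLOPs per input token. The peak BF16 throughput (~362 TFLOPS per L40S) and the ~40% utilization factor below are assumptions chosen to illustrate how a figure like 51 ms could arise, not values taken from the report:

```python
# Back-of-envelope prefill latency under assumed peak FLOPS and utilization.
params = 14.66e9
tokens = 2000
n_gpus, peak_flops, mfu = 8, 362e12, 0.40   # peak and MFU are assumptions

flops = 2 * params * tokens                  # forward-pass FLOPs for prefill
latency_s = flops / (n_gpus * peak_flops * mfu)
print(f"{latency_s * 1e3:.0f} ms")           # ~51 ms
```

Decode, by contrast, must stream the weights (and growing KV cache) from HBM for every generated token, which is why the reported bottleneck is memory bandwidth rather than compute.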

Generated command

vllm serve microsoft/Phi-4 \
  --tensor-parallel-size 8 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.9

Generated by:

llm-cal microsoft/Phi-4 --gpu L40S --engine vllm --lang en