
mistralai/Mixtral-8x7B-v0.1 on H100

How many H100 GPUs are needed to run mistralai/Mixtral-8x7B-v0.1.

Architecture

| Field | Value |
|---|---|
| model_type | mixtral |
| attention | GQA (heads=32, kv_heads=8, head_dim=128) |
| moe | 8 routed experts + 0 shared, top-2 routing |

Weights

| Field | Value | Label |
|---|---|---|
| safetensors bytes | 86.99 GB | [verified] |
| params | 46.7B | [estimated] |
| quantization | BF16 | [verified] |
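The size and parameter count are consistent with each other; a quick cross-check (note the report's "GB" figures are binary GiB, and the exact byte count comes from the safetensors header):

```python
# Cross-check: the safetensors byte count vs. the reported 46.7B params
# at 2 bytes/param (BF16). "GB" in the report is binary GiB (2**30 bytes).
actual_bytes = 93_405_577_216                     # from the safetensors header
params = actual_bytes / 2                         # BF16 = 2 bytes per parameter

print(f"params ≈ {params / 1e9:.1f}B")            # ≈ 46.7B
print(f"size ≈ {actual_bytes / 2**30:.2f} GiB")   # ≈ 86.99 GiB
```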

Quantization reconciliation

| Scheme | Predicted | Δ (actual − predicted) | Error |
|---|---|---|---|
| FP16 | 86.99 GB | 133.09 KB over | 0.0% |
| BF16 ✓ | 86.99 GB | 133.09 KB over | 0.0% |
| FP8 | 43.50 GB | 43.50 GB over | 100.0% |
| INT8 | 43.50 GB | 43.50 GB over | 100.0% |
| FP4_FP8_MIXED | 23.92 GB | 63.07 GB over | 263.6% |

Best: BF16 — safetensors header: all 48 weight tensors are BF16 (predicts 93,405,577,216 bytes, 0.0% error)
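The reconciliation predicts a file size per scheme (bytes per parameter times parameter count) and compares it with the actual safetensors size, with error measured relative to the prediction. A minimal sketch (here params is derived back from the actual bytes, so BF16 matches exactly; the real tool derives params from the architecture, hence its small 133 KB residual):

```python
# Predict weight bytes per quantization scheme and compare with the
# actual safetensors size; error is relative to the prediction.
actual = 93_405_577_216                    # bytes, from the safetensors header
params = actual / 2                        # derived: all tensors are BF16
bytes_per_param = {"FP16": 2.0, "BF16": 2.0, "FP8": 1.0, "INT8": 1.0}

for scheme, bpp in bytes_per_param.items():
    predicted = params * bpp
    err = abs(actual - predicted) / predicted * 100
    print(f"{scheme:5s} predicted {predicted / 2**30:6.2f} GiB, error {err:.1f}%")
```

An FP8/INT8 checkpoint would be half the observed size, so its prediction misses by 100%, which is how the tool rules those schemes out.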

KV cache per request

| Context tokens | KV bytes |
|---|---|
| 4,096 | 512.00 MB |
| 32,768 | 4.00 GB |

GPU tiers

| Tier | GPUs | Weight/GPU | Headroom/GPU | Concurrent @ 128K |
|---|---|---|---|---|
| min | 2 | 43.50 GB | 23.56 GB | 2 |
| dev ★ | 4 | 21.75 GB | 45.31 GB | 11 |
| prod | 8 | 10.87 GB | 56.18 GB | 28 |
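The KV figures follow from the GQA geometry plus Mixtral-8x7B's 32 layers (the layer count is not shown in the Architecture table above), and the concurrency column is total headroom across the tier's GPUs divided by a 128K-token KV footprint. A sketch reproducing the numbers:

```python
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * dtype bytes.
# layers=32 is Mixtral-8x7B's depth; BF16 KV cache assumed (2 bytes).
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # 131,072 B = 128 KiB

print(kv_per_token * 4_096 / 2**20)    # 512.0 MiB at 4K context
print(kv_per_token * 32_768 / 2**30)   # 4.0 GiB at 32K context

# Concurrency at 128K tokens: total headroom across the tier / per-request KV.
kv_128k = kv_per_token * 131_072 / 2**30                   # 16.0 GiB per request
for tier, gpus, headroom_gib in [("min", 2, 23.56), ("dev", 4, 45.31), ("prod", 8, 56.18)]:
    print(tier, int(gpus * headroom_gib // kv_128k))       # 2, 11, 28
```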

Performance

  • Prefill latency: 118 ms @ 2,000 input tokens [estimated]
  • Cluster decode throughput: 258 tok/s [estimated]
  • Max concurrent users: 8
  • Bottleneck: memory bandwidth
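Because decode is memory-bandwidth-bound, each generated token must stream the active weights (top-2 of 8 experts, roughly 12.9B active parameters for Mixtral-8x7B) through HBM. A rough roofline sketch for the 4-GPU dev tier; the per-GPU bandwidth (~3.35 TB/s for H100 SXM) and ~50% achieved-bandwidth fraction are assumptions here, not values from the report:

```python
# Roofline estimate for decode throughput on the dev tier (4x H100).
# Assumptions (not from the report): 3.35 TB/s HBM3 per GPU, ~50% achieved
# bandwidth, ~12.9B active params/token for top-2 routing, BF16 weights.
active_params = 12.9e9
bytes_per_token = active_params * 2       # BF16: stream active weights once per token
bw_total = 4 * 3.35e12                    # aggregate bandwidth across 4 GPUs, B/s
mbu = 0.5                                 # achieved fraction of peak bandwidth

tok_per_s = bw_total * mbu / bytes_per_token
print(f"~{tok_per_s:.0f} tok/s")          # lands near the 258 tok/s estimate
```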

Generated command

```shell
vllm serve mistralai/Mixtral-8x7B-v0.1 \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9
```

Generated by:

```shell
llm-cal mistralai/Mixtral-8x7B-v0.1 --gpu H100 --engine vllm --lang en
```