mistralai/Mixtral-8x7B-v0.1 on A100-80G

How many A100-80G GPUs mistralai/Mixtral-8x7B-v0.1 needs.

Architecture

Field	Value
`model_type`	`mixtral`
`attention`	`GQA (heads=32, kv_heads=8, hd=128)`
`moe`	`8 routed + 0 shared, top-2`

Weights

Field	Value	Label
safetensors bytes	86.99 GB	`[verified]`
Parameter count	46.7B	`[estimated]`
Quantization scheme	`BF16`	`[verified]`
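The parameter count and the GB figure can be cross-checked from the tensor byte count reported on this page (93,405,577,216 bytes). The sketch below is an illustration, not the tool's own code. It assumes that every parameter is stored as 2-byte BF16 and that "GB" here means 2**30 bytes; the function names are hypothetical.

```python
# Tensor byte count quoted on this page for the BF16 prediction.
WEIGHT_BYTES = 93_405_577_216
BYTES_PER_PARAM_BF16 = 2  # BF16 stores each parameter in 16 bits


def param_count(weight_bytes: int, bytes_per_param: int) -> int:
    """Estimate the number of parameters from raw tensor bytes."""
    return weight_bytes // bytes_per_param


def to_binary_gb(num_bytes: float) -> float:
    """Convert bytes to binary gigabytes (assumption: 1 GB = 2**30 bytes)."""
    return num_bytes / 2**30


params = param_count(WEIGHT_BYTES, BYTES_PER_PARAM_BF16)
print(f"{params / 1e9:.1f}B parameters")        # 46.7B parameters
print(f"{to_binary_gb(WEIGHT_BYTES):.2f} GB")   # 86.99 GB
```

Under the binary-unit assumption both values agree with the table; with decimal units the same byte count would read as about 93.41 GB.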

Quantization inference

Scheme	Predicted	Δ (actual - predicted)	Error (relative to predicted)
FP16	86.99 GB	133.09 KB over	0.0%
BF16 ✓	86.99 GB	133.09 KB over	0.0%
FP8	43.50 GB	43.50 GB over	100.0%
INT8	43.50 GB	43.50 GB over	100.0%
FP4_FP8_MIXED	23.92 GB	63.07 GB over	263.6%

Best: BF16. The safetensors header shows that all 48 weight tensors are BF16 (predicts 93,405,577,216 bytes, 0.0% error).
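The size-matching step in the table can be approximated as follows. This is a minimal sketch under stated assumptions, not the tool's implementation: the parameter count is derived from the BF16 byte count above, the observed file size is reconstructed as that byte count plus the reported 133.09 KB, and the error is taken relative to the predicted size (which is what reproduces the 100.0% rows). FP4_FP8_MIXED is left out because the page does not state its mixing ratio.

```python
PARAMS = 93_405_577_216 // 2  # parameters, derived from the BF16 tensor bytes

# Assumption: observed size = BF16 tensor bytes + the reported 133.09 KB delta.
OBSERVED_BYTES = 93_405_577_216 + round(133.09 * 1024)

# Bytes per parameter for the schemes whose width is unambiguous.
BYTES_PER_PARAM = {"FP16": 2.0, "BF16": 2.0, "FP8": 1.0, "INT8": 1.0}


def predict_bytes(params: int, scheme: str) -> float:
    """Predicted weight size if every parameter used the given scheme."""
    return params * BYTES_PER_PARAM[scheme]


def relative_error(observed: float, predicted: float) -> float:
    """Absolute size mismatch as a fraction of the predicted size."""
    return abs(observed - predicted) / predicted


for scheme in BYTES_PER_PARAM:
    predicted = predict_bytes(PARAMS, scheme)
    error = relative_error(OBSERVED_BYTES, predicted)
    print(f"{scheme:5s} {predicted / 2**30:6.2f} GB  {error * 100:5.1f}%")
```

Size alone cannot separate FP16 from BF16, since both use 2 bytes per parameter. The page resolves that tie from the tensor dtypes in the safetensors header.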

KV cache (per request)

Context tokens	KV bytes
4,096	512.00 MB
32,768	4.00 GB
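The two rows above are consistent with the usual GQA cache-size formula, sketched below. The head counts come from the architecture table; the layer count of 32 and the 2-byte (BF16) cache dtype are assumptions, since neither is listed on this page. Function names are illustrative.

```python
KV_HEADS = 8      # from the architecture table (kv_heads=8)
HEAD_DIM = 128    # from the architecture table (hd=128)
NUM_LAYERS = 32   # assumption: layer count is not listed on this page
DTYPE_BYTES = 2   # assumption: KV cache kept in BF16


def kv_bytes_per_token(num_layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes needed to cache one token: a K and a V vector per layer."""
    return 2 * num_layers * kv_heads * head_dim * dtype_bytes


def kv_cache_bytes(context_tokens: int) -> int:
    """Per-request KV cache size for a given context length."""
    per_token = kv_bytes_per_token(NUM_LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES)
    return per_token * context_tokens


print(kv_cache_bytes(4_096) / 2**20, "MB")    # 512.0 MB
print(kv_cache_bytes(32_768) / 2**30, "GB")   # 4.0 GB
```

Because only the 8 KV heads are cached rather than all 32 query heads, the cache is a quarter of what full multi-head attention would need at the same context length.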

Performance

Prefill latency	374 ms @ 2000 input tokens `[estimated]`
Cluster decode throughput	149 tok/s `[estimated]`
Max concurrent users	4
Bottleneck	memory_bandwidth

Generated command

vllm serve mistralai/Mixtral-8x7B-v0.1 \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9

Generated with:

llm-cal mistralai/Mixtral-8x7B-v0.1 --gpu A100-80G --engine vllm --lang zh

Recommended cluster

Tier	GPUs	Weight/GPU	Headroom/GPU	Concurrent @ 128K
min	2	43.50 GB	23.56 GB	2
dev ★	4	21.75 GB	45.31 GB	11
prod	8	10.87 GB	56.18 GB	28
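The tier numbers can be reproduced with simple memory arithmetic. The relationships below are inferred from the table values and are not confirmed as the tool's actual formulas. Assumptions: each GPU has 80e9 bytes, of which a 0.9 fraction is usable (matching `--gpu-memory-utilization 0.9`), weights are split evenly across GPUs, KV cache costs 131,072 bytes per token (32 layers, BF16), and "128K" means 131,072 tokens.

```python
import math

GIB = 2**30
GPU_BYTES = 80e9                 # assumption: "80G" means 80e9 bytes
GPU_MEM_UTIL = 0.9               # matches --gpu-memory-utilization 0.9
WEIGHT_BYTES = 93_405_577_216    # BF16 tensor bytes quoted on this page
KV_BYTES_PER_TOKEN = 131_072     # assumption: 2 * 32 layers * 8 heads * 128 * 2 bytes
CONTEXT_TOKENS = 128 * 1024      # assumption: "128K" = 131,072 tokens


def tier(num_gpus: int) -> tuple[float, float, int]:
    """Return (weight GB per GPU, headroom GB per GPU, concurrent requests)."""
    usable = GPU_BYTES * GPU_MEM_UTIL
    weight_per_gpu = WEIGHT_BYTES / num_gpus
    headroom = usable - weight_per_gpu
    kv_per_request = KV_BYTES_PER_TOKEN * CONTEXT_TOKENS
    # Pool the headroom of all GPUs and see how many full caches fit.
    concurrent = math.floor(num_gpus * headroom / kv_per_request)
    return weight_per_gpu / GIB, headroom / GIB, concurrent


for name, gpus in [("min", 2), ("dev", 4), ("prod", 8)]:
    weight, headroom, concurrent = tier(gpus)
    print(f"{name:4s} {gpus}  {weight:.2f} GB  {headroom:.2f} GB  {concurrent}")
```

This sketch ignores activation memory and framework overhead, so real deployments will likely fit fewer concurrent requests than it suggests.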
