# llm-cal
LLM inference hardware calculator — architecture-aware, engine-version-aware, honest-labeled.
Give it a HuggingFace / ModelScope model id and a GPU type, get back:
- real weight size (read from `safetensors` metadata, not guessed)
- architecture profile: MLA, NSA, CSA+HCA, MoE, sliding window — each a first-class trait
- KV cache per request at multiple context lengths (worked example below)
- recommended fleet size: `min`/`dev`/`prod`, with TP-aware KV sharding
- engine compatibility from a curated matrix (vLLM & SGLang × 16 architecture families)
- a ready-to-paste `vllm serve` or `sglang launch_server` command
Output is bilingual — English and 中文.
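As a taste of the `[estimated]` category described below: for plain MHA/GQA attention, per-request KV cache follows the standard closed form. The sketch uses Llama-3.1-70B's published config (80 layers, 8 KV heads, head dim 128) as a hand check; architecture traits like MLA or sliding window change the formula, which is exactly why llm-cal models them separately.

```bash
# kv_bytes = 2 (K and V) * layers * kv_heads * head_dim * context_len * bytes/elem
# Hand check: Llama-3.1-70B at 128k context with BF16 KV (2 bytes/elem)
echo $(( 2 * 80 * 8 * 128 * 131072 * 2 ))   # 42949672960 bytes ≈ 42.9 GB per request
```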
## Why another calculator?
Existing tools (gpu_poor, llm-vram-calculator, APXML, SelfHostLLM, ...) all compute weight size using params × precision. That silently fails on new architectures:
| Model | gpu_poor says | Real safetensors | llm-cal |
|---|---|---|---|
| DeepSeek-V4-Flash (FP4+FP8 pack) | 284 GB (FP8 assumption) | 160 GB | 160 GB ✓ |
| Standard FP8 models | correct | correct | correct ✓ |
llm-cal reads the real file sizes from the HuggingFace API, then compares against every known quantization scheme — the best match wins. The DeepSeek-V4 story becomes explicit:
```
Quantization reconciliation (observed vs predicted per scheme)

  scheme          predicted   bytes delta    error %
  FP4_FP8_MIXED   160.01 GB   397 MB under     0.2%   ← wins
  FP8             290.94 GB   131 GB under    45.1%   ← the gpu_poor trap
```
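The observed side of that reconciliation is plain public data. A minimal sketch of reproducing it by hand (assumes `curl` and `jq` are installed; the model id is just an example, and llm-cal does the equivalent internally):

```bash
# Sum the on-disk bytes of all *.safetensors shards, as reported by the HF Hub API
curl -s 'https://huggingface.co/api/models/deepseek-ai/DeepSeek-V3?blobs=true' |
  jq '[.siblings[] | select(.rfilename | endswith(".safetensors")) | .size] | add'
```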
And every number has a tag telling you where it came from:
- `[verified]` — read directly from HF API / config.json
- `[inferred]` — derived from `[verified]` in a single step
- `[estimated]` — computed by a formula (KV cache, weight split)
- `[cited]` — from release notes / PR / announcement
- `[unverified]` — matrix entry without evidence (explicitly flagged)
- `[unknown]` — failed to recognize, graceful degrade
## Install
Requires Python 3.11+.
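Assuming the package is published under the same name as the CLI (a guess; check the repository if it differs):

```bash
pip install llm-cal
```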
Auth (for gated models like Llama, Gemma):
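Either the interactive login or an environment token works; these are standard Hugging Face Hub mechanisms, not llm-cal-specific:

```bash
huggingface-cli login             # interactive, stores the token locally
# or, non-interactive:
export HF_TOKEN=hf_xxxxxxxxxxxx   # placeholder token
```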
Chinese mirror (if HF is slow):
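The standard Hub endpoint override applies here; hf-mirror.com is the commonly used public mirror:

```bash
export HF_ENDPOINT=https://hf-mirror.com
```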
## Quickstart
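A sketch of a first run; the argument shape is illustrative, and the model id and GPU name are examples:

```bash
# model id + GPU type in, sizing report out
llm-cal deepseek-ai/DeepSeek-V3 --gpu H200
```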
See Quickstart for full walkthrough, Architecture Guide for how the tool works, and Contributing for how to add models, GPUs, or engine support.
## Validation
Run the built-in benchmark against curated reference data:
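The flag name below is an assumption based on the description above; consult the CLI help for the exact spelling:

```bash
llm-cal --benchmark   # flag name assumed; checks output against the curated reference dataset
```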
Current result: 33/33 PASS across 8 reference models, 6 check types. Every expected value in the dataset cites its source (HF API / model card / vLLM recipe / hand computation). See the benchmark section of the contributing guide.
## Supported
- 47 GPUs across NVIDIA / AMD / Intel Habana / Huawei Ascend / Cambricon / Moore Threads / MetaX / KunlunXin / Biren / Iluvatar / Hygon
- 16 architecture families in the engine compatibility matrix
- 2 inference engines: vLLM and SGLang
- 2 output languages: English and 中文
Run `llm-cal --list-gpus` to see the full GPU table with aliases.
## License
Apache-2.0. See LICENSE.