Quickstart

The canonical run

llm-cal deepseek-ai/DeepSeek-V4-Flash --gpu H800 --engine vllm

This is the tool's reference case. You get back:

  1. Architecture profile — DeepSeek-V4 detected, CSA+HCA + MoE + sliding window, confidence: high.
  2. Weights — safetensors bytes: 159.62 GB [verified], quantization guess: FP4_FP8_MIXED [inferred].
  3. Reconciliation — predicted bytes under each quantization scheme. FP4_FP8_MIXED wins at 0.2% error; FP8 is off by 45.1%.
  4. KV cache — 4K / 32K / 128K / 1M context length estimates.
  5. Engine compatibility — vLLM ≥0.19.0 [cited], with source URLs.
  6. Target hardware — H800 spec with bilingual notes.
  7. Recommended fleet — min / dev / prod tiers with TP-aware KV sharding.
  8. Generated command — paste-ready vllm serve ....
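
As an illustration, the generated command typically resembles the sketch below. The flags shown (--tensor-parallel-size, --max-model-len, --gpu-memory-utilization) are standard vllm serve options, but every value here is hypothetical; paste what the tool actually emits, not this sketch.

vllm serve deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 8 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.90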

Common flags

# Chinese output
llm-cal <model> --gpu H800 --engine vllm --lang zh

# Force a specific GPU count (skip min/dev/prod recommendation)
llm-cal <model> --gpu H100 --gpu-count 4

# Override context length for KV cache math (see the KV sizing sketch below)
llm-cal <model> --gpu H800 --context-length 65536

# Bypass cache (useful after a model repo update)
llm-cal <model> --gpu H800 --refresh

# See all supported GPUs
llm-cal --list-gpus

# Validate tool output against curated reference values
llm-cal --benchmark
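
For intuition on what --context-length feeds into: the sketch below shows a generic dense-attention KV cache estimate. The tool's actual formula accounts for sliding windows and the CSA+HCA layout, and the layer/head counts here are hypothetical placeholders, not real model values.

# KV bytes ≈ 2 (K and V) × layers × kv_heads × head_dim × context_len × bytes/elem
# Hypothetical shape: 61 layers, 8 KV heads, head_dim 128, 64K context, FP16 (2 bytes)
echo "scale=2; 2 * 61 * 8 * 128 * 65536 * 2 / 1024^3" | bc   # ≈ 15.2 GB
# Under tensor parallelism the cache shards across ranks: per-GPU ≈ total / tp_size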

Non-NVIDIA examples

# AMD flagship (256 GB HBM3E, the largest single-card memory)
llm-cal deepseek-ai/DeepSeek-V4-Flash --gpu MI325X --engine vllm

# Huawei Ascend 910B4 (inference variant, 32 GB)
llm-cal Qwen/Qwen2.5-7B-Instruct --gpu 910B4 --engine vllm

# Chinese accelerators by alias
llm-cal Qwen/Qwen2.5-14B-Instruct --gpu 曦云C500 --engine vllm
llm-cal deepseek-ai/DeepSeek-V4-Flash --gpu 昆仑芯P800 --engine vllm
llm-cal Qwen/Qwen2.5-7B-Instruct --gpu 摩尔线程S4000 --engine vllm

Output labels explained

Every number in the report is tagged:

| Tag | Meaning | Example |
| --- | --- | --- |
| [verified] | Direct read from API or file | safetensors bytes: 159.62 GB (HF siblings API) |
| [inferred] | One-step derivation from verified data | bits/param: 4.39 (bytes ÷ params) |
| [estimated] | Formula-based computation | KV cache @ 128K: 2.21 GB |
| [cited] | External source (release note / PR) | vLLM ≥0.19.0 supports CSA+HCA |
| [unverified] | Matrix entry without evidence — flagged | SGLang day-0 support pending |
| [unknown] | Couldn't identify; graceful degrade | New model type not in registry |
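
A worked check using the report's own numbers (the ~290.9B parameter count is back-derived here for illustration, not read from the model config):

# [inferred] bits/param = verified bytes × 8 ÷ params
echo "scale=4; 159.62 * 8 / 290.9" | bc          # ≈ 4.39
# Reconciliation: pure FP8 would predict ~1 byte/param ≈ 290.9 GB, and
echo "scale=4; (290.9 - 159.62) / 290.9" | bc    # ≈ .451, the 45.1% FP8 error above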

Do NOT trust any tool that gives you a single number without provenance. That's the tool's value prop.

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | Success, or --benchmark all PASS |
| 1 | --benchmark has failures |
| 2 | Authentication required (gated model without HF_TOKEN) |
| 3 | Model not found |
| 4 | Source unavailable (network / rate limit / 5xx) |

These make the tool scriptable in CI.
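
A minimal CI sketch built on these codes (treating exit 4 as a soft skip is a policy assumption, not tool guidance):

#!/usr/bin/env bash
# Gate a pipeline on llm-cal's curated benchmark
llm-cal --benchmark
status=$?
case "$status" in
  0) echo "benchmark: all PASS" ;;
  1) echo "benchmark: failures against reference values" >&2; exit 1 ;;
  4) echo "source unavailable; skipping gate" >&2 ;;
  *) echo "llm-cal exited with code $status" >&2; exit "$status" ;;
esac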

Troubleshooting

"需要认证 / Authentication required"

Set HF_TOKEN:

export HF_TOKEN=hf_xxxxx

Or run huggingface-cli login once to store the token locally.

Slow HF API in China

Set the mirror endpoint:

export HF_ENDPOINT=https://hf-mirror.com

Tool says model_type not in v0.1 matrix

The engine compatibility matrix doesn't yet have an entry for this model family. The rest of the report (weights, KV cache, fleet recommendation) still works — just the engine section shows "no match".

Consider contributing a matrix entry via PR — see Contributing.

Tool reports [unknown] architecture

A brand-new model type that the detector doesn't recognize yet. The tool falls back to:

  • [verified] safetensors bytes (still reliable)
  • No KV cache estimate
  • No engine compatibility info
  • Conservative fleet recommendation based on weight fit only (sizing sketch below)

This is by design. See the "Graceful unknown" section in the Architecture Guide for the fallback tree.
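
For a sense of what weight-fit-only sizing means: the conservative recommendation reduces to dividing verified weight bytes by usable VRAM per card. The 0.90 utilization headroom below is a hypothetical figure, not the tool's actual constant.

# Minimum cards from weight fit alone (H800: 80 GB VRAM per card)
echo "scale=2; 159.62 / (80 * 0.90)" | bc   # ≈ 2.21, round up to 3 cards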