Architecture Guide¶
How the tool identifies a model, computes its memory footprint, and matches it to engines + hardware. Read this before contributing.
The core insight: architectures are traits, not labels¶
A model isn't "MoE" or "MLA" — it's a composition of traits. DeepSeek-V3.2 is
MoE + MLA + NSA. DeepSeek-V4 is MoE + MLA + CSA+HCA + sliding window. Qwen3 is
dense + GQA + RoPE. The tool captures this with ArchitectureProfile:
```python
from dataclasses import dataclass

# Family, Confidence, and the trait dataclasses (AttentionTraits, MoETraits,
# PositionTraits) live alongside this class in the architecture package.

@dataclass(frozen=True)
class ArchitectureProfile:
    model_type: str
    family: Family                # transformer | state_space | unknown
    num_hidden_layers: int
    hidden_size: int
    vocab_size: int
    attention: AttentionTraits    # variant (MHA/GQA/MQA/MLA/NSA/CSA_HCA) + shape
    moe: MoETraits | None         # None = dense
    position: PositionTraits      # RoPE / YaRN / ALiBi / none
    sliding_window: int | None
    auxiliary: dict               # pass-through for future traits
    confidence: Confidence        # HIGH | MEDIUM | LOW
```
Dispatching on the raw model_type string (`if model_type == "deepseek_v4": ...`) can't express
this combination. Traits can.
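A sketch of what trait composition buys downstream code, assuming the field names above; nothing here branches on a model name:

```python
def describe(profile: ArchitectureProfile) -> str:
    parts = ["MoE" if profile.moe else "dense"]
    parts.append(f"{profile.attention.variant}")              # e.g. GQA, MLA, NSA
    if profile.sliding_window:
        parts.append(f"sliding_window={profile.sliding_window}")
    return " + ".join(parts)

# Qwen3 → "dense + GQA"; DeepSeek-V3.2 → "MoE + NSA"
# (NSA outranks MLA in variant detection; see the priority table below)
```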
Detection flow¶
```
┌──────────────────┐
│ config.json dict │
└────────┬─────────┘
         ▼
┌──────────────────────────────────┐
│ model_type in STATE_SPACE_TYPES? │
│ or `ssm_cfg` present?            │
└───────┬─────────────────┬────────┘
    yes │                 │ no
        ▼                 ▼
Family.STATE_SPACE   ┌──────────────────────────────┐
(v0.1 unsupported)   │ model_type AND architectures │
                     │ both missing?                │
                     └───────┬────────────────┬─────┘
                         yes │                │ no
                             ▼                ▼
                 _fallback_unknown     ┌─────────────────────┐
                 (Family.UNKNOWN,      │ required fields     │
                  confidence=LOW)      │ (layers/hidden)?    │
                                       └────┬──────────┬─────┘
                                    missing │          │ ok
                                            ▼          ▼
                            _fallback_unknown    ┌────────────────────────┐
                                                 │ gather independent     │
                                                 │ trait sub-detectors:   │
                                                 │  detect_attention()    │
                                                 │  detect_moe()          │
                                                 │  detect_position()     │
                                                 │  detect_sliding_window │
                                                 └───────────┬────────────┘
                                                             ▼
                                                 ┌────────────────────────┐
                                                 │ ArchitectureProfile    │
                                                 │ family=TRANSFORMER     │
                                                 │ confidence = HIGH if   │
                                                 │ model_type known,      │
                                                 │ else MEDIUM            │
                                                 └────────────────────────┘
```
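The same flow, condensed into a sketch. The sub-detector names come from the diagram; `_state_space_profile`, `_is_known_type`, and the fallback defaults are assumptions about the real detector.py:

```python
def detect(cfg: dict) -> ArchitectureProfile:
    if cfg.get("model_type") in STATE_SPACE_TYPES or "ssm_cfg" in cfg:
        return _state_space_profile(cfg)   # Family.STATE_SPACE, v0.1 unsupported
    if "model_type" not in cfg and "architectures" not in cfg:
        return _fallback_unknown(cfg)      # Family.UNKNOWN, confidence=LOW
    if "num_hidden_layers" not in cfg or "hidden_size" not in cfg:
        return _fallback_unknown(cfg)      # required fields missing
    return ArchitectureProfile(
        model_type=cfg.get("model_type", ""),
        family=Family.TRANSFORMER,
        num_hidden_layers=cfg["num_hidden_layers"],
        hidden_size=cfg["hidden_size"],
        vocab_size=cfg.get("vocab_size", 0),
        attention=detect_attention(cfg),   # independent trait sub-detectors
        moe=detect_moe(cfg),
        position=detect_position(cfg),
        sliding_window=detect_sliding_window(cfg),
        auxiliary={},
        confidence=Confidence.HIGH if _is_known_type(cfg) else Confidence.MEDIUM,
    )
```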
Attention variant detection — order matters¶
detect_attention() is order-sensitive. First match wins on variant, but shape
fields (heads / kv_heads / head_dim) are always populated.
| Priority | Variant | Detection key | Example model |
|---|---|---|---|
| 1 | CSA_HCA | `compress_ratios` (length matches `num_hidden_layers`, or `num_hidden_layers + num_nextn_predict_layers`) | DeepSeek-V4-Flash |
| 2 | NSA | `nsa_config` present OR `sparse_attention_cfg` | DeepSeek-V3.2 |
| 3 | MLA | `q_lora_rank` or `kv_lora_rank` | DeepSeek-V2, V3 |
| 4 | MQA | `num_kv_heads == 1` (when no MLA keys) | |
| 5 | GQA | `num_kv_heads < num_heads` | Llama-3, Qwen |
| 6 | MHA | default | old Llama 1/2 |
Why the CSA_HCA length check matters: `compress_ratios` as a key name could
legitimately appear in future architectures with different semantics. The length
equality check (`len == num_hidden_layers` or `== num_hidden_layers +
num_nextn_predict_layers`) is the guard that prevents false positives. A reviewer
flagged this twice during design review; the regression test is
`tests/test_detector.py::test_length_mismatch_is_not_classified_as_csa_hca`.
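A minimal sketch of that guard; the helper name `_is_csa_hca` and the raw-dict access pattern are assumptions about the real detector:

```python
def _is_csa_hca(cfg: dict) -> bool:
    ratios = cfg.get("compress_ratios")
    if not isinstance(ratios, (list, tuple)):
        return False
    n_layers = cfg["num_hidden_layers"]
    n_mtp = cfg.get("num_nextn_predict_layers", 0)
    # The guard: a compress_ratios key of the wrong length is treated as an
    # unrelated field, not as evidence of CSA+HCA.
    return len(ratios) in (n_layers, n_layers + n_mtp)
```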
KV cache formula — traits composition¶
```
baseline_per_token_per_layer = 2 (K+V) × num_kv_heads × head_dim × dtype_bytes
                               (or, for MLA: kv_lora_rank × dtype_bytes)

baseline (per request) = baseline_per_token_per_layer × effective_seq_len
                         × num_hidden_layers

effective_seq_len:
  if sparse variant (CSA_HCA, NSA): seq_len   (sliding_window does NOT apply —
                                               the sparse mechanism already
                                               encodes per-layer reduction)
  elif sliding_window present:      min(seq_len, sliding_window)
  else:                             seq_len

compositional modifiers (applied to baseline → total_KV):
  CSA_HCA: × average_compress_ratio(compress_ratios)
           (ratio 0 → keep 1.0, ratio N>0 → keep 1/N, averaged across all layers)
  NSA:     × (nsa_topk / effective_seq_len), clamped to [0, 1]

per_gpu_KV = total_KV / min(tp_size, max(1, num_kv_heads))
  — MQA (kv_heads=1): always replicates (divisor = 1)
  — GQA (kv_heads=G): splits up to G ways
  — MHA: splits fully
```
Validation: DeepSeek-V4-Flash at 128K context vs hand-math:
error < 1% in tests/test_formulas.py::test_128k_kv_cache_within_1_percent.
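A condensed sketch of how the pieces compose. The trait attribute names (`num_kv_heads`, `kv_lora_rank`, `nsa_topk`, `compress_ratios`) are assumptions; the real kv_cache.py reads them from the profile, and variants are compared as strings here for brevity:

```python
def average_compress_ratio(ratios: list[int]) -> float:
    # ratio 0 → keep 1.0; ratio N>0 → keep 1/N; averaged across all layers
    return sum(1.0 if r == 0 else 1.0 / r for r in ratios) / len(ratios)

def estimate_kv_cache_bytes(profile, seq_len: int, dtype_bytes: int, tp_size: int) -> float:
    att = profile.attention
    if att.variant == "MLA":
        per_token_per_layer = att.kv_lora_rank * dtype_bytes
    else:
        per_token_per_layer = 2 * att.num_kv_heads * att.head_dim * dtype_bytes

    sparse = att.variant in ("CSA_HCA", "NSA")
    if not sparse and profile.sliding_window:
        eff_seq = min(seq_len, profile.sliding_window)
    else:
        eff_seq = seq_len  # sparse variants already encode their own reduction

    total = per_token_per_layer * eff_seq * profile.num_hidden_layers
    if att.variant == "CSA_HCA":
        total *= average_compress_ratio(att.compress_ratios)
    elif att.variant == "NSA":
        total *= min(1.0, max(0.0, att.nsa_topk / eff_seq))

    # MQA (kv_heads=1) replicates; GQA splits at most kv_heads ways; MHA splits fully.
    return total / min(tp_size, max(1, att.num_kv_heads))
```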
How to add a new architecture (10-step checklist)¶
Suppose a model FooModel ships with model_type=foo_v1, dense + GQA, and a
novel reuse_kv_across_layers: bool config flag that halves KV cache.
1. Sample config.json — save a copy under `tests/fixtures/configs/foo_v1.json`. This is your test anchor. The more realistic, the better. A trimmed sketch follows below.
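A trimmed fixture might look like this; every value below is invented for illustration:

```json
{
  "model_type": "foo_v1",
  "architectures": ["FooModelForCausalLM"],
  "num_hidden_layers": 32,
  "hidden_size": 4096,
  "vocab_size": 128256,
  "num_attention_heads": 32,
  "num_key_value_heads": 8,
  "reuse_kv_across_layers": true
}
```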
2. Register the model_type in `src/llm_cal/architecture/detector.py`. This flips confidence from MEDIUM to HIGH. See the sketch below.
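The registry name below (`KNOWN_MODEL_TYPES`) is an assumption about detector.py's internals:

```python
KNOWN_MODEL_TYPES = {
    "llama", "qwen3", "deepseek_v3",  # existing entries (illustrative)
    "foo_v1",                         # new: detection now returns confidence=HIGH
}
```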
3. Extend AttentionTraits (only if a new variant is needed). In this case, `reuse_kv_across_layers` is a multiplier on the standard formula, not a new variant — skip this step. If you ARE introducing a new variant (say, "FOO_SPARSE"), extend the variant enum as sketched below.
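Assuming the variant is an enum; the actual type and module may differ:

```python
from enum import Enum

class AttentionVariant(str, Enum):
    MHA = "MHA"
    GQA = "GQA"
    MQA = "MQA"
    MLA = "MLA"
    NSA = "NSA"
    CSA_HCA = "CSA_HCA"
    FOO_SPARSE = "FOO_SPARSE"  # new variant
```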
4. Extend detect_attention() in `traits.py` (only if new variant). Add detection logic at the correct point in the priority order; remember: first match wins. See the sketch below.
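A sketch of where the new check slots into the chain. Only the ordering is the point: the FOO_SPARSE config key (`foo_sparse_cfg`) and the AttentionTraits field names are invented. `_is_csa_hca` is the guard sketched earlier:

```python
def detect_attention(cfg: dict) -> AttentionTraits:
    heads = cfg.get("num_attention_heads", 0)
    kv_heads = cfg.get("num_key_value_heads", heads)
    if _is_csa_hca(cfg):                                        # 1 CSA_HCA
        variant = AttentionVariant.CSA_HCA
    elif "nsa_config" in cfg or "sparse_attention_cfg" in cfg:  # 2 NSA
        variant = AttentionVariant.NSA
    elif "q_lora_rank" in cfg or "kv_lora_rank" in cfg:         # 3 MLA
        variant = AttentionVariant.MLA
    elif "foo_sparse_cfg" in cfg:                               # new: before MQA/GQA
        variant = AttentionVariant.FOO_SPARSE
    elif kv_heads == 1:                                         # 4 MQA
        variant = AttentionVariant.MQA
    elif kv_heads < heads:                                      # 5 GQA
        variant = AttentionVariant.GQA
    else:                                                       # 6 MHA default
        variant = AttentionVariant.MHA
    # shape fields are always populated, whatever the variant
    return AttentionTraits(variant=variant, num_heads=heads, num_kv_heads=kv_heads,
                           head_dim=cfg.get("head_dim", cfg["hidden_size"] // max(heads, 1)))
```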
5. Pass through new config fields via auxiliary. In `detector.py`'s main path, add the field to the pass-through dict (sketch below).
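Something like the following, where the profile is assembled:

```python
# inside detector.py's main path, when building the profile (sketch)
auxiliary = {
    "reuse_kv_across_layers": cfg.get("reuse_kv_across_layers", False),
}
```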
6. Modify the KV formula in `architecture/formulas/kv_cache.py` to read the new field (sketch below).
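One way the hook could look; the function name is invented, and the real kv_cache.py may inline this:

```python
def apply_auxiliary_modifiers(total: float, profile) -> float:
    # New hook: reuse_kv_across_layers halves the KV cache
    if profile.auxiliary.get("reuse_kv_across_layers"):
        total *= 0.5
    return total
```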
7. Add a fixture-based detector test in `tests/test_detector.py` (sketch below).
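Assuming `detect()` is the entry point sketched earlier, with `AttentionVariant` and `Confidence` imported from the package:

```python
import json
from pathlib import Path

def test_foo_v1_profile():
    cfg = json.loads(Path("tests/fixtures/configs/foo_v1.json").read_text())
    profile = detect(cfg)
    assert profile.model_type == "foo_v1"
    assert profile.moe is None                        # dense
    assert profile.attention.variant == AttentionVariant.GQA
    assert profile.auxiliary["reuse_kv_across_layers"] is True
    assert profile.confidence == Confidence.HIGH      # registered model_type
```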
8. Add a formula test in `tests/test_formulas.py` (sketch below).
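Using the `estimate_kv_cache_bytes` sketch from the formula section and a hypothetical `foo_profile` fixture:

```python
import dataclasses
import pytest

def test_reuse_kv_halves_cache(foo_profile):
    base = dataclasses.replace(foo_profile, auxiliary={})   # flag stripped
    with_flag = estimate_kv_cache_bytes(foo_profile, seq_len=8192, dtype_bytes=2, tp_size=1)
    without = estimate_kv_cache_bytes(base, seq_len=8192, dtype_bytes=2, tp_size=1)
    assert with_flag == pytest.approx(without / 2)
```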
9. Add a compat matrix entry in `src/llm_cal/engine_compat/matrix.yaml` with at least one source (release notes / PR). Mark verification_level honestly (sketch below).
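A cited-level sketch. The nesting, URL, and date are placeholders; only the field names follow the matrix rules in the next section:

```yaml
foo_v1:
  vllm:
    supported: true
    verification_level: cited   # not verified: nobody has run it yet
    sources:
      - url: https://example.com/vllm-release-notes  # placeholder
        captured_date: 2025-01-01                    # placeholder
    caveats_en: []
    caveats_zh: []
```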
10. Add i18n for any new trait string in `src/llm_cal/common/i18n.py`. If your new field is user-facing in the formatter, both `en` and `zh` must exist (sketch below).
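A sketch; the key layout of i18n.py is assumed:

```python
STRINGS = {
    "trait.reuse_kv_across_layers": {
        "en": "KV reuse across layers (halves KV cache)",
        "zh": "跨层复用 KV(KV 缓存减半)",
    },
}
```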
Finally, run the full suite: `pytest` + `mypy src` + `ruff check`. If green, PR.
How to add a new GPU¶
Single-file change: src/llm_cal/hardware/gpu_database.yaml. No code.
```yaml
- id: MI300X
  aliases: [MI300X-192G]
  memory_gb: 192
  nvlink_bandwidth_gbps: 0   # xGMI instead — caller handles interconnect notes
  fp16_tflops: 1307
  fp8_support: true
  fp4_support: false
  notes_en: "AMD flagship. xGMI interconnect. vLLM support via ROCm."
  notes_zh: "AMD 旗舰,xGMI 互联,vLLM 通过 ROCm 支持。"
```
That's it. The GPU becomes queryable by ID or any alias immediately.
Add tests/test_hardware.py::test_mi300x_is_loaded if you want to anchor
the new entry against regressions.
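A minimal anchor test, assuming a hypothetical `load_gpu_database()` accessor; the real lookup API may differ:

```python
def test_mi300x_is_loaded():
    db = load_gpu_database()             # hypothetical accessor
    gpu = db.get("MI300X")
    assert gpu is not None
    assert gpu.memory_gb == 192
    assert db.get("MI300X-192G") is gpu  # alias resolves to the same entry
```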
How to add an engine compat entry¶
src/llm_cal/engine_compat/matrix.yaml. Follow the existing format. Critical
rules:
- `verification_level: verified` requires at least one `type: tested` source with `tester`, `date`, `hardware`, `metrics`. Don't claim `verified` unless you actually ran it.
- `verification_level: cited` requires at least one `sources[]` entry with a URL and `captured_date`.
- `verification_level: unverified` — empty `sources[]` is allowed, but the tool will surface this loudly in the UI so users know the entry is a guess.
- Bilingual notes: `caveats_en` and `caveats_zh` are both required (can be empty arrays). Flag-level `note_en`/`note_zh` are optional.
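What a `verified` entry must carry under these rules; all values below are placeholders:

```yaml
verification_level: verified
sources:
  - type: tested
    tester: "@your-handle"               # placeholder
    date: 2025-01-01                     # placeholder
    hardware: "8x H100 80G"
    metrics: { throughput_tok_s: 1234 }  # placeholder
caveats_en: []
caveats_zh: []
```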
Label discipline — the tool's soul¶
Every number in the output is tagged. This is non-negotiable. Rules by context:
| Source of value | Label |
|---|---|
| HF API `model_info().siblings[].size` | `[verified]` |
| config.json field read directly | `[verified]` |
| sum(safetensors file sizes) | `[verified]` (it IS a direct read, even though it's a sum) |
| `observed_bytes / total_params` = bits/param | `[inferred]` |
| Nearest-anchor quantization match | `[inferred]` |
| KV cache computed from profile | `[estimated]` |
| Weight computed from profile (not observed) | `[estimated]` |
| Engine support from matrix with sources | `[cited]` |
| Engine support from matrix without sources | `[unverified]` |
| Graceful degradation path (unknown architecture) | `[unknown]` |
Do NOT:
- Label a computed number [verified]. Even bits/param is [inferred]
because it's a derivation.
- Show a green checkmark next to an [unverified] entry. The UI intentionally
makes these loud.
- Introduce a new label without updating i18n keys and legend rendering.
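For orientation only, the label vocabulary as a sketch; the real type in the codebase may be shaped differently:

```python
from enum import Enum

class Label(str, Enum):
    VERIFIED = "verified"      # direct reads only, never derivations
    INFERRED = "inferred"      # derived from observed values
    ESTIMATED = "estimated"    # computed from the profile alone
    CITED = "cited"            # matrix entry backed by sources
    UNVERIFIED = "unverified"  # matrix entry without sources
    UNKNOWN = "unknown"        # graceful-degradation path
```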