Contributing to llm-cal

Dev setup

git clone https://github.com/FlyTOmeLight/llm-cal.git
cd llm-cal
python3.11 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

# one-time: install pre-commit hooks
pre-commit install

Run the full verification loop:

ruff format src tests          # auto-format
ruff check src tests            # lint
mypy src                        # type-check (strict mode)
pytest -q                       # tests (must be 100% passing)

All four checks must pass before a PR can be merged.

What to work on

Contributions are welcome. The areas below are listed roughly in order of value:

1. Data updates (no code, easiest)

New GPUs — append to src/llm_cal/hardware/gpu_database.yaml and add one test in tests/test_hardware.py.
New engine compat entries — append to src/llm_cal/engine_compat/matrix.yaml with sources[]. Honest verification_level required (see below).
verified matrix entries — if you have real hardware and actually ran a config, PR the result with type: tested sources (hardware, date, metrics). This is the most valuable kind of contribution because v0.1 ships with zero verified entries.

2. New architectures

See the Architecture Guide for the 10-step checklist. A new architecture typically touches:

Fixture (1 file)
detector.py KNOWN_MODEL_TYPES
traits.py (only if new attention variant)
formulas/kv_cache.py (only if new KV behavior)
Tests (detector + formulas)
matrix.yaml entry

3. i18n

To add a locale, extend src/llm_cal/common/i18n.py with its translations. Every key needs an entry in the new locale. Add tests in tests/test_i18n.py.
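The "every key needs an entry" requirement can be checked mechanically. The sketch below assumes translations are held as a mapping from locale code to key/string pairs; the real layout of i18n.py may differ, and TRANSLATIONS, missing_keys, and the sample keys are hypothetical names used only for illustration.

```python
# Hypothetical layout: locale code -> {message key -> translated string}.
# The real structure in src/llm_cal/common/i18n.py may differ.
TRANSLATIONS: dict[str, dict[str, str]] = {
    "en": {"gpu.not_found": "GPU not found", "model.unknown": "Unknown model"},
    "zh": {"gpu.not_found": "未找到 GPU"},  # "model.unknown" is missing on purpose
}


def missing_keys(
    translations: dict[str, dict[str, str]], reference: str = "en"
) -> dict[str, set[str]]:
    """Return, per locale, the keys that the reference locale has but it lacks."""
    expected = set(translations[reference])
    gaps: dict[str, set[str]] = {}
    for locale, entries in translations.items():
        absent = expected - set(entries)
        if absent:
            gaps[locale] = absent
    return gaps


print(missing_keys(TRANSLATIONS))  # {'zh': {'model.unknown'}}
```

A test in tests/test_i18n.py could assert that such a check returns an empty mapping for the real translation table.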

4. New sources

The ModelScope SDK decision is still pending; see docs/adr/001-modelscope-integration-strategy.md. Future sources (local directories, custom registries) go in src/llm_cal/model_source/.

The two hard rules

Rule 1: Label discipline

Every user-facing number passes through AnnotatedValue[T], and the label must match how the value was obtained. See the label table in the Architecture Guide.

Violation of this rule is the only reason a PR will be rejected outright. The whole point of the tool is that users can trust what the labels mean.

Rule 2: Honest verification levels

engine_compat/matrix.yaml entries:

verified: only with type: tested sources containing real metrics
cited: requires at least one URL with captured_date
unverified: allowed but surfaces loudly in the UI

Don't upgrade a cited entry to verified without actually running the configuration on real hardware.
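The three levels can be expressed as a small consistency check. This is a sketch over plain dicts, not the project's Pydantic schema: verification_level, sources, type: tested, and captured_date come from this guide, while the url and metrics key names, the check_entry helper, and the sample values are assumptions.

```python
def check_entry(entry: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry is consistent."""
    level = entry.get("verification_level")
    sources = entry.get("sources", [])
    problems: list[str] = []

    if level == "verified":
        # Needs at least one tested source that carries real metrics.
        if not any(s.get("type") == "tested" and s.get("metrics") for s in sources):
            problems.append("verified requires a type: tested source with metrics")
    elif level == "cited":
        # Needs at least one URL together with the date it was captured.
        if not any(s.get("url") and s.get("captured_date") for s in sources):
            problems.append("cited requires a URL with captured_date")
    elif level != "unverified":
        problems.append(f"unknown verification_level: {level!r}")
    return problems


# Placeholder data for illustration only.
cited_entry = {
    "verification_level": "cited",
    "sources": [{"url": "https://example.com/notes", "captured_date": "2025-01-01"}],
}
upgraded = {**cited_entry, "verification_level": "verified"}

print(check_entry(cited_entry))  # []
print(check_entry(upgraded))  # ['verified requires a type: tested source with metrics']
```

The second call shows the case this rule forbids: relabeling a cited entry as verified without adding a tested source.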

Code style

Python 3.11+ (for enum.StrEnum; structural pattern matching where useful)
Ruff-formatted, line length 100
mypy strict mode, no # type: ignore without a reason comment
Pydantic v2 for all YAML schemas
Keep cli.py thin (< 60 lines) — orchestration belongs in core/evaluator.py

Chinese punctuation is intentional in i18n / Chinese-facing error messages. If you add a file with Chinese content, add the path to the ruff per-file-ignores whitelist in pyproject.toml for RUF001/RUF002/RUF003.

Commit messages

Conventional-commits style preferred:

feat(scope): ... — new capability
fix(scope): ... — bug fix
docs: ... — doc-only changes
test: ... — test-only changes
chore: ... — tooling / build

Good scope names match top-level modules: architecture, fleet, engine_compat, output, formulas, model_source, i18n.

Testing philosophy

Critical regression tests are marked with a CRITICAL: docstring prefix and must never be weakened or removed. Examples:
test_csa_hca_length_mismatch_fallthrough (detection guard)
test_fp4_fp8_pack_identified (tool's core value prop)
test_commit_sha_mismatch_invalidates (cache correctness)
test_tp_divisibility_constraint (fleet correctness)
test_unverified_match_shows_warning_in_output (honesty constraint)
Fixture reuse over mocks for config-based tests. Real model config.json samples live in tests/fixtures/configs/.
No network in CI tests. Use fixtures built with the responses library (to be seeded by scripts/capture-fixtures.py, which is planned for v0.1 finalization).
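The CRITICAL: docstring convention makes protected tests easy to find with a few lines of introspection. The sketch below shows the shape of such a docstring and a hypothetical helper, critical_tests, that lists the marked functions; the test bodies are deliberately empty placeholders, not the project's real assertions.

```python
import inspect
from collections.abc import Callable


def critical_tests(namespace: dict[str, Callable]) -> list[str]:
    """List test functions whose docstring starts with the CRITICAL: prefix."""
    names = []
    for name, obj in namespace.items():
        if not (name.startswith("test_") and inspect.isfunction(obj)):
            continue
        doc = inspect.getdoc(obj) or ""
        if doc.startswith("CRITICAL:"):
            names.append(name)
    return sorted(names)


def test_tp_divisibility_constraint():
    """CRITICAL: fleet correctness. Do not weaken or remove."""
    # Body elided; the real assertions live in the project's test suite.


def test_something_ordinary():
    """Regular test without the prefix."""


candidates = {f.__name__: f for f in (test_tp_divisibility_constraint, test_something_ordinary)}
print(critical_tests(candidates))  # ['test_tp_divisibility_constraint']
```

A reviewer could compare such a list before and after a PR to spot a critical test that was removed or renamed.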

Design doc

The v0.1 design doc lives outside the repo (in the author's gstack workspace). Key decisions:

Pass-through "traits composition" model (not inheritance) — lets new architectures combine freely
Engine compat matrix as data, not code — community contributions are PRs against YAML, not Python
Six-level labels, enforced by enum.StrEnum
TP-aware KV sharding (matches vLLM behavior, not naive replication)

The trade-offs and rejected alternatives (thin-aggregator approach, data-driven community registry without core) are documented in the design doc.
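To illustrate what TP-aware KV sharding means in the list above, here is a rough sketch of per-GPU KV cache size for standard multi-head or grouped-query attention. It is not the code in formulas/kv_cache.py: the function name and parameters are hypothetical, and it assumes KV heads are split evenly across tensor-parallel ranks, with each rank keeping at least one head once the TP degree exceeds the KV head count.

```python
def kv_cache_bytes_per_gpu(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    num_tokens: int,
    tp: int,
    bytes_per_elem: int = 2,
) -> int:
    """Per-GPU KV cache size in bytes under tensor parallelism of degree tp."""
    # Heads are split across TP ranks; once tp exceeds the KV head count each
    # rank still holds one (replicated) head, so the per-rank count floors at 1.
    heads_per_gpu = max(1, num_kv_heads // tp)
    # The factor 2 covers the key tensor and the value tensor.
    return 2 * num_layers * heads_per_gpu * head_dim * num_tokens * bytes_per_elem


MIB = 1024 * 1024
for tp in (1, 4, 16):
    size = kv_cache_bytes_per_gpu(
        num_layers=32, num_kv_heads=8, head_dim=128, num_tokens=4096, tp=tp
    )
    print(tp, size // MIB, "MiB")
# 1 512 MiB
# 4 128 MiB
# 16 64 MiB
```

A naive-replication estimate would report 512 MiB per GPU at every TP degree. Architectures with a different KV layout need their own formula.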

Questions

Open an issue. The maintainer usually replies within a week, and often sooner.