Contributing to llm-cal¶
Dev setup¶
git clone https://github.com/FlyTOmeLight/llm-cal.git
cd llm-cal
python3.11 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
# one-time: install pre-commit hooks
pre-commit install
Run the full verification loop:
ruff format src tests # auto-format
ruff check src tests # lint
mypy src # type-check (strict mode)
pytest -q # tests (must be 100% passing)
All of these are gates for PR merge.
What to work on¶
Contributions are welcome. The areas below are listed roughly in order of value:
1. Data updates (no code, easiest)¶
- New GPUs: append to src/llm_cal/hardware/gpu_database.yaml and add one test in tests/test_hardware.py.
- New engine compat entries: append to src/llm_cal/engine_compat/matrix.yaml with sources[]. Honest verification_level required (see below).
- verified matrix entries: if you have real hardware and actually ran a config, PR the result with type: tested sources (hardware, date, metrics). This is the most valuable kind of contribution because v0.1 ships with zero verified entries.
2. New architectures¶
See Architecture Guide for the 10-step checklist. Typically touches:
- Fixture (1 file)
- detector.py KNOWN_MODEL_TYPES
- traits.py (only if new attention variant)
- formulas/kv_cache.py (only if new KV behavior)
- Tests (detector + formulas)
- matrix.yaml entry
3. i18n¶
To add a locale, extend src/llm_cal/common/i18n.py with translations. Every key
needs an entry. Test in tests/test_i18n.py.
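The "every key needs an entry" requirement can be checked mechanically. The following is a minimal sketch, assuming translations are stored as a mapping from locale to key/message pairs. The TRANSLATIONS name, the message keys, and the missing_keys helper are all hypothetical and are not taken from i18n.py.

```python
# Hypothetical layout: locale -> {message key -> translated text}.
# The real structure in src/llm_cal/common/i18n.py may differ.
TRANSLATIONS: dict[str, dict[str, str]] = {
    "en": {"gpu.not_found": "GPU not found", "kv.total": "Total KV cache"},
    "zh": {"gpu.not_found": "未找到 GPU"},  # "kv.total" is missing on purpose
}


def missing_keys(
    translations: dict[str, dict[str, str]], reference: str = "en"
) -> dict[str, set[str]]:
    """Return, per locale, the keys present in the reference locale but absent there."""
    expected = set(translations[reference])
    gaps: dict[str, set[str]] = {}
    for locale, messages in translations.items():
        absent = expected - set(messages)
        if absent:
            gaps[locale] = absent
    return gaps


gaps = missing_keys(TRANSLATIONS)
```

A test in tests/test_i18n.py could then assert that the returned mapping is empty.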
4. New sources¶
ModelScope SDK decision pending — see docs/adr/001-modelscope-integration-strategy.md.
Future sources (local directories, custom registries) go in
src/llm_cal/model_source/.
The two hard rules¶
Rule 1: Label discipline¶
Every user-facing number passes through AnnotatedValue[T], and the label must
match how the value was obtained. See the label table in
Architecture Guide.
Violation of this rule is the only reason a PR will be rejected outright. The whole point of the tool is that users can trust what the labels mean.
Rule 2: Honest verification levels¶
engine_compat/matrix.yaml entries:
- verified — only with type: tested sources containing real metrics
- cited — requires at least one URL with captured_date
- unverified — allowed but surfaces loudly in the UI
Don't upgrade a cited entry to verified without actually running the config on real hardware.
Code style¶
- Python 3.11+ (for enum.StrEnum, structural pattern matching where useful)
- Ruff-formatted, line length 100
- mypy strict mode, no # type: ignore without a reason comment
- Pydantic v2 for all YAML schemas
- Keep cli.py thin (< 60 lines); orchestration belongs in core/evaluator.py
Chinese punctuation is intentional in i18n / Chinese-facing error messages.
If you add a file with Chinese content, add the path to the ruff
per-file-ignores whitelist in pyproject.toml for RUF001/RUF002/RUF003.
Commit messages¶
Conventional-commits style preferred:
- feat(scope): ... (new capability)
- fix(scope): ... (bug fix)
- docs: ... (doc-only changes)
- test: ... (test-only changes)
- chore: ... (tooling / build)
Good scope names match top-level modules: architecture, fleet,
engine_compat, output, formulas, model_source, i18n.
Testing philosophy¶
- Critical regression tests are marked with a CRITICAL: docstring prefix and must never be weakened or removed. Examples:
  - test_csa_hca_length_mismatch_fallthrough (detection guard)
  - test_fp4_fp8_pack_identified (tool's core value prop)
  - test_commit_sha_mismatch_invalidates (cache correctness)
  - test_tp_divisibility_constraint (fleet correctness)
  - test_unverified_match_shows_warning_in_output (honesty constraint)
- Fixture reuse over mocks for config-based tests. Real model config.json samples live in tests/fixtures/configs/.
- No network in CI tests. Use responses library fixtures (seeded by scripts/capture-fixtures.py, planned for v0.1 finalization).
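The CRITICAL: convention might look like the sketch below. Both split_heads and the test are hypothetical stand-ins rather than code from the repository, and a real test would more likely use pytest.raises than a bare try/except. The is_critical helper shows one way a reviewer or script could find marked tests.

```python
from collections.abc import Callable


def split_heads(num_heads: int, tp: int) -> int:
    """Hypothetical function under test: attention heads per tensor-parallel rank."""
    if num_heads % tp != 0:
        raise ValueError(f"{num_heads} heads not divisible by tp={tp}")
    return num_heads // tp


def test_heads_must_divide_evenly() -> None:
    """CRITICAL: uneven head splits must be rejected, never rounded."""
    try:
        split_heads(num_heads=32, tp=3)
    except ValueError:
        return
    raise AssertionError("expected ValueError for an uneven split")


def is_critical(test_func: Callable[..., object]) -> bool:
    """Detect the CRITICAL: docstring prefix on a test function."""
    return (test_func.__doc__ or "").lstrip().startswith("CRITICAL:")
```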
Design doc¶
The v0.1 design doc lives outside the repo (in the author's gstack workspace). Key decisions:
- Pass-through "traits composition" model (not inheritance) — lets new architectures combine freely
- Engine compat matrix as data, not code — community contributions are PRs against YAML, not Python
- Six-level labels, enforced by enum.StrEnum
- TP-aware KV sharding (matches vLLM behavior, not naive replication)
The trade-offs and rejected alternatives (thin-aggregator approach, data-driven community registry without core) are documented in the design doc.
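To make the TP-aware KV sharding decision concrete, here is a minimal sketch of the per-GPU arithmetic for a standard multi-head or grouped-query attention layout. The function name and parameters are illustrative, not the API of formulas/kv_cache.py, and the max(1, ...) replication rule reflects how vLLM is understood to partition KV heads, which is an assumption here.

```python
def kv_bytes_per_token_per_gpu(
    num_layers: int, num_kv_heads: int, head_dim: int, tp: int, dtype_bytes: int = 2
) -> int:
    """KV cache bytes that one token occupies on one GPU under tensor parallelism."""
    # Each rank holds its share of KV heads; when tp exceeds the head count,
    # heads are replicated so every rank keeps at least one (assumed vLLM behavior).
    heads_per_gpu = max(1, num_kv_heads // tp)
    # The factor 2 covers the separate key and value tensors.
    return 2 * num_layers * heads_per_gpu * head_dim * dtype_bytes


# Illustrative model shape: 32 layers, 8 KV heads, head_dim 128, 2-byte dtype.
sharded = kv_bytes_per_token_per_gpu(num_layers=32, num_kv_heads=8, head_dim=128, tp=4)
unsharded = kv_bytes_per_token_per_gpu(num_layers=32, num_kv_heads=8, head_dim=128, tp=1)
```

With these numbers each GPU at tp=4 holds a quarter of the unsharded footprint. A naive replication model would instead charge every GPU the full amount.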
Questions¶
Open an issue. Maintainer replies within a week (probably faster).