Submit a system
Any system reachable via an OpenAI-compatible chat completion endpoint can be evaluated. Closed commercial APIs and open-weight models are both eligible. Companion Bench is system-agnostic by contract — the harness never imports any vendor-specific package.
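In practice the contract is just the `/chat/completions` wire format. A minimal sketch, assuming the standard `openai` Python client; the endpoint, model name, and env var are placeholders taken from the example manifest below, not harness internals:

```python
# Minimal sketch of the only integration contract: an
# OpenAI-compatible chat-completions endpoint. Endpoint, model name
# and key env var are illustrative placeholders.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.my-org.example/v1",  # any compatible endpoint
    api_key=os.environ["MY_ORG_API_KEY"],
)
reply = client.chat.completions.create(
    model="my-org/my-model-v1",
    messages=[{"role": "user", "content": "Hello"}],
)
print(reply.choices[0].message.content)
```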
1. Categories
- Open weight, fully reproducible — model weights, system prompt, harness config all public.
- Closed model, API-reproducible — vendor API, system prompt and config public, results re-runnable on the same API.
- Bespoke system — composite systems with proprietary memory / personalisation layers; reproducibility limited to the same vendor instance.
We do not collapse these into one column. Comparing a closed bespoke system to an open-weight base model is not apples-to-apples.
2. Manifest schema
A submission is a single YAML manifest:
```yaml
submission_id: my-team-2026-q2
system_name: My Companion v1
model_identifier: my-org/my-model-v1
base_url: https://api.my-org.example/v1
api_key_env: MY_ORG_API_KEY
system_prompt: |
  You are a thoughtful, warm conversational companion…
generation_config:
  temperature: 0.7
  top_p: 1.0
  max_tokens: 512
attestation:
  no_public_test_set_tuning: true
  no_scenario_specific_prompt: true
  no_companion_bench_derivative_in_training: true
  reproducible_endpoint: true
leaderboard_category: closed-api
```
The full schema lives in companion-bench-submission-protocol.md. All attestation flags must be true; a submission found to violate any of them at run time is disqualified.
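A minimal validation sketch, assuming PyYAML and the field names from the example above; the authoritative schema and error handling live in the protocol doc:

```python
# Hedged sketch of manifest validation. Field names follow the
# example manifest above; the authoritative schema is
# companion-bench-submission-protocol.md.
import os
import yaml

REQUIRED_ATTESTATIONS = (
    "no_public_test_set_tuning",
    "no_scenario_specific_prompt",
    "no_companion_bench_derivative_in_training",
    "reproducible_endpoint",
)

def validate_manifest(path: str) -> dict:
    with open(path) as f:
        manifest = yaml.safe_load(f)
    attestation = manifest.get("attestation", {})
    # Every flag must be literally true, not merely truthy.
    failed = [k for k in REQUIRED_ATTESTATIONS if attestation.get(k) is not True]
    if failed:
        raise ValueError(f"attestation flags not true: {failed}")
    # Keys are referenced by env-var name and never stored inline.
    if manifest["api_key_env"] not in os.environ:
        raise ValueError(f"{manifest['api_key_env']} is not set")
    return manifest
```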
3. Run it
```bash
pip install companion-bench

# Smoke first — uses fakes, no API spend.
companion-bench smoke

# Real submission against any OpenAI-compatible endpoint.
python scripts/companion_bench/run_real_submission.py \
  --submission examples/submission.yaml \
  --user-sim-model anthropic/claude-3.7-sonnet \
  --user-sim-key-env ANTHROPIC_API_KEY \
  --perturn-model anthropic/claude-3.7-sonnet \
  --perturn-key-env ANTHROPIC_API_KEY \
  --arc-model openai/gpt-5 \
  --arc-key-env OPENAI_API_KEY \
  --artifact-dir artifacts/companion-bench/your-submission/
```
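Before spending tokens, it can be worth a pre-flight check that every key env var named in the invocation is set. A sketch using the variable names from the example above, not a fixed requirement of the harness:

```python
# Pre-flight sketch: fail fast if a key env var named in the
# example invocation above is missing. The list mirrors that
# command line, not any fixed harness requirement.
import os
import sys

REQUIRED_ENV = ("MY_ORG_API_KEY", "ANTHROPIC_API_KEY", "OPENAI_API_KEY")
missing = [v for v in REQUIRED_ENV if not os.environ.get(v)]
if missing:
    sys.exit(f"missing key env vars: {', '.join(missing)}")
```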
Cost estimate
- Inference on submitted system: $10–60 (system-dependent)
- Per-turn rubric judge: $15–25
- Arc-level judge: $5–10
- Pairwise Elo (vs reference systems): $10–20
- Total per submission: $40–115
For comparison, EQ-Bench 3 costs ≈ $10–15 and MT-Bench ≈ $5. Neither covers multi-session evaluation.
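The total band is simply the per-stage bands summed:

```python
# Sanity check on the cost bands above (USD per submission).
low = 10 + 15 + 5 + 10    # -> 40
high = 60 + 25 + 10 + 20  # -> 115
print(f"${low}-${high} per submission")
```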
4. Verification
Organisers re-run one randomly selected public-test arc per submission to verify reproducibility (RFC §7.3). Results that deviate beyond seed variance (> 5% on any axis) are flagged. The verifier seed is derived deterministically from the submission_id, so any third party can recompute which arc will be re-run and compare transcript hashes.
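A sketch of the deterministic selection, assuming SHA-256 over the submission_id and an ordered list of public arcs; the exact derivation is defined in RFC §7.3:

```python
# Hedged sketch of verifier-arc selection. RFC §7.3 defines the
# real derivation; the SHA-256 construction here is an assumption.
import hashlib

def verifier_arc(submission_id: str, public_arcs: list[str]) -> str:
    digest = hashlib.sha256(submission_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    return public_arcs[seed % len(public_arcs)]

# Any third party can recompute the choice for a published
# submission_id and compare transcript hashes against the rerun.
```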
5. Submit
Public PR: open a pull request against companionbench/bench adding your manifest under submissions/. The CI workflow validates schema and attestation; an organiser then triggers the scoring run.
Issue tracker: if you cannot use a PR (e.g., a closed-API key that cannot be shared with CI), file an issue with the submission-request template and an organiser will run the scoring on a self-hosted runner with the necessary keys.
6. What gets published
- Per-axis 0–100 scores, final geometric-mean score, A6 cap status.
- TrueSkill conservative rating (μ − 3σ) and Bradley-Terry score from pairwise comparisons.
- Cost telemetry (tokens spent on SUT, per-turn judge, arc judge).
- Per-submission detail page with full multi-session transcripts, callback ledger, per-turn rubric heatmap, and judge notes.
Held-out scenarios are scored but their bodies are never published — the leaderboard cites them by SHA-256 hash only.
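To make the two headline numbers concrete, a sketch with illustrative values, using the `trueskill` package for the conservative rating; the axis scores and the pairwise outcome are made up, not real results:

```python
# Illustrative computation of the two headline numbers; the axis
# scores and the pairwise outcome below are made up, not real data.
from statistics import geometric_mean
import trueskill

# Final score: geometric mean of the per-axis 0-100 scores.
axes = [72.0, 64.5, 80.1, 58.9]
final = geometric_mean(axes)

# Conservative rating (mu - 3*sigma) after one pairwise win.
sut, ref = trueskill.Rating(), trueskill.Rating()
sut, ref = trueskill.rate_1vs1(sut, ref)  # SUT beat a reference system
conservative = sut.mu - 3 * sut.sigma

print(f"final={final:.1f}  conservative={conservative:.2f}")
```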