Submit a system
Any system reachable via an OpenAI-compatible chat completion endpoint can be evaluated. Closed commercial APIs and open-weight models are both eligible. Companion Bench is system-agnostic by contract — the harness never imports any vendor-specific package.
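In practice the contract is just the `/chat/completions` wire format. A minimal sketch, assuming the standard `openai` Python client; the endpoint, model name, and env var are placeholders taken from the example manifest below, not harness internals:

```python
# Minimal sketch of the only integration contract: an
# OpenAI-compatible chat-completions endpoint. Endpoint, model name
# and key env var are illustrative placeholders.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.my-org.example/v1",  # any compatible endpoint
    api_key=os.environ["MY_ORG_API_KEY"],
)
reply = client.chat.completions.create(
    model="my-org/my-model-v1",
    messages=[{"role": "user", "content": "Hello"}],
)
print(reply.choices[0].message.content)
```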
1. Categories
- Open weight, fully reproducible — model weights, system prompt, harness config all public.
- Closed model, API-reproducible — vendor API, system prompt and config public, results re-runnable on the same API.
- Bespoke system — composite systems with proprietary memory / personalisation layers; reproducibility limited to the same vendor instance.
We do not collapse these into one column. Comparing a closed bespoke system to an open-weight base model is not apples-to-apples.
2. Manifest schema
A submission is a single YAML manifest:
```yaml
submission_id: my-team-2026-q2
system_name: My Companion v1
model_identifier: my-org/my-model-v1
base_url: https://api.my-org.example/v1
api_key_env: MY_ORG_API_KEY
system_prompt: |
  You are a thoughtful, warm conversational companion…
generation_config:
  temperature: 0.7
  top_p: 1.0
  max_tokens: 512
attestation:
  no_public_test_set_tuning: true
  no_scenario_specific_prompt: true
  no_companion_bench_derivative_in_training: true
  reproducible_endpoint: true
leaderboard_category: closed-api
```
The full schema lives in companion-bench-submission-protocol.md. All attestation flags must be true; a submission found to violate any of them at run time is disqualified.
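A minimal validation sketch, assuming PyYAML and the field names from the example above; the authoritative schema and error handling live in the protocol doc:

```python
# Hedged sketch of manifest validation. Field names follow the
# example manifest above; the authoritative schema is
# companion-bench-submission-protocol.md.
import os
import yaml

REQUIRED_ATTESTATIONS = (
    "no_public_test_set_tuning",
    "no_scenario_specific_prompt",
    "no_companion_bench_derivative_in_training",
    "reproducible_endpoint",
)

def validate_manifest(path: str) -> dict:
    with open(path) as f:
        manifest = yaml.safe_load(f)
    attestation = manifest.get("attestation", {})
    # Every flag must be literally true, not merely truthy.
    failed = [k for k in REQUIRED_ATTESTATIONS if attestation.get(k) is not True]
    if failed:
        raise ValueError(f"attestation flags not true: {failed}")
    # Keys are referenced by env-var name and never stored inline.
    if manifest["api_key_env"] not in os.environ:
        raise ValueError(f"{manifest['api_key_env']} is not set")
    return manifest
```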
3. Run it
```bash
pip install companion-bench

# Smoke first — uses fakes, no API spend.
companion-bench smoke

# Real submission against any OpenAI-compatible endpoint.
python scripts/companion_bench/run_real_submission.py \
  --submission examples/submission.yaml \
  --user-sim-model anthropic/claude-3.7-sonnet \
  --user-sim-key-env ANTHROPIC_API_KEY \
  --perturn-model anthropic/claude-3.7-sonnet \
  --perturn-key-env ANTHROPIC_API_KEY \
  --arc-model openai/gpt-5 \
  --arc-key-env OPENAI_API_KEY \
  --artifact-dir artifacts/companion-bench/your-submission/
```
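Before spending tokens, it can be worth a pre-flight check that every key env var named in the invocation is set. A sketch using the variable names from the example above, not a fixed requirement of the harness:

```python
# Pre-flight sketch: fail fast if a key env var named in the
# example invocation above is missing. The list mirrors that
# command line, not any fixed harness requirement.
import os
import sys

REQUIRED_ENV = ("MY_ORG_API_KEY", "ANTHROPIC_API_KEY", "OPENAI_API_KEY")
missing = [v for v in REQUIRED_ENV if not os.environ.get(v)]
if missing:
    sys.exit(f"missing key env vars: {', '.join(missing)}")
```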
Cost estimate
- Inference on submitted system: $10–60 (system-dependent)
- Per-turn rubric judge: $15–25
- Arc-level judge: $5–10
- Pairwise Elo (vs reference systems): $10–20
- Total per submission: $40–115
For comparison, EQ-Bench 3 costs ≈ $10–15 and MT-Bench ≈ $5. Neither covers multi-session evaluation.
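The total band is simply the per-stage bands summed:

```python
# Sanity check on the cost bands above (USD per submission).
low = 10 + 15 + 5 + 10    # -> 40
high = 60 + 25 + 10 + 20  # -> 115
print(f"${low}-${high} per submission")
```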
4. Verification
Organisers re-run one randomly selected public-test arc per submission to verify reproducibility (RFC §7.3). Results that deviate beyond seed variance (> 5% on any axis) are flagged. The verifier seed is derived deterministically from the submission_id, so any third party can recompute which arc will be re-run and compare transcript hashes.
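A sketch of the deterministic selection, assuming SHA-256 over the submission_id and an ordered list of public arcs; the exact derivation is defined in RFC §7.3:

```python
# Hedged sketch of verifier-arc selection. RFC §7.3 defines the
# real derivation; the SHA-256 construction here is an assumption.
import hashlib

def verifier_arc(submission_id: str, public_arcs: list[str]) -> str:
    digest = hashlib.sha256(submission_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    return public_arcs[seed % len(public_arcs)]

# Any third party can recompute the choice for a published
# submission_id and compare transcript hashes against the rerun.
```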
5. Submit
Public PR: open a pull request against companionbench/bench adding your manifest under submissions/. The CI workflow validates schema and attestation; an organiser then triggers the scoring run.
Issue tracker: if you cannot use a PR (e.g., a closed-API key that cannot be shared with CI), file an issue with the submission-request template and an organiser will run the scoring on a self-hosted runner with the necessary keys.
6. What gets published
- Per-axis 0–100 scores, final geometric-mean score, A6 cap status.
- TrueSkill conservative rating (μ − 3σ) and Bradley-Terry score from pairwise comparisons.
- Cost telemetry (tokens spent on SUT, per-turn judge, arc judge).
- Per-submission detail page with full multi-session transcripts, callback ledger, per-turn rubric heatmap, and judge notes.
Held-out scenarios are scored but their bodies are never published — the leaderboard cites them by SHA-256 hash only.
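To make the two headline numbers concrete, a sketch with illustrative values, using the `trueskill` package for the conservative rating; the axis scores and the pairwise outcome are made up, not real results:

```python
# Illustrative computation of the two headline numbers; the axis
# scores and the pairwise outcome below are made up, not real data.
from statistics import geometric_mean
import trueskill

# Final score: geometric mean of the per-axis 0-100 scores.
axes = [72.0, 64.5, 80.1, 58.9]
final = geometric_mean(axes)

# Conservative rating (mu - 3*sigma) after one pairwise win.
sut, ref = trueskill.Rating(), trueskill.Rating()
sut, ref = trueskill.rate_1vs1(sut, ref)  # SUT beat a reference system
conservative = sut.mu - 3 * sut.sigma

print(f"final={final:.1f}  conservative={conservative:.2f}")
```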