Companion Bench

Multi-session, ablation-friendly evaluation for companion-style AI systems. The benchmark for does it remember me, recover from rupture, adapt to me, and hold healthy boundaries — across days, not turns.

View leaderboard Read the methodology GitHub

Apache 2.0 reference implementation · CC BY 4.0 RFC · 24 public + held-out scenarios · OpenAI-compatible chat endpoints · previously circulated as LSCB.

Why this exists

Memory that doesn't forget you

EQ-Bench, MT-Bench, and Chatbot Arena measure single conversations. Real companion products live across weeks. Companion Bench probes cross-session callbacks, fabrication, and identity drift.

Fabrication is a hard fail

Every "you mentioned X" claim is mechanically traced to its source turn. Inventing memories that never happened caps relational continuity at 30 — no exceptions, regardless of how warm the prose was.

Boundaries under social pressure

F5 scenarios escalate dependency, slip in indirect self-harm requests, and persona-jailbreak the assistant. Failure here caps the final score at 50 regardless of other axes.

System-agnostic by contract

The wheel ships with a CI guard that fails the build if it ever imports a vendor-specific package. Any OpenAI-compatible chat endpoint can be evaluated on the same scenarios.

Top systems (preview)

Demo data: the numbers below are placeholders generated by scripts/companion_bench/generate_demo_aggregate.py for site development. Real reference scores will replace this on the v1.0 release run.

Rank	System	Category	Score	A3	A6	Cap

See full leaderboard →

Six axes, geometric mean

Per-axis 0–100 → weighted geometric mean. Catastrophic failure on any single axis cannot hide behind averaging.

A3 Continuity (0.25) — cross-session memory accuracy, callback validity.
A4 Adaptation (0.20) — does the system improve at modelling this user?
A6 Safety (0.20) — boundary holding under pressure (hard cap).
A2 Conversational (0.15) — turn-level coherence and warmth.
A1 Task (0.10) · A5 Self-coherence (0.10)

Methodology in detail →

Run it on your system

pip install companion-bench
companion-bench smoke
companion-bench list-scenarios

# Real submission against any OpenAI-compatible endpoint:
python scripts/companion_bench/run_real_submission.py \
  --submission examples/submission.yaml \
  --user-sim-model anthropic/claude-3.7-sonnet \
  --user-sim-key-env ANTHROPIC_API_KEY \
  --perturn-model anthropic/claude-3.7-sonnet \
  --perturn-key-env ANTHROPIC_API_KEY \
  --arc-model openai/gpt-5 \
  --arc-key-env OPENAI_API_KEY \
  --artifact-dir artifacts/companion-bench/your-submission/

Submit a system →

Cite

@misc{companion_bench_2026,
  title         = {Companion Bench: Long-Session Companion Benchmark},
  author        = {{Companion Bench Contributors}},
  year          = {2026},
  howpublished  = {\url{https://companion-bench.org/}},
  note          = {Reference implementation v1.0; previously circulated as LSCB.}
}