Leaderboard

Demo data: the numbers below were synthesised by scripts/companion_bench/generate_demo_aggregate.py for site development. Real reference-system scores will replace them once the v1.0 release run completes.


Rank	System	Category	Score	A1	A2	A3	A4	A5	A6	TS	BT	Cap

Click a system name to see its arc-by-arc detail page (per-session transcripts, callback ledger, per-turn rubric heatmap, judge notes, cost breakdown). Use the pairwise comparison viewer to put two systems side by side on the same scenarios.

TrueSkill confidence intervals

Forest plot of TrueSkill μ ± 3σ. Wider whiskers mean fewer pairwise matches have been recorded so far. The leaderboard column reports the conservative rating, μ − 3σ.
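
The interval and the conservative rating described above can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual build code: the Rating class, the system names, and the μ and σ values are all hypothetical, and ordering systems by conservative rating is an assumption made here for demonstration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Rating:
    """Hypothetical container for one system's TrueSkill estimate."""
    mu: float      # estimated skill (mean)
    sigma: float   # uncertainty (standard deviation)

    @property
    def interval(self) -> Tuple[float, float]:
        # Whisker endpoints of the forest plot: mu - 3*sigma to mu + 3*sigma.
        return (self.mu - 3 * self.sigma, self.mu + 3 * self.sigma)

    @property
    def conservative(self) -> float:
        # Lower whisker, i.e. the conservative rating mu - 3*sigma.
        return self.mu - 3 * self.sigma


# Made-up example values, not real benchmark results.
ratings = {
    "system_a": Rating(mu=30.0, sigma=2.0),
    "system_b": Rating(mu=32.0, sigma=4.0),
}

# Assumption for illustration: order systems by conservative rating, highest first.
ranked = sorted(ratings, key=lambda name: ratings[name].conservative, reverse=True)

for name in ranked:
    low, high = ratings[name].interval
    print(f"{name}: conservative={ratings[name].conservative:.1f} "
          f"interval=[{low:.1f}, {high:.1f}]")
```

In this made-up example system_b has the higher μ, but its wider whiskers give it a lower conservative rating (20.0 against 24.0 for system_a), which is the effect the forest plot is meant to make visible.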


No TrueSkill data yet. Run scripts/companion_bench/build_site.py --artifact-dir <dir> after at least two reference submissions have been scored.