Demo data: the numbers below were synthesised by
scripts/companion_bench/generate_demo_aggregate.py for site
development. Real reference-system scores will replace them once the v1.0
release run completes.
Rank  System  Category  Score  A1  A2  A3  A4  A5  A6  TS  BT  Cap
Click a system name to see its arc-by-arc detail page (per-session
transcripts, callback ledger, per-turn rubric heatmap, judge notes,
cost breakdown). Use the pairwise comparison viewer to put two systems
side by side on the same scenarios.
TrueSkill confidence intervals
Forest plot of TrueSkill μ ± 3σ. Wider whiskers mean fewer
pairwise matches have been recorded so far. The leaderboard column
cites the conservative rating (μ − 3σ).
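As a rough illustration of how the whiskers and the conservative rating relate, here is a minimal Python sketch. It is not taken from the site's build scripts: the Rating class, the rank_by_conservative helper, and the example values are all hypothetical, and the only thing assumed from the text above is the μ ± 3σ interval and the μ − 3σ conservative rating.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rating:
    """Hypothetical container for one system's TrueSkill estimate."""
    mu: float     # estimated skill mean
    sigma: float  # uncertainty; shrinks as more pairwise matches are recorded

    def interval(self, k: float = 3.0) -> tuple[float, float]:
        """Whisker endpoints (mu - k*sigma, mu + k*sigma) for the forest plot."""
        return (self.mu - k * self.sigma, self.mu + k * self.sigma)

    def conservative(self, k: float = 3.0) -> float:
        """Lower whisker endpoint, i.e. the conservative rating mu - k*sigma."""
        return self.mu - k * self.sigma


def rank_by_conservative(ratings: dict[str, Rating]) -> list[str]:
    """Order system names by conservative rating, highest first."""
    return sorted(ratings, key=lambda name: ratings[name].conservative(), reverse=True)


if __name__ == "__main__":
    # Made-up values for illustration only, not benchmark results.
    demo = {
        "system-a": Rating(mu=25.0, sigma=8.0),  # few matches, wide whiskers
        "system-b": Rating(mu=24.0, sigma=2.0),  # many matches, narrow whiskers
    }
    for name in rank_by_conservative(demo):
        low, high = demo[name].interval()
        print(f"{name}: conservative={demo[name].conservative():.1f} whiskers=[{low:.1f}, {high:.1f}]")
```

In this made-up example the system with the higher μ ranks lower, because its wide interval pulls μ − 3σ down. That is the behaviour the conservative rating is meant to produce while few matches have been recorded.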
No TrueSkill data yet. After at least two reference submissions have
been scored, run
scripts/companion_bench/build_site.py --artifact-dir <dir>