Methodology
OpenSOTA is a meta-leaderboard. We do not run our own evaluations; we aggregate the strongest public ones.
Weighting
Each source has a fixed weight per category. The composite score is the weighted average over the sources that *have* scored a model, so a missing source is excluded from the average rather than counted as zero. We also report a coverage percentage, so newly released models with partial source coverage stay visible instead of being unfairly buried. They are marked 'limited' until more sources publish their evaluations.
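As a rough illustration of the weighting described above, the sketch below computes a composite score and a coverage percentage for one model in one category. The function name, the source names, the weights, and the definition of coverage as the share of total weight that has reported are all assumptions made for this example; they are not taken from the OpenSOTA implementation.

```python
from typing import Dict, Optional, Tuple


def composite_score(
    scores: Dict[str, Optional[float]],
    weights: Dict[str, float],
) -> Tuple[Optional[float], float]:
    """Return (composite, coverage_pct) for one model in one category.

    `scores` maps a source name to its score, or None if that source
    has not evaluated the model. `weights` maps a source name to its
    fixed weight. Both structures are hypothetical.
    """
    # Keep only sources that have a weight and actually published a score.
    available = {
        source: value
        for source, value in scores.items()
        if value is not None and source in weights
    }

    total_weight = sum(weights.values())
    covered_weight = sum(weights[source] for source in available)
    if covered_weight == 0:
        # No weighted source has scored this model yet.
        return None, 0.0

    # Weighted average over reporting sources only: missing sources are
    # dropped from numerator and denominator, not treated as zero.
    composite = (
        sum(weights[source] * value for source, value in available.items())
        / covered_weight
    )

    # Assumption: coverage is the share of total weight that has reported.
    coverage_pct = 100.0 * covered_weight / total_weight
    return composite, coverage_pct


# Hypothetical sources and numbers, for illustration only.
example_weights = {"source_a": 2.0, "source_b": 1.0, "source_c": 1.0}
example_scores = {"source_a": 80.0, "source_b": 60.0, "source_c": None}
composite, coverage = composite_score(example_scores, example_weights)
```

In this example the model is averaged over source_a and source_b only, and coverage falls below 100% because source_c has not reported. A 'limited' label could then be applied below some coverage threshold; the threshold itself is not specified here.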
Model normalisation
We map every reported model identifier to a canonical name (e.g. 'claude-opus-4-7', 'Opus 4.7', 'Anthropic Claude Opus 4.7' all collapse to 'Claude Opus 4.7').
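One simple way to picture this mapping is an alias table keyed on a cleaned-up identifier. The sketch below is a minimal illustration, assuming a hand-maintained dictionary and a lowercase, whitespace-collapsing cleanup step; the `ALIASES` table and `canonical_name` function are hypothetical and only cover the three spellings quoted above.

```python
import re
from typing import Optional

# Hypothetical alias table: cleaned identifier -> canonical name.
ALIASES = {
    "claude-opus-4-7": "Claude Opus 4.7",
    "opus 4.7": "Claude Opus 4.7",
    "anthropic claude opus 4.7": "Claude Opus 4.7",
}


def canonical_name(raw: str) -> Optional[str]:
    """Map a reported model identifier to its canonical name.

    Returns None when the identifier is not in the alias table, so
    unknown models can be reviewed instead of being guessed at.
    """
    # Trim, lowercase, and collapse runs of whitespace before lookup.
    key = re.sub(r"\s+", " ", raw.strip().lower())
    return ALIASES.get(key)


names = [
    canonical_name("claude-opus-4-7"),
    canonical_name("Opus 4.7"),
    canonical_name("Anthropic Claude Opus 4.7"),
]
```

Returning None for an unrecognised identifier is a design choice in this sketch: it avoids silently merging two different models under one name.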
Disclaimer
Benchmarks are imperfect proxies. Production performance depends heavily on prompting, harness, and tooling, typically more than on the model itself.