Methodology

OpenSOTA is a meta-leaderboard. We do not run our own evaluations; we aggregate the strongest public ones.

Weighting

Each source has a fixed weight per category. A model's composite score is the weighted average over only the sources that *have* scored it; sources with no score for that model are left out rather than counted as zero. We also report a coverage percentage, so newly released models with partial source coverage stay visible without being unfairly buried: they are marked 'limited' until more sources publish their evaluations.
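To make the weighting concrete, here is a minimal Python sketch of a weighted average over available sources. It is an illustration, not our production code: the function name, the source names, and the weights are hypothetical, and it assumes coverage is measured as the share of total weight held by the sources that have scored the model (a simple count of sources would be another reasonable definition).

```python
from typing import Dict, Optional, Tuple


def composite_score(
    scores: Dict[str, Optional[float]],
    weights: Dict[str, float],
) -> Tuple[Optional[float], float]:
    """Return (composite, coverage) for one model in one category.

    scores maps source name -> score, or None if the source has not
    evaluated the model. weights maps source name -> fixed weight.
    """
    # Keep only sources that have actually published a score.
    covered = {s: v for s, v in scores.items() if v is not None and s in weights}

    total_weight = sum(weights.values())
    covered_weight = sum(weights[s] for s in covered)
    if covered_weight == 0:
        # No source has scored this model yet.
        return None, 0.0

    # Renormalise by the covered weight so missing sources do not count as zero.
    composite = sum(weights[s] * v for s, v in covered.items()) / covered_weight

    # Assumption: coverage is the weight share of the covered sources.
    coverage = covered_weight / total_weight
    return composite, coverage


# Hypothetical example: source_c has not evaluated the model yet.
example_weights = {"source_a": 0.5, "source_b": 0.3, "source_c": 0.2}
example_scores = {"source_a": 80.0, "source_b": 70.0, "source_c": None}
composite, coverage = composite_score(example_scores, example_weights)
```

Dividing by the covered weight instead of the total weight is what keeps a partially covered model from being dragged down by sources that have simply not published yet.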

Model normalisation

We map every reported model identifier to a canonical name, so that scores reported under different spellings are attributed to the same model (e.g. 'claude-opus-4-7', 'Opus 4.7', and 'Anthropic Claude Opus 4.7' all collapse to 'Claude Opus 4.7').
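One simple way to implement such a mapping is an alias table with light key normalisation. The sketch below is illustrative only: the table layout and the names ALIASES and canonical_name are hypothetical, and the only aliases listed are the three given in the example above.

```python
import re
from typing import Optional

# Hypothetical alias table, keyed by a lowercased, whitespace-collapsed identifier.
ALIASES = {
    "claude-opus-4-7": "Claude Opus 4.7",
    "opus 4.7": "Claude Opus 4.7",
    "anthropic claude opus 4.7": "Claude Opus 4.7",
}


def _key(identifier: str) -> str:
    """Lowercase and collapse whitespace so trivial variants share one key."""
    return re.sub(r"\s+", " ", identifier.strip().lower())


def canonical_name(identifier: str) -> Optional[str]:
    """Return the canonical model name, or None if the identifier is unknown."""
    return ALIASES.get(_key(identifier))
```

Returning None for an unknown identifier, rather than guessing, lets an unmapped model be flagged for manual review instead of being silently merged into the wrong entry.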

Disclaimer

Benchmarks are imperfect proxies. Production performance depends heavily on prompting, harness, and tooling — typically more than on the model itself.