Methodology

Benchmarks Overview

We present benchmark results as reported by model authors and by trusted third-party evaluators. Every score is accompanied by its source on the corresponding model page.

  • MMLU: Massive Multitask Language Understanding (knowledge and reasoning)
  • HellaSwag: commonsense reasoning and sentence completion
  • HumanEval: code generation, scored as pass@1 (see the sketch after this list)
  • GSM8K: grade-school math word problems
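
For reference, pass@1 is the probability that a single generated sample passes the benchmark's unit tests. The snippet below is a minimal sketch of the standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021); it is illustrative only and not necessarily the exact code any given evaluator runs.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator (Chen et al., 2021).

        n: total samples generated for a problem
        c: number of those samples that pass the unit tests
        k: evaluation budget (k = 1 for pass@1)
        """
        if n - c < k:
            # Every size-k subset contains at least one passing sample.
            return 1.0
        # 1 minus the probability that a random size-k subset contains no passing sample.
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 200 samples per problem, 37 of them pass -> pass@1 = 0.185
    print(pass_at_k(n=200, c=37, k=1))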

How We Use Benchmarks

Benchmarks are helpful signals, but they are not the whole story. We combine reported scores with real-world usage context, licensing, pricing, and context window size to help you choose the right model for your use case (a hypothetical record layout is sketched after the list below).

  • Scores are shown with sources whenever available
  • Comparisons highlight per-benchmark strengths rather than collapsing models into a single composite rank
  • Reported scores for open-source models may vary with quantization and inference setup
  • We plan to incorporate community evals and standardized harnesses over time
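
To make that combination concrete, here is a minimal sketch of what a single comparison record could look like. The class and field names are assumptions chosen for illustration, not the actual schema behind these pages.

    from __future__ import annotations

    from dataclasses import dataclass, field

    @dataclass
    class ModelEntry:
        """Hypothetical record combining the factors listed above."""
        name: str
        license: str                              # e.g. "Apache-2.0" or "proprietary"
        context_window_tokens: int                # maximum context length
        price_per_1m_input_tokens: float | None   # None when self-hosted or pricing varies
        benchmark_scores: dict[str, float] = field(default_factory=dict)  # e.g. {"MMLU": 78.5}
        score_sources: dict[str, str] = field(default_factory=dict)       # benchmark name -> source URL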