Methodology

Benchmarks Overview

We present benchmark results as reported by model authors and by trusted third-party evaluators. Every score is accompanied by its source on the corresponding model page.

  • MMLU: Massive Multitask Language Understanding (knowledge and reasoning)
  • HellaSwag: commonsense reasoning and sentence completion
  • HumanEval: code generation, scored as pass@1 (see the sketch after this list)
  • GSM8K: grade-school math word problems
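
For reference, pass@1 is the probability that a single generated sample passes the benchmark's unit tests. The snippet below is a minimal sketch of the standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021); it is illustrative only and not necessarily the exact code any given evaluator runs.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator (Chen et al., 2021).

        n: total samples generated for a problem
        c: number of those samples that pass the unit tests
        k: evaluation budget (k = 1 for pass@1)
        """
        if n - c < k:
            # Every size-k subset contains at least one passing sample.
            return 1.0
        # 1 minus the probability that a random size-k subset contains no passing sample.
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 200 samples per problem, 37 of them pass -> pass@1 = 0.185
    print(pass_at_k(n=200, c=37, k=1))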

How We Use Benchmarks

Benchmarks are helpful signals, but they are not the whole story. We combine reported scores with real-world usage context, licensing, pricing, and context window size to help you choose the right model for your use case (a hypothetical record layout is sketched after the list below).

  • Scores are shown with sources whenever available
  • Comparisons highlight per-benchmark strengths rather than collapsing models into a single composite rank
  • Reported scores for open-source models may vary with quantization and inference setup
  • We plan to incorporate community evals and standardized harnesses over time
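
To make that combination concrete, here is a minimal sketch of what a single comparison record could look like. The class and field names are assumptions chosen for illustration, not the actual schema behind these pages.

    from __future__ import annotations

    from dataclasses import dataclass, field

    @dataclass
    class ModelEntry:
        """Hypothetical record combining the factors listed above."""
        name: str
        license: str                              # e.g. "Apache-2.0" or "proprietary"
        context_window_tokens: int                # maximum context length
        price_per_1m_input_tokens: float | None   # None when self-hosted or pricing varies
        benchmark_scores: dict[str, float] = field(default_factory=dict)  # e.g. {"MMLU": 78.5}
        score_sources: dict[str, str] = field(default_factory=dict)       # benchmark name -> source URL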