Leaderboard
AI models rated two ways: a benchmark score for how often a model answers education-data questions correctly, and a community rating for how helpful educators find its answers.
How often each model answers education-data questions correctly, measured against known-correct answers.
How educators rank models in head-to-head Arena votes, broken down by question type — separate from the benchmark scores.
| # | Model | Overall | Accuracy | Consistency | Retrieval | Trends | Coaching | Equity | Research | Cost / Run |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.5 (xhigh) | 76.9% | 89.1% | 84.4% | 66.3% | 86.9% | 67.6% | 85.7% | 79.9% | $44.61 |
| 2 | GPT-5.5 (medium) | 76.7% | 87.4% | 85.7% | 66.7% | 90.0% | 62.8% | 85.3% | 80.7% | $27.98 |
| 3 | GPT-5.5 (high) | 75.5% | 86.2% | 84.3% | 67.7% | 86.6% | 62.5% | 85.1% | 77.1% | $37.13 |
| 4 | Gemini 3.1 Pro | 70.7% | 78.8% | 85.5% | 66.1% | 80.5% | 52.9% | 83.7% | 63.8% | $40.11 |
| 5 | GPT-5.5 (low) | 70.5% | 74.9% | 84.9% | 55.8% | 90.1% | 48.4% | 82.8% | 78.4% | $18.97 |
| 6 | Gemini 3.5 Flash | 69.8% | 78.9% | 84.5% | 59.1% | 81.2% | 63.9% | 81.4% | 64.6% | $25.19 |
| 7 | GPT-OSS 120B | 66.4% | 66.1% | 95.7% | 28.2% | 84.0% | 19.1% | 74.3% | 65.7% | $0.04 |
| 8 | Claude Opus 4.8 | 63.3% | 65.3% | 89.6% | 47.9% | 86.2% | 34.7% | 82.2% | 68.5% | $42.59 |
| 9 | Nemotron 3 Ultra 550B | 59.1% | 62.2% | 86.3% | 32.1% | 88.3% | 16.7% | 87.2% | 76.4% | $6.92 |
| 10 | Claude Sonnet 4.6 | 58.6% | 58.7% | 90.2% | 43.1% | 83.0% | 32.6% | 76.7% | 60.7% | $23.89 |
| 11 | Claude Haiku 4.5 | 56.5% | 62.9% | 87.6% | 40.1% | 76.7% | 35.1% | 75.7% | 58.1% | $7.72 |
Overall = 0.35·Accuracy + 0.20·Insight + 0.15·Evidence + 0.15·Honesty about limits + 0.10·Consistency + 0.05·Clarity
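As a rough illustration of how the weighted formula above combines component scores, here is a minimal Python sketch. The dictionary keys, the `overall_score` helper, and the example values are all hypothetical. The table does not publish Insight, Evidence, Honesty about limits, or Clarity scores, so this does not reproduce any model's listed Overall figure. It also assumes each component is on the same 0-100 scale.

```python
# Weights copied from the Overall formula; they sum to 1.0.
WEIGHTS = {
    "accuracy": 0.35,
    "insight": 0.20,
    "evidence": 0.15,
    "honesty_about_limits": 0.15,
    "consistency": 0.10,
    "clarity": 0.05,
}


def overall_score(components):
    """Return the weighted sum of component scores (hypothetical helper).

    `components` maps each key in WEIGHTS to a score on a 0-100 scale.
    """
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise KeyError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Made-up component scores, for illustration only (not from the leaderboard).
example = {
    "accuracy": 80.0,
    "insight": 70.0,
    "evidence": 60.0,
    "honesty_about_limits": 90.0,
    "consistency": 85.0,
    "clarity": 75.0,
}

print(round(overall_score(example), 2))
```

Because Accuracy carries 35% of the weight, a 10-point gain in Accuracy moves the Overall score by 3.5 points, while the same gain in Clarity moves it by only 0.5.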