How often each model answers education-data questions correctly, measured against known-correct answers.

| #  | Model                 | Overall | Accuracy | Consistency | Retrieval | Trends | Coaching | Equity | Research | Cost / Run |
|----|-----------------------|---------|----------|-------------|-----------|--------|----------|--------|----------|------------|
| 1  | GPT-5.5 (xhigh)       | 76.9%   | 89.1%    | 84.4%       | 66.3%     | 86.9%  | 67.6%    | 85.7%  | 79.9%    | $44.61     |
| 2  | GPT-5.5 (medium)      | 76.7%   | 87.4%    | 85.7%       | 66.7%     | 90.0%  | 62.8%    | 85.3%  | 80.7%    | $27.98     |
| 3  | GPT-5.5 (high)        | 75.5%   | 86.2%    | 84.3%       | 67.7%     | 86.6%  | 62.5%    | 85.1%  | 77.1%    | $37.13     |
| 4  | Gemini 3.1 Pro        | 70.7%   | 78.8%    | 85.5%       | 66.1%     | 80.5%  | 52.9%    | 83.7%  | 63.8%    | $40.11     |
| 5  | GPT-5.5 (low)         | 70.5%   | 74.9%    | 84.9%       | 55.8%     | 90.1%  | 48.4%    | 82.8%  | 78.4%    | $18.97     |
| 6  | Gemini 3.5 Flash      | 69.8%   | 78.9%    | 84.5%       | 59.1%     | 81.2%  | 63.9%    | 81.4%  | 64.6%    | $25.19     |
| 7  | GPT-OSS 120B          | 66.4%   | 66.1%    | 95.7%       | 28.2%     | 84.0%  | 19.1%    | 74.3%  | 65.7%    | $0.04      |
| 8  | Claude Opus 4.8       | 63.3%   | 65.3%    | 89.6%       | 47.9%     | 86.2%  | 34.7%    | 82.2%  | 68.5%    | $42.59     |
| 9  | Nemotron 3 Ultra 550B | 59.1%   | 62.2%    | 86.3%       | 32.1%     | 88.3%  | 16.7%    | 87.2%  | 76.4%    | $6.92      |
| 10 | Claude Sonnet 4.6     | 58.6%   | 58.7%    | 90.2%       | 43.1%     | 83.0%  | 32.6%    | 76.7%  | 60.7%    | $23.89     |
| 11 | Claude Haiku 4.5      | 56.5%   | 62.9%    | 87.6%       | 40.1%     | 76.7%  | 35.1%    | 75.7%  | 58.1%    | $7.72      |

Overall = 0.35·Accuracy + 0.20·Insight + 0.15·Evidence + 0.15·Honesty about limits + 0.10·Consistency + 0.05·Clarity
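
As a minimal sketch of how the weighted sum above could be computed, the Python below applies the six stated weights to a set of component scores. The function name `overall_score` and the example inputs are hypothetical, chosen only for illustration. They are not taken from the leaderboard, which does not list Insight, Evidence, Honesty about limits, or Clarity as separate columns.

```python
# Weights copied from the Overall formula; they sum to 1.00.
WEIGHTS = {
    "Accuracy": 0.35,
    "Insight": 0.20,
    "Evidence": 0.15,
    "Honesty about limits": 0.15,
    "Consistency": 0.10,
    "Clarity": 0.05,
}


def overall_score(components):
    """Return the weighted sum of component scores (each on a 0-100 scale).

    `components` maps each component name in WEIGHTS to its score.
    This helper is an illustrative assumption, not the benchmark's own code.
    """
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Hypothetical component scores, invented for this example only.
example = {
    "Accuracy": 80.0,
    "Insight": 70.0,
    "Evidence": 60.0,
    "Honesty about limits": 90.0,
    "Consistency": 85.0,
    "Clarity": 75.0,
}

print(f"Overall: {overall_score(example):.2f}%")
```

With these made-up inputs the sum is 28 + 14 + 9 + 13.5 + 8.5 + 3.75 = 76.75. Because Accuracy carries 0.35 of the weight, a 10-point change in Accuracy moves Overall by 3.5 points, while the same change in Clarity moves it by only 0.5.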