How GRADE Works
GRADE answers one question for the people who run education programs: can AI actually help you make sense of your program's data?
Most AI benchmarks test essay writing or math puzzles. GRADE instead asks the practical questions a program director, coordinator, or district leader faces every week — about attendance, tutoring, equity, and outcomes — and measures whether AI gives answers you could trust and act on.
What GRADE tests
GRADE asks AI the kinds of questions education teams really face, grouped into five subjects:
- Finding the right numbers — pulling exact counts and rates straight from program data.
- Spotting trends — summarizing how things changed over time, without overstating cause.
- Coaching & recommendations — turning data into specific, well-supported next steps.
- Equity & subgroups — reading subgroup data carefully, including when a group is too small to report on safely.
- Effectiveness & research — weighing a program's own data against published research.
You can read every question and see how each model answered on the Explorer.
The data is realistic — but completely made up
No real students, tutors, or schools appear anywhere in GRADE. Every record — attendance, sessions, surveys — is synthetic, built to behave like a real after-school program. Nothing private is ever exposed, and every model is tested on the exact same data, so the comparison is fair. You can browse the actual datasets behind any question right in the Explorer and the Arena.
How answers are scored
Every answer is judged on six things that matter to a practitioner, weighted by how much they matter:
| What we check | Weight | In plain terms |
|---|---|---|
| Accuracy | 35% | Are the numbers right and tied to the data? |
| Insight | 20% | Does it interpret the data, not just repeat it? |
| Evidence | 15% | Does it point to the specific data behind each claim? |
| Honesty about limits | 15% | Does it flag caveats — like small groups — instead of overstating? |
| Consistency | 10% | Does it give the same answer when asked again? |
| Clarity | 5% | Is it well-organized and easy to use? |
To keep results fair, each model answers every question five times and the scores are averaged, so one lucky or unlucky answer doesn't tip the result. The numbers in each answer are checked by code against the known-correct values, and an expert AI judge rates everything else against a consistent rubric.
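For readers who want to see the arithmetic, here is a minimal sketch of how the weights in the table and the five-run average could combine into one score per question. The weights come from the table above; everything else is an assumption for illustration, including the 0-100 scale, the key names, the function names, and the idea that each run carries a score for all six criteria. The repository has the actual scoring internals.

```python
from statistics import mean

# Weights taken from the scoring table; they add up to 1.0.
WEIGHTS = {
    "accuracy": 0.35,
    "insight": 0.20,
    "evidence": 0.15,
    "honesty_about_limits": 0.15,
    "consistency": 0.10,
    "clarity": 0.05,
}


def weighted_score(criterion_scores):
    """Combine per-criterion scores (assumed 0-100) into one weighted score."""
    missing = set(WEIGHTS) - set(criterion_scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * criterion_scores[name] for name in WEIGHTS)


def question_score(runs):
    """Average the weighted score over repeated answers to one question."""
    if not runs:
        raise ValueError("at least one run is required")
    return mean(weighted_score(run) for run in runs)


# Illustrative, made-up scores for a single answer (not real GRADE results).
example_run = {
    "accuracy": 80,
    "insight": 60,
    "evidence": 70,
    "honesty_about_limits": 90,
    "consistency": 100,
    "clarity": 50,
}
```

Because Accuracy carries 35% of the weight, a 10-point drop in accuracy costs 3.5 points overall, while the same drop in Clarity costs only half a point.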
The judge always comes from a different company than the model being graded, so no model is scored by its own family's "house style." Claude Opus 4.8 judges every model except the Claude models, which are judged by GPT-5.5 at its highest reasoning setting. Both judges are strong frontier models, and both also appear on the leaderboard, but neither ever grades itself or a sibling. Because the Claude models are scored by a different judge than everyone else, we monitor the two judges for differences in strictness and will publish any adjustment if one proves consistently harsher than the other.
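The judge-assignment rule can be written down in a few lines. This is a sketch only: the family labels ("claude", "gpt") and the function name are hypothetical, and the real pipeline may identify models differently.

```python
# Judge names as described in the text; the dictionary keys are
# hypothetical family labels used only for this illustration.
JUDGES = {
    "claude": "Claude Opus 4.8",
    "gpt": "GPT-5.5 (highest reasoning setting)",
}


def judge_for(model_family):
    """Return (judge_family, judge_name) for a model under test.

    Claude models are judged by GPT-5.5; every other model,
    including the GPT models, is judged by Claude Opus 4.8.
    """
    family = model_family.strip().lower()
    judge_family = "gpt" if family == "claude" else "claude"
    # Safety check: a judge never grades its own family.
    assert judge_family != family
    return judge_family, JUDGES[judge_family]
```

For example, `judge_for("claude")` selects the GPT-5.5 judge, while `judge_for("gpt")` and any other family label select Claude Opus 4.8.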
What GRADE can — and can't — tell you
GRADE measures how well AI handles education-program data analysis. It does not measure general knowledge outside that work, whether acting on the advice actually improves your program, or speed and cost.
We're also upfront about today's limits: GRADE currently covers one program over a single term, so it can't speak to year-over-year change yet, and it draws on one set of program data (more variety is planned). The current version has 26 questions, so small score differences — especially within a single subject area — can come down to chance; treat close calls as a tie rather than a ranking. And the AI judge checks each answer against a vetted answer key rather than re-deriving everything from the raw data — the numbers themselves are verified separately, by code, against the known-correct values. We'd rather say so than overstate what GRADE proves.
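To make "verified by code" concrete, here is a rough sketch of what checking an answer's numbers against an answer key could look like. The field names, the sample values, and the tolerance are all assumptions made up for this illustration; they are not taken from GRADE's data or its actual checker.

```python
import math


def check_numbers(reported, answer_key, abs_tol=0.005):
    """Compare reported values with known-correct ones.

    Returns a dict mapping each expected field to True or False.
    A field counts as correct only if it was reported and falls
    within abs_tol of the expected value (tolerance is an assumption).
    """
    results = {}
    for name, expected in answer_key.items():
        value = reported.get(name)
        results[name] = value is not None and math.isclose(
            value, expected, rel_tol=0.0, abs_tol=abs_tol
        )
    return results


# Made-up values for illustration only.
answer_key = {"attendance_rate": 0.87, "sessions_held": 412}
reported = {"attendance_rate": 0.871}

outcome = check_numbers(reported, answer_key)
```

In this made-up case the attendance rate passes because it is within the tolerance, and the session count fails because the answer never reported it.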
Because GRADE's questions and data are published openly, future AI models could encounter them in training. Since all the data is synthetic, we regenerate it with each new version of the benchmark, and scores are only ever compared within the same version.
For researchers and engineers
Full technical details — scoring internals, data schemas, and reproducibility — live in the open-source repository: github.com/PearlEng/grade.