About GRADE

Grounded Reasoning & Analysis for Data in Education — an open benchmark for AI systems working with education program data.

What GRADE measures

GRADE evaluates how accurately and usefully AI systems analyze realistic education program data. The benchmark asks a practical question: given structured program records — attendance logs, session data, subgroup summaries, and published research references — can a model produce answers that a district analyst or program manager could trust and act on?

GRADE V1 contains 26 questions across five subjects, from direct fact retrieval to equity interpretation and research synthesis. All sample data is fully synthetic and seeded. No real students, tutors, schools, or programs are included.

Trustworthy answers require five things simultaneously: factual accuracy, evidence citation, analytical insight, calibrated humility, and consistency. GRADE is purpose-built to test all five.

Who built it

GRADE was created by Pearl, a team building AI tools for tutoring and education programs. Pearl built this benchmark internally to evaluate its own AI systems against rigorous, domain-specific standards — then open-sourced it so the broader education and AI community could use and extend it.

The benchmark is fully open source. Question definitions, sample data, scoring rubrics, and methodology documentation are all publicly available.

🔗 View on GitHub

How scoring works

Most questions are scored automatically. They have known-correct answers derived from synthetic sample data. Each model is evaluated five times per question to measure both accuracy and consistency. Results appear on the Leaderboard.
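As a rough illustration of how five repeated runs can yield both an accuracy figure and a consistency figure, here is a minimal sketch. It is not GRADE's actual scoring code. The function names and the choice of "agreement with the most common answer" as a consistency measure are assumptions made for this example; the real formulas are in the repository.

```python
from collections import Counter


def accuracy_rate(answers, correct):
    """Fraction of runs whose answer equals the known-correct value."""
    if not answers:
        raise ValueError("at least one run is required")
    return sum(answer == correct for answer in answers) / len(answers)


def modal_agreement(answers):
    """Fraction of runs that match the most common answer.

    Hypothetical consistency measure: 1.0 means every run gave the
    same answer, whether or not that answer was correct.
    """
    if not answers:
        raise ValueError("at least one run is required")
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)


# Five runs of one model on one question (made-up values).
runs = ["42", "42", "42", "41", "42"]
print(accuracy_rate(runs, correct="42"))  # 0.8
print(modal_agreement(runs))              # 0.8
```

The two measures are kept separate on purpose: a model that gives the same wrong answer five times is perfectly consistent and completely inaccurate.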

Harder, judgment-based questions also go to the Arena. For research synthesis and program effectiveness questions, the Arena shows anonymized AI responses side by side so education professionals can vote on which would be more useful in practice. Community votes produce the Arena ratings.

Full technical details are in the open-source repository: github.com/PearlEng/grade.

What's measured

Automated scores weight six dimensions:

  • Accuracy — 35%   Does the model report numbers that match the data?
  • Insight — 20%   Does the model interpret signals, not just recite them?
  • Evidence — 15%   Are claims traced to specific data sources?
  • Honesty about limits — 15%   Does the model avoid overconfident or unsupported claims?
  • Consistency — 10%   Does the same question yield the same answer across five runs?
  • Clarity — 5%   Is the response clear enough for a practitioner to scan?
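The weights above can be combined into a single score as a weighted sum. The sketch below assumes each dimension has already been scored on a 0 to 1 scale; that scale, the dictionary keys, and the function name are illustrative assumptions, not GRADE's published implementation.

```python
# Weights taken from the list above, expressed as fractions of 1.
WEIGHTS = {
    "accuracy": 0.35,
    "insight": 0.20,
    "evidence": 0.15,
    "honesty_about_limits": 0.15,
    "consistency": 0.10,
    "clarity": 0.05,
}


def composite_score(dimension_scores):
    """Weighted sum of per-dimension scores (each assumed to be in [0, 1])."""
    missing = set(WEIGHTS) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)


# Made-up per-dimension scores for one response.
example = {
    "accuracy": 0.9,
    "insight": 0.5,
    "evidence": 1.0,
    "honesty_about_limits": 0.8,
    "consistency": 1.0,
    "clarity": 0.6,
}
print(round(composite_score(example), 3))  # 0.815
```

Because accuracy carries the largest weight, a response with perfect clarity and insight but wrong numbers still scores poorly under this scheme.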

Scoring internals and formulas are documented in the Methodology and, in more detail, on GitHub.

Open benchmark principles

  • All sample data is fully synthetic — no real student or program records.
  • Question definitions, rubrics, and correct values are published on GitHub.
  • Any AI system can be evaluated — GRADE is not a benchmark of any single product.
  • Results appear on the Leaderboard only when accompanied by a published results.json artifact.
  • Publication and contribution guidelines are documented in the repository.