Intermediate · ~18 hours

Build an Eval Set for an AI Product

Build a 50-example eval set for a real AI task — extraction, classification, summarization — and score 3 models.

Python or TypeScript · OpenAI / Anthropic SDK · Spreadsheet for the rubric

About this project

Evals are how AI PMs prove their work. This project teaches the methodology: task definition, gold-set construction, scoring rubric design (LLM-as-judge plus heuristics), and the trade-off discipline of choosing one model over another. Pick a task you care about, build 50 high-quality examples, and score Claude, GPT, and an open-source model against them.
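To make the "heuristics" half of the rubric concrete, here is a minimal sketch of a deterministic scorer for an extraction task. All field names (`vendor`, `due_date`, `total`) and the example record are illustrative, not prescribed by the brief; in practice an LLM-as-judge pass would complement this for fuzzier criteria like tone or completeness.

```python
# Sketch: heuristic scoring for an extraction-style eval example.
# Field names and data are hypothetical; adapt to your own task schema.

def heuristic_score(expected: dict, output: dict) -> float:
    """Fraction of expected fields the model reproduced exactly."""
    if not expected:
        return 1.0
    hits = sum(1 for key, value in expected.items() if output.get(key) == value)
    return hits / len(expected)

example = {
    "input": "Invoice #4821 from Acme Corp, due 2026-03-01, total $1,240.00",
    "expected": {"vendor": "Acme Corp", "due_date": "2026-03-01", "total": "$1,240.00"},
}
model_output = {"vendor": "Acme Corp", "due_date": "2026-03-01", "total": "$1240.00"}

# 2 of 3 fields match exactly; the total's formatting differs.
print(heuristic_score(example["expected"], model_output))
```

Exact-match scoring is deliberately strict: near-misses like the `$1240.00` formatting above surface real model behavior you will want to rule on explicitly in the rubric.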

Why build this in 2026?

Eval design is the technical skill that separates AI PMs from generic PMs, yet very few PMs have actually shipped an eval set.

What you'll ship

  • Eval set in CSV or JSON (50+ examples)
  • Scoring rubric
  • Comparison table across 3 models
  • Recommendation memo with reasoning
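The first and third deliverables can be sketched together: one eval-set record and the aggregation that produces the comparison table. The JSON schema and model names here are illustrative assumptions; the brief only requires 50+ examples in CSV or JSON and a table across three models.

```python
# Sketch: one eval-set record plus per-model score aggregation.
# Schema, scores, and model names are hypothetical placeholders.
import json

record = {
    "id": "ex-001",
    "task": "extraction",
    "input": "Invoice #4821 from Acme Corp, due 2026-03-01",
    "expected": {"vendor": "Acme Corp", "due_date": "2026-03-01"},
    "notes": "Edge case: ISO date format",
}

# Per-example scores for each model (one float per eval example).
scores = {
    "claude": [1.0, 0.8, 1.0],
    "gpt": [1.0, 0.6, 0.9],
    "open-source": [0.7, 0.5, 0.8],
}

# Mean score per model becomes a row in the comparison table.
table = {model: round(sum(s) / len(s), 2) for model, s in scores.items()}
print(json.dumps(table, indent=2))
```

Keeping a `notes` field on each record pays off later: the recommendation memo is much easier to write when every example records why it was included.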


Skills you'll practice

large language models · A/B testing · product management