Build an Eval Set for an AI Product
Build a 50-example eval set for a real AI task — extraction, classification, summarization — and score 3 models.
- Python or TypeScript
- OpenAI / Anthropic SDK
- Spreadsheet for the rubric
About this project
Evals are how AI PMs prove their work. This project teaches the methodology: task definition, gold-set construction, scoring rubric design (LLM-as-judge plus heuristics), and the trade-off discipline of choosing one model over another. Pick a task you care about, build 50 high-quality examples, and score Claude, GPT, and an open-source model on it.
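To make the methodology concrete, here is a minimal sketch of what one gold-set example and a heuristic scorer might look like. The schema and field names (`id`, `input`, `gold`, `tags`) are assumptions for illustration, not a required format; the point is that each example pairs an input with a verifiable expected answer.

```python
# One gold-set example for an extraction task (assumed schema):
# the input text, the expected (gold) answer, and tags for slicing
# results later (e.g. by difficulty or document type).
example = {
    "id": "ex-001",
    "task": "extraction",
    "input": "Invoice #4412 from Acme Corp, due 2026-03-01, total $1,250.00",
    "gold": {"invoice_number": "4412", "vendor": "Acme Corp", "total": "1250.00"},
    "tags": ["invoices", "easy"],
}

def exact_match(pred: dict, gold: dict) -> float:
    """Heuristic score: fraction of gold fields the model got exactly right."""
    if not gold:
        return 0.0
    hits = sum(1 for key, value in gold.items() if pred.get(key) == value)
    return hits / len(gold)

# A perfect prediction scores 1.0; a partial one scores proportionally.
full = {"invoice_number": "4412", "vendor": "Acme Corp", "total": "1250.00"}
partial = {"invoice_number": "4412"}
print(exact_match(full, example["gold"]))     # 1.0
print(exact_match(partial, example["gold"]))  # ~0.33
```

Heuristic scorers like this work well for extraction and classification; for summarization, where there is no single exact answer, you would pair them with an LLM-as-judge rubric.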
Why build this in 2026?
Eval design is the technical skill that separates AI PMs from generic PMs, and very few PMs have ever shipped one.
What you'll ship
- Eval set in CSV or JSON (50+ examples)
- Scoring rubric
- Comparison table across 3 models
- Recommendation memo with reasoning
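The comparison table can be produced with a small harness like the sketch below. `run_model` is a placeholder for a real OpenAI or Anthropic SDK call; here it is stubbed so the scoring and aggregation logic is runnable on its own, and the model names are hypothetical.

```python
# Hedged sketch: run each model over the eval set, average per-example
# scores, and print a small comparison table.
from statistics import mean

def run_model(model: str, example: dict) -> str:
    # Placeholder: swap in a real SDK call here. The stub just echoes
    # the gold answer so the harness runs end to end without API keys.
    return example["gold"]

def score(pred: str, gold: str) -> float:
    # Simple classification heuristic: case-insensitive exact match.
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0

eval_set = [
    {"input": "Is this email spam? 'WIN A FREE CRUISE'", "gold": "spam"},
    {"input": "Is this email spam? 'Meeting moved to 3pm'", "gold": "not spam"},
]

models = ["claude-x", "gpt-x", "open-source-x"]  # hypothetical names
results = {
    m: mean(score(run_model(m, ex), ex["gold"]) for ex in eval_set)
    for m in models
}

# Print the comparison table that feeds the recommendation memo.
print(f"{'model':<15}{'accuracy':>10}")
for model, acc in results.items():
    print(f"{model:<15}{acc:>10.2f}")
```

In practice you would also export `results` (and per-example scores) to CSV, so the recommendation memo can cite which examples each model failed.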
Skills you'll practice
- Large language models
- A/B testing
- Product management