Intermediate · ~18 hours

Build an Eval Set for an AI Product

Build a 50-example eval set for a real AI task — extraction, classification, summarization — and score 3 models.

Python or TypeScript · OpenAI / Anthropic SDK · Spreadsheet for the rubric

About this project

Evals are how AI PMs prove their work. This project teaches the methodology: task definition, gold-set construction, scoring rubric design (LLM-as-judge plus heuristics), and the trade-off discipline of choosing one model over another. Pick a task you care about, build 50 high-quality examples, and score Claude, GPT, and an open-source model against them.
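To make the "heuristics" half of the rubric concrete, here is a minimal sketch of a deterministic scorer for an extraction task. All field names (`vendor`, `due_date`, `total`) and the example record are illustrative, not prescribed by the brief; in practice an LLM-as-judge pass would complement this for fuzzier criteria like tone or completeness.

```python
# Sketch: heuristic scoring for an extraction-style eval example.
# Field names and data are hypothetical; adapt to your own task schema.

def heuristic_score(expected: dict, output: dict) -> float:
    """Fraction of expected fields the model reproduced exactly."""
    if not expected:
        return 1.0
    hits = sum(1 for key, value in expected.items() if output.get(key) == value)
    return hits / len(expected)

example = {
    "input": "Invoice #4821 from Acme Corp, due 2026-03-01, total $1,240.00",
    "expected": {"vendor": "Acme Corp", "due_date": "2026-03-01", "total": "$1,240.00"},
}
model_output = {"vendor": "Acme Corp", "due_date": "2026-03-01", "total": "$1240.00"}

# 2 of 3 fields match exactly; the total's formatting differs.
print(heuristic_score(example["expected"], model_output))
```

Exact-match scoring is deliberately strict: near-misses like the `$1240.00` formatting above surface real model behavior you will want to rule on explicitly in the rubric.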

Why build this in 2026?

Eval design is the technical skill that separates AI PMs from generic PMs, yet very few PMs have actually shipped an eval set.

What you'll ship

  • Eval set in CSV or JSON (50+ examples)
  • Scoring rubric
  • Comparison table across 3 models
  • Recommendation memo with reasoning
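The first and third deliverables can be sketched together: one eval-set record and the aggregation that produces the comparison table. The JSON schema and model names here are illustrative assumptions; the brief only requires 50+ examples in CSV or JSON and a table across three models.

```python
# Sketch: one eval-set record plus per-model score aggregation.
# Schema, scores, and model names are hypothetical placeholders.
import json

record = {
    "id": "ex-001",
    "task": "extraction",
    "input": "Invoice #4821 from Acme Corp, due 2026-03-01",
    "expected": {"vendor": "Acme Corp", "due_date": "2026-03-01"},
    "notes": "Edge case: ISO date format",
}

# Per-example scores for each model (one float per eval example).
scores = {
    "claude": [1.0, 0.8, 1.0],
    "gpt": [1.0, 0.6, 0.9],
    "open-source": [0.7, 0.5, 0.8],
}

# Mean score per model becomes a row in the comparison table.
table = {model: round(sum(s) / len(s), 2) for model, s in scores.items()}
print(json.dumps(table, indent=2))
```

Keeping a `notes` field on each record pays off later: the recommendation memo is much easier to write when every example records why it was included.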


Skills you'll practice

large language models · A/B testing · product management