At Playful Software, I worked closely with the product team to design the human-evaluation methodology and metric set for AI-generated mini-apps. Most of the design effort went into figuring out what to measure. Phrasing the rubric was the easier piece.
The work ran in weekly cycles, with a small team of reviewers scoring mini-apps against the rubric. The human evaluations ran separately from the LLM-as-judge harness. The two systems were meant to complement each other rather than be coupled, since each was strong at catching different kinds of issues.
A lot of the methodology effort went into making sure the metrics themselves were trustworthy. The main mechanism was weekly quality-control sessions where reviewers walked through the calls they had made and flagged the ones they were uncertain about. That was where most of the real metric refinement happened. Edge cases would surface in those conversations, and we used them to tighten definitions or drop metrics that were not pulling their weight. We also did occasional inter-rater agreement spot-checks when we wanted a sanity read on a specific metric.
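To illustrate what an inter-rater agreement spot-check can look like, here is a minimal sketch using Cohen's kappa, a standard chance-corrected agreement statistic for two raters. This is an assumption for illustration, not the proprietary implementation: the function name, the categorical string labels, and the choice of kappa itself are all hypothetical.

```typescript
// Hypothetical label type: a categorical rubric score such as "pass" or "fail".
type Label = string;

// Cohen's kappa for two reviewers who scored the same items in the same order.
// 1 means perfect agreement; 0 means agreement no better than chance.
function cohensKappa(a: Label[], b: Label[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("Both reviewers must score the same non-empty set of items");
  }
  const n = a.length;
  const countsA = new Map<Label, number>();
  const countsB = new Map<Label, number>();
  let agreed = 0;

  for (let i = 0; i < n; i++) {
    if (a[i] === b[i]) agreed++;
    countsA.set(a[i], (countsA.get(a[i]) ?? 0) + 1);
    countsB.set(b[i], (countsB.get(b[i]) ?? 0) + 1);
  }

  // Observed agreement: share of items where both reviewers gave the same label.
  const observed = agreed / n;

  // Expected agreement by chance, from each reviewer's own label frequencies.
  let expected = 0;
  countsA.forEach((countA, label) => {
    expected += (countA / n) * ((countsB.get(label) ?? 0) / n);
  });

  // Degenerate case: both reviewers used one identical label for every item.
  if (expected === 1) return 1;
  return (observed - expected) / (1 - expected);
}
```

For example, if two reviewers agree on three of four items, with labels `["pass", "pass", "fail", "fail"]` and `["pass", "fail", "fail", "fail"]`, the observed agreement is 0.75, the chance agreement is 0.5, and kappa comes out to 0.5. A low value on a single metric is the kind of signal that would send that metric's definition back to a QC session.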
Early on, we also cross-checked the human evaluations against the LLM judge to see how the two systems aligned. We dropped that comparison fairly quickly once it was clear the two methods were surfacing different kinds of issues. They turned out to be more useful as independent signals than as a check on each other.
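As a rough sketch of what such a cross-check could compute, the snippet below tallies a per-metric disagreement rate between human and LLM-judge verdicts. Everything here is assumed for illustration: the `PairedVerdict` shape, the pass/fail simplification, and the function name are hypothetical and do not describe the actual harness.

```typescript
// Hypothetical record: one pass/fail verdict per (mini-app, metric) from each system.
interface PairedVerdict {
  appId: string;
  metric: string;
  humanPass: boolean;
  judgePass: boolean;
}

// Fraction of paired verdicts, per metric, where the human reviewer
// and the LLM judge reached different conclusions.
function disagreementRateByMetric(verdicts: PairedVerdict[]): Record<string, number> {
  const totals: Record<string, { n: number; disagreed: number }> = {};

  for (const v of verdicts) {
    const t = totals[v.metric] ?? { n: 0, disagreed: 0 };
    t.n += 1;
    if (v.humanPass !== v.judgePass) t.disagreed += 1;
    totals[v.metric] = t;
  }

  const rates: Record<string, number> = {};
  for (const metric of Object.keys(totals)) {
    rates[metric] = totals[metric].disagreed / totals[metric].n;
  }
  return rates;
}
```

A rate like this only shows where the two systems diverge. Deciding whether a divergence is an error in one system or a genuinely different kind of finding still takes a person reading the individual cases.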
I also contributed to the human-evaluation UI that operationalized all of this and took on some of its ongoing maintenance.
- Partnered with product across many cycles to design the human-evaluation metric set and rubrics for AI-generated mini-apps
- Spent most of the design effort on what to measure, which was a harder question than how to phrase the rubric
- Ran weekly evaluation rounds with a small team of reviewers, plus weekly QC sessions in which reviewers walked through their calls and edge cases
- Validated the metrics primarily through weekly reviewer QC discussions, with occasional inter-rater agreement spot-checks
- Cross-checked human scores against the LLM judge early on, then dropped that comparison after the two methods kept surfacing different kinds of issues
- Contributed to the supporting human-evaluation UI and took on ongoing maintenance for parts of it
Tools & Stack
Langfuse · TypeScript · Claude · Evaluation Design · Metric Definition · Rubric Development · Cross-Functional Collaboration · Human-in-the-Loop Evaluation
Demo and Access
Internal work at Playful Software. Methodology, metrics, and implementation details are proprietary.