LLM-as-Judge Evaluation Harness

A CLI evaluation harness that used LLM-as-judge methods to score AI-generated mini-app code and outputs at Playful Software.

At Playful Software, I built a CLI-based evaluation harness that used LLM-as-judge methods to score AI-generated mini-app code and outputs. The same command could be run by hand from a developer's machine, fired off by a comment on a PR preview deployment, or run on a nightly schedule across an established evaluation dataset.
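The real command-line interface is proprietary, so the following is only a minimal sketch of how one entry point might serve all three triggers. The flag names (`--trigger`, `--dataset`, `--repeats`), the trigger labels, and the defaults are all hypothetical, chosen for illustration.

```typescript
// Hypothetical sketch: flags, trigger labels, and defaults are illustrative
// assumptions, not the harness's actual interface.
type Trigger = "manual" | "pr-comment" | "nightly";

interface RunConfig {
  trigger: Trigger;
  dataset: string; // name of the evaluation dataset to run against
  repeats: number; // how many times each item is judged
}

const TRIGGERS: readonly string[] = ["manual", "pr-comment", "nightly"];

function isTrigger(value: string): value is Trigger {
  return TRIGGERS.indexOf(value) !== -1;
}

// Parse `--key value` pairs into a RunConfig, starting from manual-run defaults.
function parseArgs(argv: string[]): RunConfig {
  const config: RunConfig = { trigger: "manual", dataset: "smoke", repeats: 1 };
  for (let i = 0; i < argv.length; i += 2) {
    const key = argv[i];
    const value = argv[i + 1];
    if (value === undefined) throw new Error(`Missing value for ${key}`);
    switch (key) {
      case "--trigger":
        if (!isTrigger(value)) throw new Error(`Unknown trigger: ${value}`);
        config.trigger = value;
        break;
      case "--dataset":
        config.dataset = value;
        break;
      case "--repeats": {
        const n = Number(value);
        if (!Number.isInteger(n) || n < 1) throw new Error(`Invalid repeats: ${value}`);
        config.repeats = n;
        break;
      }
      default:
        throw new Error(`Unknown flag: ${key}`);
    }
  }
  return config;
}
```

Under these assumptions, a developer, a PR comment hook, and a nightly scheduler would all call the same parser and differ only in the arguments they pass.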

Most of the effort went into making the judge reliable. Repeated runs over the same input had to land in roughly the same place; otherwise, nothing downstream of the harness could be trusted. I drove the variance down to a level the team could actually build on, through ongoing iteration on judge prompts and the agent architectures around them.
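As a rough illustration of what "landing in roughly the same place" can mean in practice, the sketch below summarizes the spread of judge scores collected from repeated runs over one input. It is not the harness's actual metric: the function names, the choice of sample standard deviation, and the threshold parameter are assumptions, and the scores are passed in rather than produced by a real judge call.

```typescript
// Hypothetical sketch: one simple way to quantify repeated-run spread.
// The metric and names are illustrative assumptions.
interface RunSummary {
  mean: number;
  stdDev: number; // sample standard deviation across repeated runs
  range: number;  // max score minus min score
}

// Summarize the scores a judge returned for the same input across several runs.
function summarizeRuns(scores: number[]): RunSummary {
  if (scores.length < 2) {
    throw new Error("Need at least two runs to measure spread");
  }
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  const squaredDiffs = scores.reduce((sum, s) => sum + (s - mean) ** 2, 0);
  const stdDev = Math.sqrt(squaredDiffs / (scores.length - 1));
  const range = Math.max(...scores) - Math.min(...scores);
  return { mean, stdDev, range };
}

// Treat an input as stable when its spread stays under a caller-chosen limit.
function isStable(summary: RunSummary, maxStdDev: number): boolean {
  return summary.stdDev <= maxStdDev;
}
```

Tracking a summary like this per input makes it possible to tell whether a prompt or architecture change actually tightened the judge or only shifted its average.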

The harness scored generated code on a set of quality dimensions, and scored generated outputs on a set of tone and personality characteristics that mattered for the product. The goal was to turn qualities the product team had strong opinions about into scores we could track across releases.
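To make the two kinds of scoring concrete, here is a minimal sketch of how per-dimension scores might be recorded and averaged so they can be compared from one release to the next. The actual dimensions and scale are proprietary; the `target` split, the record shape, and any dimension names passed in are placeholders.

```typescript
// Hypothetical sketch: record shape and dimension names are placeholders,
// not the harness's real rubric.
type Target = "code" | "output";

interface DimensionScore {
  target: Target;    // whether the judge scored generated code or generated output
  dimension: string; // e.g. a quality dimension or a tone characteristic
  score: number;
}

// Average scores per target and dimension, keyed as "target:dimension".
function averageByDimension(scores: DimensionScore[]): Map<string, number> {
  const totals = new Map<string, { sum: number; count: number }>();
  for (const s of scores) {
    const key = `${s.target}:${s.dimension}`;
    const entry = totals.get(key) ?? { sum: 0, count: 0 };
    entry.sum += s.score;
    entry.count += 1;
    totals.set(key, entry);
  }
  const averages = new Map<string, number>();
  totals.forEach((t, key) => averages.set(key, t.sum / t.count));
  return averages;
}
```

Keeping code and output dimensions in one record type, distinguished by `target`, is one way to let the same aggregation serve both kinds of judgment.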


  • Built a CLI evaluation harness using LLM-as-judge methods for AI-generated mini-app code and outputs
  • Supported manual runs, automated runs triggered by comments on PR preview deployments, and a nightly schedule across an established evaluation dataset
  • Drove repeated-run variance down to a level the team could rely on, through ongoing prompt and agent-architecture iteration
  • Scored generated code on quality dimensions and generated outputs on tone and personality characteristics

Tools & Stack
Claude · Langfuse · TypeScript · LLM-as-Judge · CLI Tooling · Prompt Engineering · Reliability & Variance Analysis · Evaluation Design


Demo and Access
Internal tool developed for Playful Software. Methodology, models, metrics, and implementation details are proprietary.