Case study 03

Evaluation that explains creative failure.

Enterprise teams need to decide which generated outputs are usable, repeatable, and aligned with the brief. A clip can look impressive and still fail on motion, identity, instruction adherence, or brand review.

Creative QA scorecard

The workflow.

Evaluation becomes useful when it helps a creative team decide what to change next, not just when it produces a benchmark number.

  1. Collect candidate videos from multiple prompts, models, or workflow settings.
  2. Evaluate instruction fit, temporal consistency, motion quality, physical plausibility, and creative readiness.
  3. Produce a scorecard that highlights strengths, weaknesses, and likely failure reasons.
  4. Review results with creative stakeholders.
  5. Feed insights back into prompting, references, controls, or model selection.
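The scorecard in steps 2 and 3 can be sketched as a small data structure; the names below (`ClipScore`, `build_scorecard`, the 0-to-1 score range) are assumptions for illustration, not the actual tooling.

```python
from dataclasses import dataclass, field

# Evaluation dimensions from step 2 of the workflow.
DIMENSIONS = [
    "instruction_fit",
    "temporal_consistency",
    "motion_quality",
    "physical_plausibility",
    "creative_readiness",
]

@dataclass
class ClipScore:
    """Hypothetical scorecard entry for one candidate clip (step 3)."""
    clip_id: str
    scores: dict                               # dimension -> 0.0..1.0
    notes: list = field(default_factory=list)  # plain-language failure reasons

    def weakest(self) -> str:
        # The lowest-scoring dimension suggests what to change next (step 5).
        return min(self.scores, key=self.scores.get)

def build_scorecard(candidates):
    """Rank candidates by total score, strongest first."""
    return sorted(candidates, key=lambda c: sum(c.scores.values()), reverse=True)
```

A `weakest()` call on each entry gives the review meeting in step 4 a concrete starting point rather than an aggregate number.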
Technical exploration


The related repo reframes video model evaluation as creative QA: side-by-side comparison, failure labels, and plain-language rationale.
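One way to pair failure labels with plain-language rationale is a fixed taxonomy plus a template lookup. The labels and wording below are a sketch; the repo's actual taxonomy may differ.

```python
from enum import Enum

class FailureLabel(Enum):
    # Hypothetical failure taxonomy for creative QA.
    MOTION = "motion"
    IDENTITY = "identity"
    INSTRUCTION = "instruction"
    BRAND = "brand"
    PHYSICS = "physics"

# Plain-language rationale keyed by label, so reviewers see why a
# failure matters rather than only a metric value.
RATIONALE = {
    FailureLabel.PHYSICS: "Objects move in ways a viewer will read as wrong.",
    FailureLabel.IDENTITY: "The subject's appearance drifts between shots.",
    FailureLabel.INSTRUCTION: "The clip ignores part of the brief.",
}

def explain(labels):
    """Map a clip's failure labels to reviewer-facing explanations."""
    return [RATIONALE.get(label, "No rationale recorded.") for label in labels]
```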

Demo moments
  • Three clips generated for the same brief.
  • A polished clip that fails physics or continuity.
  • A less flashy clip that better satisfies production needs.
  • A scorecard explaining the recommended next iteration.
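The second and third demo moments imply a gating rule: hard failures veto a clip no matter how polished it looks. A minimal sketch of that selection logic, with hypothetical field names:

```python
def recommend(clips):
    """Pick the clip that best satisfies production needs.

    Hard failures (e.g. physics, continuity) disqualify a clip
    regardless of polish; among the survivors, prefer higher polish.
    """
    usable = [c for c in clips if not c["hard_fails"]]
    if not usable:
        return None  # every candidate failed; iterate on the brief instead
    return max(usable, key=lambda c: c["polish"])
```

Under this rule, a flashy clip that fails physics loses to a plainer clip with no hard failures, which is exactly the trade-off the demo highlights.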

"Here is how your team can compare outputs with more than taste alone: where the model failed, why it matters, and what to change next."

Product insight.

This workflow shows which failure labels are meaningful to creative teams, when automated metrics help or distract, and which failure modes should become product controls.