Overview

Natural language unit tests bring the clarity and rigor of software unit testing to LLM evaluation. By defining simple, testable statements about desired response qualities, teams can assess model outputs with far more precision than traditional metrics or general-purpose judges. LMUnit is a specialized model built for this paradigm. It delivers state-of-the-art fine-grained evaluation, strong human alignment, and clearer, more actionable insights to help teams debug, improve, and safely deploy LLM applications.

Key Features

  • Natural language unit tests for evaluating specific response qualities.
  • State-of-the-art accuracy on fine-grained benchmarks (FLASK, BiGGen Bench).
  • Outperforms frontier models used as judges on targeted evaluation tasks.
  • High human alignment with 93.5% RewardBench accuracy.
  • Actionable feedback with more granular error detection than LM judges.
  • Easy integration into CI/CD and existing evaluation workflows.
  • Accessible to all stakeholders via readable, test-based criteria.
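To make the paradigm concrete, here is a minimal Python sketch of natural language unit tests. Note that everything here is illustrative, not the LMUnit API: the `check` predicates are toy heuristics standing in for a learned evaluator like LMUnit, which would score each statement against the response directly.

```python
from dataclasses import dataclass
from typing import Callable

# A natural language unit test is a single testable statement
# about one desired quality of a model response.
@dataclass
class UnitTest:
    statement: str
    # Toy heuristic standing in for a learned evaluator:
    # a predicate over the response text.
    check: Callable[[str], bool]

def run_unit_tests(response: str, tests: list[UnitTest]) -> dict[str, bool]:
    """Score a response against each unit test independently."""
    return {t.statement: t.check(response) for t in tests}

tests = [
    UnitTest("The response mentions a concrete date.",
             lambda r: any(ch.isdigit() for ch in r)),
    UnitTest("The response stays under 50 words.",
             lambda r: len(r.split()) < 50),
]

response = "The API launched on 2024-03-12 and is generally available."
results = run_unit_tests(response, tests)
for statement, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}: {statement}")
```

Because each criterion is an independent, human-readable statement, failures point to a specific quality of the response rather than a single opaque score, which is what makes this style of evaluation easy to wire into CI/CD as ordinary pass/fail checks.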

Getting Started

See the LMUnit How-to guide for a detailed walkthrough on how to use the LMUnit API.

Additional Resources