Evaluation Methodology: Grading

Method Overview

Every test case is assigned one of three grades:

  • Fully Meets Requirements (2) - The Agent's response contains all the necessary details described in the metric and nothing that should not be there.
  • Partially Meets Requirements (1) - The Agent's response contains everything that is critical for the given metric, but either some necessary part is missing or something is present that, according to the metric, ideally should not be there.
  • Never Meets Requirements (0) - The Agent's response is missing some critical information required for this metric, contains significant errors, or includes misleading information.
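
As a minimal sketch, the three grades above could be represented in code as an integer enumeration. The names `Grade` and `parse_grade` are hypothetical and are not part of any documented API; only the numeric values 0, 1, and 2 come from the list above.

```python
from enum import IntEnum


class Grade(IntEnum):
    """Three-grade scale; the numeric values mirror the list above."""
    NEVER_MEETS = 0
    PARTIALLY_MEETS = 1
    FULLY_MEETS = 2


def parse_grade(raw_score: int) -> Grade:
    """Convert a raw score into a Grade, rejecting anything outside 0..2.

    Hypothetical helper: shown only to illustrate validating a judge's output.
    """
    try:
        return Grade(raw_score)
    except ValueError:
        raise ValueError(f"score must be 0, 1, or 2, got {raw_score!r}") from None
```

Using an integer-backed enumeration keeps the grades comparable (for example, `Grade.PARTIALLY_MEETS < Grade.FULLY_MEETS`) while making out-of-range scores fail loudly.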

Importance of Well-Defined Evaluation Metrics

The Three-Grade Evaluation method is only as reliable as the evaluation metrics it uses. Clear, precise metrics give evaluators a consistent standard against which to assess LLM outputs, which reduces ambiguity and makes grades more consistent across evaluators.

For guidance on developing robust evaluation metrics, refer to the Metrics Section in the Product Requirements Document (PRD).

Enhancing Precision with LLM Judge Voting

To improve the accuracy and objectivity of evaluations, the Three-Grade Evaluation framework can incorporate a voting mechanism with multiple LLM judges. In this setup, up to five LLMs each assess every response independently, using the same established evaluation metrics.

Diagram illustrating the voting system for three-grade evaluation

The scores from the individual LLM judges are then aggregated into a weighted average, which determines the final grade for the response. The reasoning provided by each judge is also compiled, so the final grade can be traced back to the judgments behind it.
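
The aggregation step described above can be sketched as follows. This is an illustration under stated assumptions, not the framework's actual implementation: `JudgeVote`, `aggregate_votes`, the judge names, and the per-judge `weight` values are all hypothetical, and the source does not specify how weights are chosen or how the weighted average is mapped back to a discrete grade, so the sketch returns the average as-is.

```python
from dataclasses import dataclass


@dataclass
class JudgeVote:
    judge: str       # hypothetical judge identifier
    score: int       # 0, 1, or 2 on the three-grade scale
    weight: float    # assumed relative weight of this judge
    reasoning: str   # the judge's explanation for its score


def aggregate_votes(votes: list[JudgeVote]) -> tuple[float, list[str]]:
    """Return the weighted average score and the compiled reasoning."""
    if not 1 <= len(votes) <= 5:
        raise ValueError("expected between one and five judge votes")
    if any(v.score not in (0, 1, 2) for v in votes):
        raise ValueError("every score must be 0, 1, or 2")

    total_weight = sum(v.weight for v in votes)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")

    # Weighted average: sum(score * weight) / sum(weight)
    weighted_average = sum(v.score * v.weight for v in votes) / total_weight

    # Keep each judge's reasoning alongside its name for later review
    compiled_reasoning = [f"{v.judge}: {v.reasoning}" for v in votes]
    return weighted_average, compiled_reasoning


votes = [
    JudgeVote("judge_a", 2, 1.0, "All required details are present."),
    JudgeVote("judge_b", 2, 1.0, "Complete and accurate."),
    JudgeVote("judge_c", 1, 2.0, "One required detail is missing."),
]
average, reasoning = aggregate_votes(votes)
```

With equal weights this reduces to a plain mean; unequal weights let a more trusted judge count for more, which is one plausible reading of "weighted average" here.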

This multi-judge approach mitigates the biases of any single judge and draws on the differing strengths of different LLMs, resulting in more balanced and reliable evaluations.