Evaluation Overview
In the rapidly evolving era of AI, ranking AI Agents' outputs is both challenging and important. At Teammately, we apply the concept of "LLM-as-a-Judge", whereby LLMs score other LLMs' outputs, providing reliable and stable ratings backed by reasoning for each unique test case.
Using LLMs as evaluators offers a scalable, consistent, and efficient way to assess AI outputs. While human evaluation still plays a major part in building AI Agents, we believe much of this process can be automated with LLM-as-a-judge by combining multiple evaluation methodologies and leading language models.
Evaluation Methodologies
Our latest framework supports multiple efficient evaluation strategies, each tailored to specific assessment needs:
Graded Evaluations - Three-Grade Scale: Categorizes responses on a scale from 0 to 2 and provides reasoning against your custom criteria (see the sketch after this list).
Pairwise Comparisons - Evaluates two genflows' responses to the same request, determining which is superior based on defined criteria. This method is effective for A/B testing different models or prompt configurations before going to production.
Fact-Checking - Cross-verifies responses against trusted sources or retrieved contexts to detect hallucinations or misinformation.
Extra: LLM-as-a-judge voting - If your project requires maximum precision, you can select several models from different providers to produce weighted evaluation results for both Graded and Pairwise evaluations. You can select up to 6 LLM judge providers, and the list is continually updated as new Large Language Models are released.
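For illustration, here is a minimal sketch of what a graded (0-2) judge call could look like. The judge model, prompt wording, and the grade_response helper are assumptions for this example, not Teammately's internal implementation.

```python
# Minimal sketch of a graded (0-2) LLM-as-a-judge call.
# The model name, prompt wording, and JSON schema are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an impartial evaluator.
Grade the ASSISTANT RESPONSE against the CRITERIA on a 0-2 scale:
0 = fails the criteria, 1 = partially meets them, 2 = fully meets them.
Return JSON: {{"score": <0|1|2>, "reasoning": "<one short paragraph>"}}

CRITERIA: {criteria}
USER REQUEST: {request}
ASSISTANT RESPONSE: {response}"""

def grade_response(request: str, response: str, criteria: str) -> dict:
    """Ask a judge model for a 0-2 grade plus reasoning."""
    completion = client.chat.completions.create(
        model="gpt-4o",  # any judge-capable model
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                criteria=criteria, request=request, response=response
            ),
        }],
    )
    return json.loads(completion.choices[0].message.content)

verdict = grade_response(
    request="Summarize our refund policy.",
    response="Refunds are issued within 14 days of purchase.",
    criteria="The answer must be grounded in the provided policy and concise.",
)
print(verdict["score"], verdict["reasoning"])
```

The same pattern extends to Pairwise Comparisons by passing two candidate responses and asking the judge which one better satisfies the criteria.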
Model Integration
Teammately's evaluation system is compatible with leading LLMs, each bringing unique strengths to the judging process:
- OpenAI's GPT-4.5 and GPT-4o
- Anthropic's Claude 3.5 Sonnet and Claude 3.7 Sonnet
- Google's Gemini 2.0 Pro, 1.5 Pro and 1.5 Flash
- Meta's Llama 4
- DeepSeek's R1
By integrating multiple models, we enhance evaluation reliability through inter-model consensus, mitigating individual model biases and offering a wide range of accuracy/price trade-offs.
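A minimal sketch of how such inter-model consensus could be aggregated is shown below. The judge names, weights, and the weighted_consensus helper are hypothetical; in practice each score would come from a separate judge call like the one sketched above.

```python
# Sketch of weighted inter-model consensus for graded (0-2) evaluations.
# Judge identifiers and weights are hypothetical placeholders.

def weighted_consensus(scores: dict[str, int], weights: dict[str, float]) -> float:
    """Combine per-judge 0-2 grades into a single weighted score."""
    total_weight = sum(weights[judge] for judge in scores)
    return sum(scores[judge] * weights[judge] for judge in scores) / total_weight

judge_scores = {"gpt-4o": 2, "claude-3-7-sonnet": 1, "gemini-2.0-pro": 2}
judge_weights = {"gpt-4o": 1.0, "claude-3-7-sonnet": 1.0, "gemini-2.0-pro": 0.8}
print(weighted_consensus(judge_scores, judge_weights))  # (2 + 1 + 1.6) / 2.8 ≈ 1.64
```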
Test Case Generation
To run the evaluation, you will need a test dataset with user inputs for the AI Agent. The Agent's responses to the dataset will be evaluated using fixed or custom metrics, depending on the Evaluation Methodology selected.
Either upload a custom dataset in .csv format or use the Test Cases Generator to create a fixed number of test cases based on the Major Use Cases from the PRD.
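As a rough sketch, a custom .csv dataset could be consumed like this. The file name, the user_input column, and the call_agent placeholder are assumptions for illustration rather than Teammately's actual format.

```python
# Sketch of preparing evaluation inputs from a .csv test dataset.
# File name, column name, and the agent call are hypothetical placeholders.
import csv

def load_test_cases(path: str) -> list[str]:
    """Read user inputs from a CSV test dataset (one test case per row)."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row["user_input"] for row in csv.DictReader(f)]

def call_agent(user_input: str) -> str:
    """Placeholder for the AI Agent under evaluation; replace with your genflow call."""
    return f"(agent response to: {user_input})"

test_cases = load_test_cases("test_cases.csv")
# Collect (input, response) pairs; these are what the LLM judge scores.
results = [(case, call_agent(case)) for case in test_cases]
```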
More information about test case generation/expansion is available here.