LLM-as-a-Judge

Where is this feature available?

Hobby: Full
Pro: Full
Team: Full
Self Hosted: Enterprise Edition

LLM-as-a-judge is a technique for evaluating the quality of LLM applications by using an LLM as the judge. The LLM is given a trace or a dataset entry and asked to score the output and explain its reasoning. The resulting score and its reasoning are stored as a score in Langfuse.

What are common evaluation tasks?

LLM-as-a-judge evaluation tasks can be very use-case-specific. Common tasks for which Langfuse provides prebuilt prompts are:

Hallucination
Helpfulness
Relevance
Toxicity
Correctness
Context relevance
Context correctness
Conciseness

LLM-as-a-judge evaluators in Langfuse help to evaluate:

Production/development traces
Experiments that you run on datasets

Alternatively, you can run any custom evaluation functions or packages on Langfuse data via the API/SDKs.

Custom end-to-end example: External evaluation pipeline.

Get Started

Configure LLM provider

Langfuse supports a variety of LLM providers including OpenAI, Anthropic, Azure OpenAI, and AWS Bedrock.

To use LLM-as-a-judge, you have to configure your LLM provider in the Langfuse project settings.

Note: tool/function calling needs to be supported by the model for LLM-as-a-judge to work.

Create an LLM-as-a-judge template

LLM-as-a-judge uses a prompt template and a model configuration to evaluate traces. In Langfuse, this configuration is stored in an Evaluator Template so that it can be reused across multiple evaluators.

To help get you started, Langfuse includes a set of predefined prompts for common evaluation tasks, but you can also write your own or customize the Langfuse-provided prompts.

Prompt templates contain {{variables}} that are substituted with actual data when an evaluator is run. You can create an arbitrary number of custom variables that can later be referenced when creating the evaluator. Common variables are input, output, context, ground_truth, etc.
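As a rough illustration of how {{variables}} are filled in, the following sketch substitutes values into a template string. The `render_template` helper and the example template are hypothetical and are not part of the Langfuse SDK; Langfuse performs this substitution for you when an evaluator runs.

```python
import re

# Hypothetical helper (not a Langfuse API): fills {{variable}} placeholders.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(template: str, values: dict) -> str:
    """Replace each {{name}} with values[name]; fail loudly if one is missing."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value provided for template variable '{name}'")
        return str(values[name])

    return _PLACEHOLDER.sub(substitute, template)


# Illustrative evaluation prompt using the common variables input, context, output.
template = (
    "Evaluate whether the answer is supported by the context.\n"
    "Question: {{input}}\n"
    "Context: {{context}}\n"
    "Answer: {{output}}"
)

prompt = render_template(
    template,
    {
        "input": "What is the capital of France?",
        "context": "Paris is the capital of France.",
        "output": "Paris",
    },
)
print(prompt)
```

Raising on a missing variable, rather than leaving the placeholder in the prompt, makes a misconfigured mapping visible before the judge model is called.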

Langfuse uses function/tool calling to extract the evaluation output. At the bottom of the form, you can configure score and reasoning variables which will be used to instruct the LLM on how to score and reason about the evaluation.
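To make the role of the score and reasoning variables more concrete, here is a minimal sketch of what a tool definition and the parsing of its result could look like. The tool name `submit_evaluation`, the field descriptions, and the `parse_evaluation` helper are assumptions for illustration only; the exact schema Langfuse sends to the model, and the wrapper each provider requires around it, may differ.

```python
import json

# Hypothetical tool definition in JSON Schema style. The descriptions play the
# role of the score and reasoning instructions configured in the template form.
EVALUATION_TOOL = {
    "name": "submit_evaluation",
    "description": "Report the result of the evaluation.",
    "parameters": {
        "type": "object",
        "properties": {
            "score": {
                "type": "number",
                "description": "Score between 0 and 1, where 1 means fully hallucinated.",
            },
            "reasoning": {
                "type": "string",
                "description": "One-sentence justification for the score.",
            },
        },
        "required": ["score", "reasoning"],
    },
}


def parse_evaluation(arguments_json: str) -> tuple:
    """Validate the tool-call arguments and return (score, reasoning)."""
    args = json.loads(arguments_json)
    score = args["score"]
    reasoning = args["reasoning"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not isinstance(reasoning, str):
        raise ValueError("reasoning must be a string")
    return float(score), reasoning


# Simulated arguments, shaped like what a model might return in a tool call.
score, reasoning = parse_evaluation(
    '{"score": 0.2, "reasoning": "The answer is supported by the context."}'
)
print(score, reasoning)
```

Because the output arrives as structured arguments rather than free text, the score can be stored directly without parsing prose, which is why the model must support tool/function calling.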

Set up an evaluator

Now that you have created an evaluator template, you can configure which data Langfuse should apply it to.

Configure the following aspects:

Which Evaluator Template to use
Trigger: On what incoming data should the evaluator be executed?
- Traces: On new traces that are ingested into Langfuse. You can configure filters to select a subset of traces.
- Datasets: On all experiments run on a specific dataset in offline development.
Name of the scores which will be created as a result of the evaluation.
Specify how Langfuse should fill the {{variables}} in the template.
- Langfuse traces can be deeply nested (see conceptual overview). You can query from the trace directly, or from any nested observation via its name.
- Select whether to use the Input, Output, or metadata value.
Optional: Add sampling to reduce costs when running evaluations on a large volume of production data.
Optional: Configure a custom delay to ensure that all data has arrived at the Langfuse servers before the evaluation is executed. The delay starts when the trace is first added to Langfuse, at which point the trace might still be in progress. This is especially important for long-running agent executions.
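The variable mapping and sampling steps above can be sketched as follows. This is a simplified model under stated assumptions, not Langfuse's implementation: the trace shape, the `variable_mapping` format, and the `resolve_variables` and `should_evaluate` helpers are all hypothetical.

```python
import random

# Hypothetical, simplified trace; real Langfuse traces carry many more fields.
trace = {
    "input": "What is the capital of France?",
    "output": "Paris",
    "metadata": {"user_tier": "pro"},
    "observations": [
        {
            "name": "retrieval",
            "input": "capital of France",
            "output": "Paris is the capital of France.",
            "metadata": {},
        },
    ],
}

# Each template variable maps to (source, field). The source is either "trace"
# or the name of a nested observation; the field is input, output, or metadata.
variable_mapping = {
    "input": ("trace", "input"),
    "output": ("trace", "output"),
    "context": ("retrieval", "output"),
}


def resolve_variables(trace: dict, mapping: dict) -> dict:
    """Look up each variable on the trace or on a nested observation by name."""
    resolved = {}
    for variable, (source, field) in mapping.items():
        if source == "trace":
            obj = trace
        else:
            obj = next(
                (o for o in trace["observations"] if o["name"] == source), None
            )
            if obj is None:
                raise LookupError(f"no observation named '{source}' in trace")
        resolved[variable] = obj[field]
    return resolved


def should_evaluate(sampling_rate: float, rng: random.Random) -> bool:
    """Return True for roughly sampling_rate of calls (0.0 = never, 1.0 = always)."""
    return rng.random() < sampling_rate


rng = random.Random()
if should_evaluate(0.25, rng):  # evaluate about a quarter of matching traces
    print(resolve_variables(trace, variable_mapping))
```

With a sampling rate of 0.25, about one in four matching traces would be sent to the judge model, which trades evaluation coverage for lower cost.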

✨ Done! You have created an evaluator which will now automatically be executed on all data that matches the selected trigger.

Monitoring of Evaluators

Each evaluator has its own log page where you can view the progress and logs to potentially debug any issues.
