Evaluate LLM responses
Use Label Studio UI for LLM evaluation
Connect to Label Studio
Let’s connect to the running Label Studio instance. You need an API_KEY, which can be found in the Account & Settings -> API Key section.
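For example, with the Label Studio Python SDK (a minimal sketch; the URL and key below are placeholders):

```python
from label_studio_sdk import Client

# Placeholders: point these at your running Label Studio instance
LABEL_STUDIO_URL = 'http://localhost:8080'
API_KEY = '<your-api-key>'

# Connect to Label Studio and verify the connection
ls = Client(url=LABEL_STUDIO_URL, api_key=API_KEY)
ls.check_connection()
```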
Different LLM Evaluation Strategies
There are several strategies to evaluate LLM responses, depending on the complexity of the system and specific evaluation goals.
Response Moderation
The simplest form of LLM system evaluation is to moderate a single response generated by the LLM. When a user interacts with the model, you can import the user prompt and the model response into a Label Studio UI designed for the response moderation task.
Let’s create a project for moderating LLM responses.
Create a project
To create a project, you need to specify the label_config that defines the labeling interface and the label ontology.
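For a moderation task, the config could look like the following sketch (tag names and choice values are illustrative, not the exact template; adapt them to your own moderation taxonomy):

```python
# Illustrative labeling config: show the prompt and the response,
# and ask the annotator to classify the response
label_config = """
<View>
  <Header value="User prompt"/>
  <Text name="prompt" value="$prompt"/>
  <Header value="LLM response"/>
  <Text name="response" value="$response"/>
  <Choices name="moderation" toName="response" choice="single" showInline="true">
    <Choice value="Acceptable"/>
    <Choice value="Needs review"/>
    <Choice value="Harmful"/>
  </Choices>
</View>
"""
```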
After the config is defined, create a new project with:
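A minimal sketch with the Python SDK, assuming the ls client from the connection step and the label_config defined above:

```python
# Create a new Label Studio project with the moderation labeling config
project = ls.start_project(
    title='LLM Response Moderation',
    label_config=label_config,
)
```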
Get LLM response
To create an evaluation task from an LLM response and import it into the Label Studio project you created, you can use the following format:
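Each task is a plain dict whose keys match the $variables used in the labeling config ($prompt and $response in the sketch above); the values here are examples:

```python
# Task data keys must match the $variables in the labeling config
task = {
    'prompt': 'What is the capital of France?',
    'response': 'The capital of France is Paris.',
}
```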
For example, you can obtain the response from the OpenAI API:
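A minimal sketch using the openai Python package (the model name is only an example):

```python
import os
from openai import OpenAI

# The client reads OPENAI_API_KEY from the environment by default
client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

prompt = 'What is the capital of France?'
completion = client.chat.completions.create(
    model='gpt-4o-mini',  # example model name
    messages=[{'role': 'user', 'content': prompt}],
)
response = completion.choices[0].message.content

# Put `prompt` and `response` into the task format shown above
task = {'prompt': prompt, 'response': response}
```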
Ensure you have the OpenAI API key set in the environment variable OPENAI_API_KEY.
Response Grading
Sometimes it is useful to assign a grade to the LLM response based on the quality of the generated text.
Let’s create a project for grading LLM summarization capabilities. Copy the following template to create a project in Label Studio:
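One possible configuration (a sketch, not the exact template): show the original text and the model’s summary, and ask the annotator to rate the summary on a 1-5 scale.

```python
# Illustrative labeling config for grading LLM summaries
label_config = """
<View>
  <Header value="Original text"/>
  <Text name="text" value="$text"/>
  <Header value="LLM summary"/>
  <Text name="summary" value="$summary"/>
  <Rating name="grade" toName="summary" maxRating="5" icon="star"/>
</View>
"""
```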
Use this configuration to create a new project, in the same way as for the moderation project.
Get LLM response
The LLM responses should be collected in the following format:
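Matching the $text and $summary variables from the config sketch above (values are illustrative):

```python
# Task data for the summarization grading project
task = {
    'text': 'Label Studio is an open source data labeling platform that ...',
    'summary': 'Label Studio is an open source tool for labeling data.',
}
```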
Side-by-Side Comparison
Sometimes you need to compare two different model responses or compare the model response with the ground truth.
Let’s create a project for side-by-side comparison of LLM responses.
Copy the following configuration:
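A sketch using the Pairwise tag to pick the better of two responses to the same prompt (field names are illustrative):

```python
# Illustrative labeling config for side-by-side comparison of two LLM responses
label_config = """
<View>
  <Header value="Prompt"/>
  <Text name="prompt" value="$prompt"/>
  <Header value="Response A"/>
  <Text name="answer1" value="$answer1"/>
  <Header value="Response B"/>
  <Text name="answer2" value="$answer2"/>
  <Pairwise name="comparison" toName="answer1,answer2"/>
</View>
"""
```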
Use this configuration to create a new project, as before.
Get LLM responses
Use the following format to get LLM responses and import them into the created Label Studio project:
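For example, with the field names from the sketch above (the two answers could come from two different models, or one of them could be the ground truth):

```python
# Task data for side-by-side comparison: one prompt, two candidate responses
task = {
    'prompt': 'Explain what a RAG pipeline is in one sentence.',
    'answer1': 'A RAG pipeline retrieves relevant documents and passes them to an LLM so the answer is grounded in them.',
    'answer2': 'RAG is a type of neural network architecture.',
}
```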
Read more about the Label Studio template for pairwise comparison.
Evaluating the RAG Pipeline
When dealing with a RAG (Retrieval-Augmented Generation) pipeline, your goal is not only to evaluate a single LLM response, but also to incorporate assessments of the retrieved documents, such as contextual relevancy, answer relevancy, and faithfulness.
Let’s start by creating a Label Studio interface to visualize and evaluate various aspects of the RAG pipeline.
Here we present a simple configuration that aims to evaluate:
- Contextual relevancy of the retrieved documents
- Answer relevancy
- Answer faithfulness
Copy the following template:
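A sketch of such a configuration (variable names, choice values, and wording are illustrative):

```python
# Illustrative labeling config for RAG evaluation: show the question, the
# retrieved context, and the generated answer, and collect three judgments
label_config = """
<View>
  <Header value="Question"/>
  <Text name="question" value="$question"/>
  <Header value="Retrieved context"/>
  <Text name="context" value="$context"/>
  <Header value="Generated answer"/>
  <Text name="answer" value="$answer"/>

  <Header value="Is the retrieved context relevant to the question?"/>
  <Choices name="context_relevancy" toName="context" choice="single" showInline="true">
    <Choice value="Relevant"/>
    <Choice value="Partially relevant"/>
    <Choice value="Irrelevant"/>
  </Choices>

  <Header value="Is the answer relevant to the question?"/>
  <Choices name="answer_relevancy" toName="answer" choice="single" showInline="true">
    <Choice value="Relevant"/>
    <Choice value="Irrelevant"/>
  </Choices>

  <Header value="Is the answer faithful to the retrieved context?"/>
  <Choices name="faithfulness" toName="answer" choice="single" showInline="true">
    <Choice value="Faithful"/>
    <Choice value="Partially faithful"/>
    <Choice value="Hallucinated"/>
  </Choices>
</View>
"""
```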
Create a new project with this configuration as the label_config, in the same way as in the previous examples.
Get RAG pipeline response
Here is an example of the task data format to be imported into the Label Studio project:
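Assuming the config sketch above, a task could look like this (values are made up for illustration):

```python
# Task data for RAG evaluation: the user question, the retrieved context,
# and the answer generated by the pipeline
task = {
    'question': 'How do I reset my API key?',
    'context': 'Issue: API key rotation. The key can be regenerated from Account & Settings -> API Key ...',
    'answer': 'Open Account & Settings -> API Key and regenerate the key.',
}
```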
For example, you can collect such data using the LlamaIndex framework.
We will use a RAG pipeline to answer user queries about GitHub issues:
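A minimal sketch with LlamaIndex: here the GitHub issues are assumed to be already fetched as plain text (downloading them is out of scope), and the default in-memory vector index is used, which requires an OpenAI key for embeddings and generation unless you configure other models.

```python
from llama_index.core import Document, VectorStoreIndex

# Assume `issues` is a list of (title, body) tuples fetched from your repository
issues = [
    ('API key rotation', 'The API key can be regenerated from Account & Settings ...'),
    # ...
]
documents = [Document(text=f'{title}\n{body}') for title, body in issues]

# Build an in-memory vector index over the issues and turn it into a query engine
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=3)

question = 'How do I reset my API key?'
response = query_engine.query(question)
```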
Now we can construct a task that can be imported directly into the Label Studio project, given the labeling configuration described above:
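Continuing the sketch: the retrieved chunks are concatenated into the context field, and the generated text becomes the answer field.

```python
# Build a Label Studio task from the query, the retrieved chunks, and the answer
task = {
    'question': question,
    'context': '\n\n'.join(node.node.get_content() for node in response.source_nodes),
    'answer': str(response),
}
```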
Create Evaluation Task
Picking one of the evaluation strategies above, you can now upload your task to the created Label Studio project:
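For example, with the SDK (assuming the project and task objects from the previous steps):

```python
# Import the task into the project; import_tasks accepts a list of task dicts
project.import_tasks([task])
```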
Now open the Label Studio UI and navigate to http://localhost:8080/projects/{project.id}/data?labeling=1 to start the LLM evaluation.
Collect Annotated Data
The final step is to collect the annotated data from the Label Studio project. You can export the annotations in various formats like JSON, CSV, or directly to cloud storage providers.
You can also use the Python SDK to retrieve the annotations. For example, to collect and display all user choices from the project:
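A sketch using the SDK JSON export (which result types and names you get depends on the project’s labeling config):

```python
# Export all tasks with their annotations as JSON
labeled_tasks = project.export_tasks(export_type='JSON')

# Print every choice submitted by annotators
for t in labeled_tasks:
    for annotation in t.get('annotations', []):
        for result in annotation.get('result', []):
            if result.get('type') == 'choices':
                print(result['from_name'], '->', result['value']['choices'])
```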