Let’s connect to the running Label Studio instance.
For information on finding your base URL and API key, see Authenticate and Connect to the API.
Also see this example notebook.
Label Studio Enterprise users can now generate chats with an LLM directly from the Label Studio interface. For more information, see the Chat tag.
There are several strategies to evaluate LLM responses, depending on the complexity of the system and specific evaluation goals.
The simplest form of LLM system evaluation is to moderate a single response generated by the LLM. When user interacts with the model, you can import user prompt and the model response into Label Studio UI designed for response moderation task.
Let’s create a project for moderating LLM responses.
To create a project, you need to specify the label_config that defines the labeling interface and the labels
ontology.
After config is defined, create a new project with:
To create evaluation tasks from LLM responses and import it into the created Label Studio project, you can follow this format:
For example, you can obtain the response from the OpenAI API:
Ensure you have the OpenAI API key set in the environment variable OPENAI_API_KEY.
Sometimes it is useful to assign a grade to the LLM response based on the quality of the generated text.
Let’s create a project for grading LLM summarization capabilities. Copy the following template to create a project in Label Studio:
Use this configuration to create a new project:
The LLM responses should be collected in the following format:
Sometimes you need to compare two different model responses or compare the model response with the ground truth.
Let’s create a project for side-by-side comparison of LLM responses.
Copy the following configuration:
Use this configuration to create a new project:
Use the following format to get LLM responses and import them into the created Label Studio project:
Read more about Label Studio template for pairwise comparison.
When dealing with RAG (Retrieval-Augmented Generation) pipeline, you goal is not only evaluating a single LLM response, but also incorporating various assessments of the retrieved documents like contextual and answer relevancy and faithfulness.
Let’s start with creating a Label Studio interface to visualize and evaluate various aspects of RAG pipeline.
Here we present a simple configuration that aims to evaluate:
Copy the following template:
Copy the configuration and create a new project replacing label_config with the provided configuration:
Here is an example of task data format to be imported into Label Studio project:
For example, you can collect such data using the LlamaIndex framework.
We will use RAG pipeline to answer user queries regarding GitHub issues:
Now we can construct the task that can be directly imported in Label Studio project given the labeling configuration described above:
Picking one of the provided evaluation strategies, you can now upload your task to created Label Studio project:
Now open the Label Studio UI and navigate to http://localhost:8080/projects/{project.id}/data?labeling=1 to start LLM evaluation.
The final step is to collect the annotated data from the Label Studio project. You can export the annotations in various formats like JSON, CSV, or directly to cloud storage providers.
You can also use a Python SDK to retrieve the annotations. For example, to collect and display all user choices from the project: