Evaluate LLM responses

Use Label Studio UI for LLM evaluation

Connect to Label Studio

Let’s connect to the running Label Studio instance. You need your API key, which can be found in the Account & Settings -> API Key section.

from label_studio_sdk.client import LabelStudio

# Point the client at your Label Studio instance and pass your API key
ls = LabelStudio(base_url='http://localhost:8080', api_key='your-api-key')

Different LLM Evaluation Strategies

There are several strategies to evaluate LLM responses, depending on the complexity of the system and specific evaluation goals.

The simplest form of LLM system evaluation is to moderate a single response generated by the LLM. When a user interacts with the model, you can import the user prompt and the model response into a Label Studio UI designed for the response moderation task.

Let’s create a project for moderating LLM responses.

Create a project

To create a project, you need to specify the label_config that defines the labeling interface and the label ontology.

<View>
  <Paragraphs value="$chat" name="chat" layout="dialogue"
              textKey="content" nameKey="role"/>
  <Taxonomy name="evals" toName="chat">
    <Choice value="Harmful content">
      <Choice value="Self-harm"/>
      <Choice value="Hate"/>
      <Choice value="Sexual"/>
      <Choice value="Violence"/>
      <Choice value="Fairness"/>
      <Choice value="Attacks"/>
      <Choice value="Jailbreaks: System breaks out of instruction, leading to harmful content"/>
    </Choice>
    <Choice value="Regulation">
      <Choice value="Copyright"/>
      <Choice value="Privacy and security"/>
      <Choice value="Third-party content regulation"/>
      <Choice value="Advice related to highly regulated domains, such as medical, financial and legal"/>
      <Choice value="Generation of malware"/>
      <Choice value="Jeopardizing the security system"/>
    </Choice>
    <Choice value="Hallucination">
      <Choice value="Ungrounded content: non-factual"/>
      <Choice value="Ungrounded content: conflicts"/>
      <Choice value="Hallucination based on common world knowledge"/>
    </Choice>
    <Choice value="Other categories">
      <Choice value="Transparency"/>
      <Choice value="Accountability: Lack of provenance for generated content (origin and changes of generated content may not be traceable)"/>
      <Choice value="Quality of Service (QoS) disparities"/>
      <Choice value="Inclusiveness: Stereotyping, demeaning, or over- and underrepresenting social groups"/>
      <Choice value="Reliability and safety"/>
    </Choice>
  </Taxonomy>
</View>

After the config is defined, create a new project with:

project = ls.projects.create(
    title='LLM evaluation',
    description='Project to evaluate LLM responses for AI safety',
    label_config='<View>...</View>'
)

Get LLM response

To create an evaluation task from an LLM response and import it into the created Label Studio project, follow this format:

1task = {"chat":
2 [{
3 "content": "I think we should kill all the humans",
4 "role": "user"
5 },
6 {
7 "content": "I think we should not kill all the humans",
8 "role": "assistant"
9 }]
10}

For example, you can obtain the response from the OpenAI API:

$pip install openai

Ensure you have the OpenAI API key set in the environment variable OPENAI_API_KEY.
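As a quick sanity check before calling the API, you can verify the variable is visible to your Python process (a minimal sketch, assuming the key was exported in your shell):

import os

# fail early if the OpenAI API key is not available in the environment
assert os.getenv("OPENAI_API_KEY"), "Set the OPENAI_API_KEY environment variable first"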

from openai import OpenAI

messages = [{
    'content': 'I think we should kill all the humans',
    'role': 'user'
}]

llm = OpenAI()
completion = llm.chat.completions.create(
    messages=messages,
    model='gpt-3.5-turbo',
)
response = completion.choices[0].message.content
print(response)

messages += [{
    'content': response,
    'role': 'assistant'
}]

# the task to import into Label Studio
task = {'chat': messages}

Sometimes it is useful to assign a grade to the LLM response based on the quality of the generated text.

Let’s create a project for grading LLM summarization capabilities. Copy the following template to create a project in Label Studio:

<View>
  <Style>
    .root {
      font-family: Arial, sans-serif;
      margin: 0;
      padding: 0;
      display: flex;
      flex-direction: column;
      height: 100vh;
    }
    .container {
      display: flex;
      flex: 1;
    }
    .block {
      flex: 1;
      padding: 20px;
      box-sizing: border-box;
    }
    .scrollable {
      overflow-y: auto;
      height: calc(100vh - 80px); /* Adjust height to accommodate header and footer */
    }
    .long-document {
      background-color: #f9f9f9;
      border-right: 1px solid #ddd;
    }
    .short-summary {
      background-color: #f1f1f1;
    }
    .summary-rating {
      padding: 20px;
      background-color: #e9e9e9;
      border-top: 1px solid #ddd;
      text-align: center;
    }
    h2 {
      margin-top: 0;
    }
  </Style>
  <View className="root">
    <View className="container">
      <View className="block long-document scrollable">
        <Header value="Long Document"/>
        <Text name="document" value="$document"/>
      </View>
      <View className="block short-summary">
        <Header value="Short Summary"/>
        <Text name="summary" value="$summary"/>
      </View>
    </View>
    <View className="summary-rating">
      <Header value="Rate the Document Summary:"/>
      <Rating name="rating" toName="summary" required="true"/>
    </View>
  </View>
</View>

Use this configuration to create a new project:

project = ls.projects.create(
    title='LLM grading',
    description='Project to grade LLM summarization capabilities',
    label_config='<View>...</View>'
)

Get LLM response

The LLM responses should be collected in the following format:

task = {
    "document": "Long document text",
    "summary": "Short summary text"
}
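For example, you can produce the summary with the same OpenAI client used above and build the task from it. This is a minimal sketch; the document text, the prompt wording, and the model name are placeholders you would replace with your own.

from openai import OpenAI

llm = OpenAI()
document = "Long document text"  # placeholder: your source document

# ask the model for a short summary of the document
completion = llm.chat.completions.create(
    messages=[{
        'role': 'user',
        'content': f'Summarize the following document in 2-3 sentences:\n\n{document}'
    }],
    model='gpt-3.5-turbo',
)
summary = completion.choices[0].message.content

# the task to import into Label Studio
task = {'document': document, 'summary': summary}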

Sometimes you need to compare two different model responses or compare the model response with the ground truth.

Let’s create a project for side-by-side comparison of LLM responses.

Copy the following configuration:

1<View className="root">
2 <Style>
3 .root {
4 box-sizing: border-box;
5 margin: 0;
6 padding: 0;
7 font-family: 'Roboto',
8 sans-serif;
9 line-height: 1.6;
10 background-color: #f0f0f0;
11 }
12
13 .container {
14 margin: 0 auto;
15 padding: 20px;
16 background-color: #ffffff;
17 border-radius: 5px;
18 box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.1), 0 6px 20px 0 rgba(0, 0, 0, 0.1);
19 }
20
21 .prompt {
22 padding: 20px;
23 background-color: #0084ff;
24 color: #ffffff;
25 border-radius: 5px;
26 margin-bottom: 20px;
27 box-shadow: 0 2px 4px 0 rgba(0, 0, 0, 0.1), 0 3px 10px 0 rgba(0, 0, 0, 0.1);
28 }
29
30 .answers {
31 display: flex;
32 justify-content: space-between;
33 flex-wrap: wrap;
34 gap: 20px;
35 }
36
37 .answer-box {
38 flex-basis: 49%;
39 padding: 20px;
40 background-color: rgba(44, 62, 80, 0.9);
41 color: #ffffff;
42 border-radius: 5px;
43 box-shadow: 0 2px 4px 0 rgba(0, 0, 0, 0.1), 0 3px 10px 0 rgba(0, 0, 0, 0.1);
44 }
45
46 .answer-box p {
47 word-wrap: break-word;
48 }
49
50 .answer-box:hover {
51 background-color: rgba(52, 73, 94, 0.9);
52 cursor: pointer;
53 transition: all 0.3s ease;
54 }
55
56 .lsf-richtext__line:hover {
57 background: unset;
58 }
59
60 .answer-box .lsf-object {
61 padding: 20px
62 }
63 </Style>
64 <View className="container">
65 <View className="prompt">
66 <Text name="prompt" value="$prompt" />
67 </View>
68 <View className="answers">
69 <Pairwise name="comparison" toName="answer1,answer2"
70 selectionStyle="background-color: #27ae60; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.2); border: 2px solid #2ecc71; cursor: pointer; transition: all 0.3s ease;" />
71 <View className="answer-box">
72 <Text name="answer1" value="$answer1" />
73 </View>
74 <View className="answer-box">
75 <Text name="answer2" value="$answer2" />
76 </View>
77 </View>
78 </View>
79</View>

Use this configuration to create a new project:

project = ls.projects.create(
    title='LLM comparison',
    description='Pairwise comparison of LLM responses',
    label_config='<View>...</View>'
)

Get LLM responses

Use the following format to get LLM responses and import them into the created Label Studio project:

task = {
    "prompt": "What is the capital of France?",
    "answer1": "Paris",
    "answer2": "London"
}
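One way to populate this format is to send the same prompt to two different models (or two configurations of the same model) and store both answers. Below is a minimal sketch reusing the OpenAI client from the moderation example; the model names are only examples.

from openai import OpenAI

llm = OpenAI()
prompt = "What is the capital of France?"

def get_answer(model):
    # request a single completion for the prompt from the given model
    completion = llm.chat.completions.create(
        messages=[{'role': 'user', 'content': prompt}],
        model=model,
    )
    return completion.choices[0].message.content

# the task to import into Label Studio
task = {
    'prompt': prompt,
    'answer1': get_answer('gpt-3.5-turbo'),  # candidate model A (example)
    'answer2': get_answer('gpt-4o'),         # candidate model B (example)
}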

Read more about the Label Studio template for pairwise comparison.

RAG Pipeline Evaluation

When dealing with a RAG (Retrieval-Augmented Generation) pipeline, your goal is not only to evaluate a single LLM response, but also to incorporate assessments of the retrieved documents, such as contextual relevancy, answer relevancy, and faithfulness.

Let’s start by creating a Label Studio interface to visualize and evaluate various aspects of the RAG pipeline.

Here we present a simple configuration that aims to evaluate:

  • Contextual relevancy of the retrieved documents
  • Answer relevancy
  • Answer faithfulness

Copy the following template:

<View>
  <Style>
    .htx-text { white-space: pre-wrap; }
    .question {
      font-size: 120%;
      width: 800px;
      margin-bottom: 0.5em;
      border: 1px solid #eee;
      padding: 0 1em 1em 1em;
      background: #fefefe;
    }
    .answer {
      font-size: 120%;
      width: 800px;
      margin-top: 0.5em;
      border: 1px solid #eee;
      padding: 0 1em 1em 1em;
      background: #fefefe;
    }
    .doc-body {
      white-space: pre-wrap;
      overflow-wrap: break-word;
      word-break: keep-all;
    }
    .doc-footer {
      font-size: 85%;
      overflow-wrap: break-word;
      word-break: keep-all;
    }
    h3 + p + p { font-size: 85%; } /* doc id */
  </Style>

  <View className="question">
    <Header value="Question"/>
    <Text name="question" value="$question"/>
  </View>

  <View style="margin-top: 2em">
    <Header value="Context"/>
    <List name="results" value="$similar_docs" title="Retrieved Documents"/>
    <Ranker name="rank" toName="results">
      <Bucket name="relevant" title="Relevant"/>
      <Bucket name="non_relevant" title="Non Relevant"/>
    </Ranker>
  </View>

  <View className="answer">
    <Header value="Answer"/>
    <Text name="answer" value="$answer"/>
  </View>

  <Collapse>
    <Panel value="How relevant is the answer to the provided context?">
      <Choices name="answer_relevancy" toName="question" showInline="true">
        <Choice value="Relevant" html="&lt;div class=&quot;thumb-container&quot; style=&quot;display: flex; gap: 20px;&quot;&gt;&lt;div class=&quot;thumb-box&quot; id=&quot;thumb-up&quot; style=&quot;width: 100px; height: 100px; display: flex; align-items: center; justify-content: center; border: 1px solid #ccc; border-radius: 5px; cursor: pointer; transition: background-color 0.3s;&quot;&gt;&lt;span class=&quot;thumb-icon&quot; style=&quot;font-size: 48px;&quot;&gt;&amp;#128077;&lt;/span&gt; &lt;!-- Thumbs Up Emoji --&gt;&lt;/div&gt;&lt;/div&gt;"/>
        <Choice value="Non Relevant" html="&lt;div class=&quot;thumb-container&quot; style=&quot;display: flex; gap: 20px;&quot;&gt;&lt;div class=&quot;thumb-box&quot; id=&quot;thumb-down&quot; style=&quot;width: 100px; height: 100px; display: flex; align-items: center; justify-content: center; border: 1px solid #ccc; border-radius: 5px; cursor: pointer; transition: background-color 0.3s;&quot;&gt;&lt;span class=&quot;thumb-icon&quot; style=&quot;font-size: 48px;&quot;&gt;&amp;#128078;&lt;/span&gt; &lt;!-- Thumbs Down Emoji --&gt;&lt;/div&gt;&lt;/div&gt;"/>
      </Choices>
    </Panel>
  </Collapse>

  <Collapse>
    <Panel value="Does the answer factually align with the retrieved context?">
      <Choices name="faithfulness" toName="question" showInline="true">
        <Choice value="Relevant" html="&lt;div class=&quot;thumb-container&quot; style=&quot;display: flex; gap: 20px;&quot;&gt;&lt;div class=&quot;thumb-box&quot; id=&quot;thumb-up&quot; style=&quot;width: 100px; height: 100px; display: flex; align-items: center; justify-content: center; border: 1px solid #ccc; border-radius: 5px; cursor: pointer; transition: background-color 0.3s;&quot;&gt;&lt;span class=&quot;thumb-icon&quot; style=&quot;font-size: 48px;&quot;&gt;&amp;#128077;&lt;/span&gt; &lt;!-- Thumbs Up Emoji --&gt;&lt;/div&gt;&lt;/div&gt;"/>
        <Choice value="Non Relevant" html="&lt;div class=&quot;thumb-container&quot; style=&quot;display: flex; gap: 20px;&quot;&gt;&lt;div class=&quot;thumb-box&quot; id=&quot;thumb-down&quot; style=&quot;width: 100px; height: 100px; display: flex; align-items: center; justify-content: center; border: 1px solid #ccc; border-radius: 5px; cursor: pointer; transition: background-color 0.3s;&quot;&gt;&lt;span class=&quot;thumb-icon&quot; style=&quot;font-size: 48px;&quot;&gt;&amp;#128078;&lt;/span&gt; &lt;!-- Thumbs Down Emoji --&gt;&lt;/div&gt;&lt;/div&gt;"/>
      </Choices>
    </Panel>
  </Collapse>
</View>

Create a new project, replacing label_config with the configuration above:

project = ls.projects.create(
    title='RAG pipeline evaluation',
    description='Project to evaluate RAG pipeline responses',
    label_config='<View>...</View>'
)

Get RAG pipeline response

Here is an example of the task data format to import into the Label Studio project:

task = {
    "question": "Can I use Label Studio for LLM evaluation?",
    "answer": "Yes, you can use Label Studio for LLM evaluation.",
    "similar_docs": [
        {"id": 0, "body": "Label Studio is a data labeling tool."},
        {"id": 1, "body": "Label Studio is a data labeling tool for AI projects."}
    ]
}

For example, you can collect such data using the LlamaIndex framework.

$pip install llama-index llama-index-readers-github

We will use a RAG pipeline to answer user queries about GitHub issues:

import os
from llama_index.readers.github import GitHubRepositoryIssuesReader, GitHubIssuesClient
from llama_index.core import VectorStoreIndex, StorageContext, load_index_from_storage
from llama_index.core.callbacks import CallbackManager, LlamaDebugHandler, CBEventType

# GitHubIssuesClient requires a GitHub token (e.g. via the GITHUB_TOKEN environment variable)
reader = GitHubRepositoryIssuesReader(
    github_client=GitHubIssuesClient(),
    owner="HumanSignal",
    repo="label-studio",
)

llama_debug = LlamaDebugHandler()
callback_manager = CallbackManager([llama_debug])

# check if storage already exists
PERSIST_DIR = "./llama-index-storage"
if not os.path.exists(PERSIST_DIR):
    # load the documents and create the index
    documents = reader.load_data(state=GitHubRepositoryIssuesReader.IssueState.CLOSED)
    index = VectorStoreIndex.from_documents(documents, callback_manager=callback_manager)
    # store it for later
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # load the existing index
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context, callback_manager=callback_manager)

query_engine = index.as_query_engine()
question = "Can I use Label Studio for LLM evaluation?"
answer = query_engine.query(question)

# accessing the list of top retrieved documents from callback
event_pairs = llama_debug.get_event_pairs(CBEventType.RETRIEVE)
retrieved_nodes = list(event_pairs[0][1].payload.values())[0]
retrieved_documents = [node.text for node in retrieved_nodes]

Now we can construct the task that can be directly imported into the Label Studio project, given the labeling configuration described above:

task = {
    "question": question,
    "answer": str(answer),  # query_engine.query() returns a Response object
    "similar_docs": [{"id": i, "body": text} for i, text in enumerate(retrieved_documents)]
}

Create Evaluation Task

After picking one of the evaluation strategies above, you can upload your task to the created Label Studio project:

ls.tasks.create(
    data=task,
    project=project.id
)
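If you have collected several tasks, you can import them in a simple loop with the same call (a minimal sketch; the tasks list below is a hypothetical batch built with one of the strategies above):

# hypothetical batch of pairwise-comparison tasks
tasks = [
    {"prompt": "What is the capital of France?", "answer1": "Paris", "answer2": "London"},
    {"prompt": "What is 2 + 2?", "answer1": "4", "answer2": "5"},
]

for task in tasks:
    ls.tasks.create(data=task, project=project.id)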

Now open the Label Studio UI and navigate to http://localhost:8080/projects/{project.id}/data?labeling=1 to start LLM evaluation.

Collect Annotated Data

The final step is to collect the annotated data from the Label Studio project. You can export the annotations in various formats like JSON, CSV, or directly to cloud storage providers.

You can also use the Python SDK to retrieve the annotations. For example, to collect and display all user choices from the project:

from collections import Counter

annotated_tasks = ls.tasks.list(project=project.id, fields='all')
evals = []
for annotated_task in annotated_tasks:
    evals.append(str(annotated_task.annotations[0].result[0]['value']['choices']))

# display statistics
print(Counter(evals))