TaxEval
Ever wondered if an LLM can do your taxes? In this example, we'll show you how to investigate the results of the taxeval benchmark by uploading them to Zeno. If you want to run the benchmark on your own models, please refer to the original repository.
Dependencies
Let's start by installing the required dependencies for this project:
pip install pandas zeno-client python-dotenv bertopic
Imports
With the dependencies installed, we can start running our analysis code and uploading data to Zeno. We'll first import the libraries we're going to use:
import pandas as pd
import json
import os
from dotenv import load_dotenv
from zeno_client import ZenoClient, ZenoMetric
Loading the Data
We'll be uploading already processed model outputs. You can get these here; alternatively, you can run your own models following this README.
# Use a context manager so the file handle is closed after reading.
with open("tax-benchmark.json") as f:
    data = json.load(f)
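The code in the following sections reads a handful of fields from every record. As a rough guide, the sketch below shows the record shape that code appears to assume, inferred only from the keys it accesses. All values, both model names, and the `missing_keys` helper are hypothetical and do not come from the benchmark file.

```python
# Hypothetical record: the keys mirror what the later code reads,
# but every value (and both model names) is invented for illustration.
sample_record = {
    "source_question": {
        "description": "Which filing status applies?",
        "options": ["Single", "Married filing jointly"],
        "correct_answer": 2,
        "reference": "hypothetical reference text",
        "tag": "hypothetical-tag",
        "category": "hypothetical-category",
    },
    # One entry per model: the full response and a simplified answer.
    "full": {"org/model-a": "The answer is 2 because ...", "org/model-b": "1"},
    "simplified": {"org/model-a": 2, "org/model-b": 1},
}


def missing_keys(record):
    """Return the expected top-level keys that a record lacks (hypothetical helper)."""
    expected = ("source_question", "full", "simplified")
    return [key for key in expected if key not in record]


data_sketch = [sample_record]
model_names = sorted(data_sketch[0]["full"].keys())
missing = missing_keys(sample_record)
print(model_names, missing)
```

Running a check like this on the first record of your own file is a quick way to catch a mismatched schema before building the DataFrame.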
Formatting the Data
Before we can upload the data to Zeno, we need to convert it to a pandas DataFrame and format some of the fields so that our Zeno project has all the information we'll need.
def format_question(record):
    # Turn escaped "\n" sequences in the description into real line breaks.
    return_question = record["source_question"]["description"].replace("\\n", "\n")
    return_question += "\n\n"
    # Append the answer options as a numbered list starting at 1.
    for number, option in enumerate(record["source_question"]["options"], start=1):
        return_question += f"{number}. {option}\n"
    return return_question
df_input = pd.DataFrame(
    {
        "question": [format_question(d) for d in data],
        "answer": [str(d["source_question"]["correct_answer"]) for d in data],
        "reference": [d["source_question"]["reference"] for d in data],
        "tag": [d["source_question"]["tag"] for d in data],
        "category": [d["source_question"]["category"] for d in data],
    }
)
df_input["question length"] = df_input["question"].str.len()
df_input["id"] = df_input.index
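To see what the formatting step produces, here is a minimal sketch that applies the same logic to a single made-up record. The record contents are hypothetical, and the function is repeated so that the snippet runs on its own.

```python
def format_question(record):
    # Turn escaped "\n" sequences in the description into real line breaks.
    question = record["source_question"]["description"].replace("\\n", "\n")
    question += "\n\n"
    # Append the answer options as a numbered list starting at 1.
    for number, option in enumerate(record["source_question"]["options"], start=1):
        question += f"{number}. {option}\n"
    return question


# Hypothetical record; the description contains a literal backslash followed by "n".
record = {
    "source_question": {
        "description": "Line one.\\nLine two.",
        "options": ["Yes", "No"],
    }
}

formatted = format_question(record)
print(formatted)
```

The result is a markdown-friendly string: the description, a blank line, then the numbered options, which is what ends up in the "question" column.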
Optional: Generate Topics
If you want, you can also add topic information to the data using bertopic as follows:
from bertopic import BERTopic

topic_model = BERTopic(language="english", min_topic_size=3)
topics, probs = topic_model.fit_transform(
    [d["source_question"]["description"] for d in data]
)
# Store topic ids as strings so they are treated as categories, not numbers.
df_input["topic"] = topics
df_input["topic"] = df_input["topic"].astype(str)
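Before uploading, it can help to look at how the questions are spread across topics. The sketch below uses a made-up list of topic ids in place of the real `topics` output and relies on BERTopic's convention of assigning the id -1 to documents it treats as outliers.

```python
import pandas as pd

# Hypothetical topic ids standing in for the `topics` list from fit_transform.
topics_sketch = [0, 1, -1, 0, 2, 0, -1, 1]

# Same string conversion as applied to df_input["topic"].
topic_col = pd.Series(topics_sketch, name="topic").astype(str)

# Number of questions per topic, largest first.
topic_sizes = topic_col.value_counts()

# Fraction of questions that landed in the outlier bucket "-1".
outlier_share = (topic_col == "-1").mean()

print(topic_sizes)
print(f"outlier share: {outlier_share:.2f}")
```

If a large share of questions ends up in "-1", adjusting `min_topic_size` may be worth trying, though the right value depends on your data.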
Create a Zeno Project
We can now upload our data to a Zeno project.
You will need your ZENO_API_KEY here, which you can generate by clicking on your profile at the top right to navigate to your account page.
Once you have your API key, you can authenticate with the Zeno client and create a project as follows:
# Assumption: the key is stored as ZENO_API_KEY in the environment or a local .env file.
load_dotenv()
client = ZenoClient(os.environ["ZENO_API_KEY"])
project = client.create_project(
    name="LLM Taxes Benchmark",
    view={
        "data": {"type": "markdown"},
        "label": {"type": "text"},
        "output": {"type": "markdown"},
    },
    description="Tax questions for LLMs",
    public=True,
    metrics=[
        ZenoMetric(name="accuracy", type="mean", columns=["correct"]),
        ZenoMetric(name="output length", type="mean", columns=["output length"]),
    ],
)
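Both metrics above are means over a column that each system upload provides. As a rough local illustration, assuming a mean-type metric simply averages the named column, the sketch below computes the same two numbers with pandas on made-up values. It says nothing about how Zeno computes them internally.

```python
import pandas as pd

# Hypothetical per-question results for one system.
df_sketch = pd.DataFrame(
    {
        "correct": [True, False, True, True],
        "output length": [120, 80, 200, 100],
    }
)

# The mean of a boolean column is the fraction of True values, i.e. accuracy.
accuracy = df_sketch["correct"].mean()

# The mean of the length column is the average response length in characters.
mean_output_length = df_sketch["output length"].mean()

print(accuracy, mean_output_length)
```

This is why the "correct" column can stay a plain boolean: averaging it already yields the accuracy.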
Uploading Data
Now that all of this is set up, we can upload the data that we formatted into a DataFrame earlier.
project.upload_dataset(
    df_input, id_column="id", data_column="question", label_column="answer"
)
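A quick sanity check before uploading can save a round trip. The helper below is a hypothetical sketch, not part of zeno-client. It assumes that the id column should be unique and that labels should not be missing, which seems reasonable for an identifier and a ground-truth column.

```python
import pandas as pd


def check_dataset(df, id_column, data_column, label_column):
    """Return a list of human-readable problems found in df (hypothetical helper)."""
    problems = []
    for column in (id_column, data_column, label_column):
        if column not in df.columns:
            problems.append(f"missing column: {column}")
    if id_column in df.columns and df[id_column].duplicated().any():
        problems.append(f"duplicate values in {id_column}")
    if label_column in df.columns and df[label_column].isna().any():
        problems.append(f"missing values in {label_column}")
    return problems


# Made-up frames: one clean, one with a repeated id and a missing label.
df_ok = pd.DataFrame({"id": [0, 1], "question": ["q1", "q2"], "answer": ["1", "2"]})
df_bad = pd.DataFrame({"id": [0, 0], "question": ["q1", "q2"], "answer": ["1", None]})

problems_ok = check_dataset(df_ok, "id", "question", "answer")
problems_bad = check_dataset(df_bad, "id", "question", "answer")
print(problems_ok, problems_bad)
```

In this example you would call it as `check_dataset(df_input, "id", "question", "answer")` and upload only if the returned list is empty.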
Uploading System Outputs
Finally, we're going to collect all the model outputs and upload each of the models' results as systems of our Zeno project.
for model in data[0]["full"].keys():
    df_system = pd.DataFrame(
        {
            "output": [
                f"**Full:** {d['full'][model]}\n\n**Simplified**: {d['simplified'][model]}"
                for d in data
            ],
            "output length": [len(d["full"][model]) for d in data],
            "simplified output": [str(d["simplified"][model]) for d in data],
        }
    )
    # A prediction counts as correct when the simplified output matches the label exactly.
    df_system["correct"] = df_input["answer"] == df_system["simplified output"]
    df_system["id"] = df_input["id"]
    project.upload_system(
        df_system, name=model.replace("/", "-"), id_column="id", output_column="output"
    )
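The "correct" column is an exact string match between the label and the simplified model output, which is why both sides are cast to `str` first. The sketch below reproduces that scoring in plain Python on made-up answers and model names, so you can preview per-system accuracy without uploading anything.

```python
# Hypothetical labels (already strings, like df_input["answer"]).
answers = ["2", "1", "3"]

# Hypothetical simplified outputs per model, stored as integers here.
simplified = {
    "org/model-a": [2, 1, 4],
    "org/model-b": [2, 1, 3],
}

accuracy_by_system = {}
for model, outputs in simplified.items():
    # Cast each output to str before comparing; "2" == 2 would be False.
    correct = [answer == str(output) for answer, output in zip(answers, outputs)]
    # Same name cleanup as in the upload loop: "/" becomes "-".
    accuracy_by_system[model.replace("/", "-")] = sum(correct) / len(correct)

print(accuracy_by_system)
```

Because the match is exact, differences such as stray whitespace in a simplified output would count as wrong; whether that matters depends on how the outputs were produced.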