TaxEval

Ever wondered if an LLM can do your taxes? In this example, we'll show you how to investigate the results of the TaxEval benchmark. We'll walk through uploading the results to Zeno; if you want to run the benchmark on your own models, please refer to the original repository.

Dependencies

Let's start by installing the required dependencies for this project:

pip install pandas zeno-client python-dotenv bertopic

Imports

With the dependencies installed, we can start running our analysis code and uploading data to Zeno. First, we import the libraries we're going to use:

import pandas as pd
import json
import os
from dotenv import load_dotenv

from zeno_client import ZenoClient, ZenoMetric

Loading the Data

We'll be uploading model outputs that have already been processed. You can get these here; alternatively, you can run your own models following this README.

with open("tax-benchmark.json") as f:
    data = json.load(f)
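
The rest of this example reads a handful of fields from each record in `data`. As a rough sketch of the shape it expects, the record below uses the field names accessed in this example; every value, and the model name `example-org/example-model`, is made up for illustration and does not come from the benchmark file.

```python
import json

# Hypothetical record: the keys mirror those accessed in this example,
# but all values (and the model name) are invented for illustration.
example_record = {
    "source_question": {
        "description": "Which form is used for X?\\nPick one option.",
        "options": ["Form A", "Form B"],
        "correct_answer": 2,
        "reference": "made-up reference",
        "tag": "made-up tag",
        "category": "made-up category",
    },
    "full": {"example-org/example-model": "The answer is Form B because ..."},
    "simplified": {"example-org/example-model": 2},
}

# The benchmark file is assumed to be a JSON list of such records.
data_example = json.loads(json.dumps([example_record]))

# Check that the fields used later in this example are present.
required = {"description", "options", "correct_answer", "reference", "tag", "category"}
missing = required - set(data_example[0]["source_question"].keys())
print("missing fields:", missing or "none")
```

A check like this on the first real record can catch a mismatched file before any formatting code runs.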

Formatting the Data

Before we can upload the data to Zeno, we need to convert it to a pandas DataFrame and format some of the fields so that the Zeno project contains all the information we're going to need.

def format_question(entry):
    # Turn escaped "\n" sequences in the description into real newlines.
    question = entry["source_question"]["description"].replace("\\n", "\n")
    question += "\n\n"
    # Append the answer options as a 1-based numbered list.
    for i, option in enumerate(entry["source_question"]["options"]):
        question += f"{i + 1}. {option}\n"
    return question


df_input = pd.DataFrame(
    {
        "question": [format_question(d) for d in data],
        "answer": [str(d["source_question"]["correct_answer"]) for d in data],
        "reference": [d["source_question"]["reference"] for d in data],
        "tag": [d["source_question"]["tag"] for d in data],
        "category": [d["source_question"]["category"] for d in data],
    }
)
df_input["question length"] = df_input["question"].str.len()
df_input["id"] = df_input.index
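
To see what the `question` column ends up containing, here is a minimal sketch that applies the same formatting logic to a single hypothetical record. The function is restated so the block runs on its own, and the record contents are invented for illustration.

```python
def format_question(entry):
    # Same logic as above: un-escape newlines, then append numbered options.
    text = entry["source_question"]["description"].replace("\\n", "\n")
    text += "\n\n"
    for i, option in enumerate(entry["source_question"]["options"]):
        text += f"{i + 1}. {option}\n"
    return text


# Hypothetical record, not taken from the benchmark.
example = {
    "source_question": {
        "description": "First line.\\nSecond line.",
        "options": ["Yes", "No"],
    }
}

formatted = format_question(example)
print(formatted)
```

The description's escaped line break becomes a real one, and the options follow as a numbered list starting at 1, which renders cleanly in Zeno's markdown view.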

Optional: Generate Topics

If you want, you can also add topic information to the data using BERTopic as follows:

from bertopic import BERTopic

topic_model = BERTopic(language="english", min_topic_size=3)
topics, probs = topic_model.fit_transform(
    [d["source_question"]["description"] for d in data]
)
df_input["topic"] = topics
df_input["topic"] = df_input["topic"].astype(str)
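
BERTopic assigns the topic `-1` to documents it treats as outliers, so it can be worth checking how many questions ended up there before uploading. A minimal sketch, using a hypothetical `topics_example` list in place of the real `fit_transform` output:

```python
from collections import Counter

# Hypothetical topic assignments standing in for the `topics` list above.
topics_example = [0, 1, -1, 0, 2, -1, 0]

# Cast to strings, mirroring what is done for the "topic" column.
topic_labels = [str(t) for t in topics_example]

# Count questions per topic and compute the share of outliers ("-1").
topic_counts = Counter(topic_labels)
outlier_share = topic_counts["-1"] / len(topic_labels)

print(topic_counts.most_common())
print(f"outliers: {outlier_share:.0%}")
```

If a large share of questions falls into `-1`, adjusting `min_topic_size` is one option to try.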

Create a Zeno Project

We can now upload our data to a Zeno project. You will need your ZENO_API_KEY here, which you can generate by clicking on your profile at the top right to navigate to your account page.

Once you have your API key, you can authenticate with the Zeno client and create a project as follows:

# Read ZENO_API_KEY from the environment (or a local .env file).
load_dotenv()
client = ZenoClient(os.environ["ZENO_API_KEY"])

project = client.create_project(
    name="LLM Taxes Benchmark",
    view={
        "data": {"type": "markdown"},
        "label": {"type": "text"},
        "output": {"type": "markdown"},
    },
    description="Tax questions for LLMs",
    public=True,
    metrics=[
        ZenoMetric(name="accuracy", type="mean", columns=["correct"]),
        ZenoMetric(name="output length", type="mean", columns=["output length"]),
    ],
)
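
Both metrics use the `mean` type over a column that each system upload provides. As a rough illustration of what that implies (plain Python with made-up values, not Zeno's actual implementation), the mean over a boolean `correct` column is simply the fraction of correct answers:

```python
# Hypothetical per-question values for one system.
correct = [True, False, True, True]
output_length = [120, 80, 100, 100]


def mean(values):
    # Booleans count as 1 (True) and 0 (False) when summed.
    return sum(values) / len(values)


accuracy = mean(correct)
avg_output_length = mean(output_length)
print(accuracy, avg_output_length)
```

This is why the system DataFrames uploaded later need both a `correct` and an `output length` column.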

Uploading Data

Now that the project is set up, we can upload the DataFrame we formatted earlier.

project.upload_dataset(
    df_input, id_column="id", data_column="question", label_column="answer"
)

Uploading System Outputs

Finally, we're going to collect all the model outputs and upload each of the models' results as systems of our Zeno project.

for model in data[0]["full"].keys():
    df_system = pd.DataFrame(
        {
            "output": [
                f"**Full:** {d['full'][model]}\n\n**Simplified**: {d['simplified'][model]}"
                for d in data
            ],
            "output length": [len(d["full"][model]) for d in data],
            "simplified output": [str(d["simplified"][model]) for d in data],
        }
    )
    df_system["correct"] = df_input["answer"] == df_system["simplified output"]
    df_system["id"] = df_input["id"]
    project.upload_system(
        df_system, name=model.replace("/", "-"), id_column="id", output_column="output"
    )
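
To make the loop above easier to follow, here is a minimal sketch of what it derives for each model, written in plain Python with two made-up records and a hypothetical model name. It mirrors the same steps: sanitize the model name, compare the simplified output with the label as strings, and record the length of the full output.

```python
# Hypothetical records and model name, invented for illustration.
records = [
    {
        "answer": "2",
        "full": {"example-org/model-a": "Option 2, because ..."},
        "simplified": {"example-org/model-a": 2},
    },
    {
        "answer": "1",
        "full": {"example-org/model-a": "I think it is 3."},
        "simplified": {"example-org/model-a": 3},
    },
]

systems = {}
for model in records[0]["full"].keys():
    rows = []
    for r in records:
        # Cast to str so that the integer 2 matches the label "2".
        simplified = str(r["simplified"][model])
        rows.append(
            {
                "output length": len(r["full"][model]),
                "correct": r["answer"] == simplified,
            }
        )
    # Same name sanitization as in the upload loop above.
    systems[model.replace("/", "-")] = rows

print(systems)
```

Casting both sides to strings matters here: comparing the integer `2` with the string `"2"` directly would mark every answer as incorrect.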