Visualize Open LLM Leaderboard Outputs

The Open LLM Leaderboard is the go-to place for comparing large language models on different benchmark tasks. However, the leaderboard only surfaces average metric scores, which hides the details of how these models actually behave.

Fortunately, the 🤗 team releases the raw outputs of each run, and we can use Zeno to visualize the results and uncover interesting findings!

For example, we worked with the 🤗 team to debug the benchmarks used in the leaderboard using Zeno. Check out our blog post!

Dependencies

We'll use the 🤗 datasets Python library to download the model outputs from the leaderboard:

pip install datasets numpy pandas zeno-client

from zeno_client import ZenoClient, ZenoMetric
import datasets
import numpy as np
import pandas as pd
import os

Tasks

The leaderboard features six different benchmarks: ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8k. Models that are included in the leaderboard are evaluated on all of these tasks. To form a leaderboard, an average score is calculated for each of the six tasks before these scores are averaged again to get an overall score.
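As a rough illustration of this two-step averaging, the sketch below first averages within each task and then averages the per-task results. The scores are made-up placeholders rather than real leaderboard numbers, and the helper name overall_score is hypothetical.

```python
from statistics import mean

# Hypothetical scores for one model; NOT real leaderboard results.
# A task can consist of several sub-tasks, so each task maps to a list.
task_scores = {
    "ARC": [0.60],
    "HellaSwag": [0.80],
    "MMLU": [0.50, 0.70],
}

def overall_score(scores_by_task):
    """Average within each task first, then average the per-task scores."""
    per_task = {task: mean(values) for task, values in scores_by_task.items()}
    return per_task, mean(list(per_task.values()))

per_task, overall = overall_score(task_scores)
print(per_task, round(overall, 3))
```

Averaging per task first means a task with many sub-tasks does not outweigh a task with a single score.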

To get started, let's set up a list of models we want to download outputs for:

models = [
  "meta-llama/Llama-2-70b-hf", "mistralai/Mistral-7B-v0.1", "tiiuae/falcon-40b"
]

Feel free to change the list to any models available on the leaderboard. A few of them might not have associated data. You can check this by clicking on the little icon next to the model name: if you get a 404 after clicking, we won't be able to fetch the model data.

Fetching the data differs slightly between benchmarks. Let's look at how to fetch data for the ARC and HellaSwag benchmarks for now. Data for the other benchmarks can be loaded in a similar manner.

ARC

For ARC, we can get the data by loading a single dataset file from HuggingFace:

def get_arc_data(model: str):
    data_path = "details_" + model.replace("/", "__")
    return datasets.load_dataset(
        "open-llm-leaderboard/" + data_path,
        "harness_arc_challenge_25",
    )
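To make the naming convention in get_arc_data explicit, here is a minimal sketch of how a model name maps to its details repository. The helper name details_repo_id is hypothetical; the string manipulation is the same as in the function above.

```python
def details_repo_id(model: str) -> str:
    """Build the details dataset name for a leaderboard model (hypothetical helper).

    Mirrors get_arc_data: the "/" in the model name becomes "__" and the
    result is prefixed with "details_".
    """
    return "open-llm-leaderboard/details_" + model.replace("/", "__")

for name in ["meta-llama/Llama-2-70b-hf", "mistralai/Mistral-7B-v0.1"]:
    print(details_repo_id(name))
```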

We'll then also want to parse the data into a Pandas DataFrame for the raw dataset and each system as follows:

labels = ["A", "B", "C", "D", "E"]

def generate_dataset(df):
    # Copy the selected columns so the assignments below don't write to a view of df.
    df_lim = df[["example", "choices", "gold"]].copy()
    # Displayed text: the question followed by one labeled line per answer choice.
    df_lim["data"] = df_lim.apply(
      lambda row: "\n" + row["example"] + "\n" + "\n".join(
        f"{labels[i]}: {choice}" for i, choice in enumerate(row["choices"])
      ),
      axis=1
    )
    df_lim["num_choices"] = df_lim.apply(lambda row: str(len(row["choices"])), axis=1)
    # Map the index of the gold answer to its letter.
    df_lim["label"] = df_lim.apply(lambda row: labels[row["gold"]], axis=1)
    df_lim = df_lim.drop(columns=["example", "choices", "gold"])
    df_lim["id"] = df_lim.index
    return df_lim

def generate_system(df):
    # Copy the selected columns so the assignments below don't write to a view of df.
    df_system = df[["predictions", "acc_norm", "choices", "acc"]].copy()
    # Raw answer: the choice with the highest prediction score.
    df_system["answer_raw"] = df_system.apply(lambda x: labels[np.argmax(x['predictions'])], axis=1)
    # Normalized answer: each score is divided by the character length of its choice.
    df_system["answer_norm"] = df_system.apply(
      lambda x: labels[np.argmax(x['predictions']/np.array([float(len(i)) for i in x['choices']]))],
      axis=1
    )
    # Output shown in Zeno: the normalized answer plus the raw and normalized scores.
    df_system["predictions"] = df_system.apply(
      lambda x: x['answer_norm'] + "\n\n" + "Raw Pred.: " + ", ".join(
        map(lambda y: str(round(y, 2)), x['predictions'])
      ) + "\nNorm Pred.: " + ", ".join(
        map(lambda y: str(round(y, 2)), x['predictions']/np.array([float(len(i)) for i in x['choices']]))
      ),
      axis=1
    )
    df_system["correct"] = df_system["acc_norm"] > 0
    df_system["correct_raw"] = df_system["acc"] > 0
    df_system = df_system.drop(columns=["acc_norm", "choices", "acc"])
    df_system["id"] = df_system.index
    return df_system

This will convert the raw dataframe we get when loading the dataset into an enriched, analyzable dataframe that we can use going forward. We're then going to call these functions to get the raw data and system outputs for the ARC benchmark as follows:

arc_df = generate_dataset(get_arc_data(models[0])['latest'].to_pandas())
num_rows = len(arc_df)
arc_system_dfs = []
for model in models:
    dataset = get_arc_data(model)['latest'].to_pandas()
    if len(dataset) != num_rows:
        print("Skipping {} because it has {} rows instead of {}".format(model, len(dataset), num_rows))
        continue
    arc_system_dfs.append({"model": model, "df": generate_system(dataset)})

To make sure we're working with the right data, we check whether the number of rows for each system matches the number of rows in the dataset.
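The difference between answer_raw and answer_norm in generate_system is easiest to see on a toy example. The sketch below assumes the predictions are log-likelihood-style scores where higher is better; the choices and scores are invented for illustration, and the two helper names are hypothetical.

```python
import numpy as np

labels = ["A", "B", "C", "D", "E"]

# Invented example: a short choice and a much longer one.
choices = ["a cat", "an extremely long and detailed answer option"]
predictions = np.array([-6.0, -12.0])

def raw_answer(predictions):
    """Pick the choice with the highest unnormalized score."""
    return labels[int(np.argmax(predictions))]

def normalized_answer(predictions, choices):
    """Divide each score by the character length of its choice, then pick the best."""
    lengths = np.array([float(len(c)) for c in choices])
    return labels[int(np.argmax(predictions / lengths))]

print(raw_answer(predictions), normalized_answer(predictions, choices))
```

Here the raw score favors the short choice, while the length-normalized score favors the long one. This is why the correct and correct_raw columns can disagree for the same row.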

HellaSwag

For HellaSwag, we can reuse most of the functionality we've created for ARC. We'll need to fetch a different file to get the data:

def get_hellaswag_data(model: str):
    data_path = "details_" + model.replace("/", "__")
    return datasets.load_dataset(
        "open-llm-leaderboard/" + data_path,
        "harness_hellaswag_10",
    )

However, we can reuse the data transformation functions. We just need to create the appropriate DataFrames for HellaSwag with them:

hellaswag_df = generate_dataset(get_hellaswag_data(models[0])['latest'].to_pandas())
num_rows = len(hellaswag_df)
hellaswag_system_dfs = []
for model in models:
    dataset = get_hellaswag_data(model)['latest'].to_pandas()
    if len(dataset) != num_rows:
        print("Skipping {} because it has {} rows instead of {}".format(model, len(dataset), num_rows))
        continue
    hellaswag_system_dfs.append({"model": model, "df": generate_system(dataset)})

Visualization

To visualize the data we've just downloaded and extracted, we're going to use Zeno. You will need your ZENO_API_KEY for this. If you don't have one yet, generate your API key by clicking on your profile at the top right to navigate to your account page.
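One way to avoid hard-coding the key is to read it from an environment variable. This is a minimal sketch, assuming you have exported the key as ZENO_API_KEY yourself; the helper name get_zeno_api_key is hypothetical and not part of zeno-client.

```python
import os

def get_zeno_api_key(env_var: str = "ZENO_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    Raises a RuntimeError with a readable message if the variable is unset or empty.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your Zeno API key.")
    return key
```

The returned string can then be passed to ZenoClient in place of a literal key.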

We will then create one Zeno project for each benchmark as follows:

client = ZenoClient(YOUR_API_KEY)

# ARC
proj = client.create_project(
    name="ARC",
    view="text-classification",
    description="ARC (https://arxiv.org/abs/1803.05457) task in the Open-LLM-Leaderboard.",
    metrics=[
        ZenoMetric(name="accuracy", type="mean", columns=["correct"])
    ]
)
proj.upload_dataset(arc_df, id_column="id", label_column="label", data_column="data")
for arc_system_df in arc_system_dfs:
  # Upload each model's outputs as a separate system in the project.
  proj.upload_system(
    arc_system_df['df'], name=arc_system_df['model'].replace('/', "__"),
    output_column="predictions", id_column="id"
  )

# HellaSwag
proj = client.create_project(
    name="HellaSwag",
    view="text-classification",
    description="HellaSwag (https://arxiv.org/abs/1905.07830) task in the Open-LLM-Leaderboard.",
    metrics=[
        ZenoMetric(name="accuracy", type="mean", columns=["correct"])
    ]
)
proj.upload_dataset(hellaswag_df, id_column="id", label_column="label", data_column="data")
for hellaswag_system_df in hellaswag_system_dfs:
  # Upload each model's outputs as a separate system in the project.
  proj.upload_system(
    hellaswag_system_df['df'], name=hellaswag_system_df['model'].replace('/', "__"),
    output_column="predictions", id_column="id"
  )
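Before uploading, it can help to sanity-check the accuracy locally. The sketch below assumes that a "mean" metric over the correct column amounts to the fraction of rows marked correct. It uses a toy DataFrame in place of the output of generate_system, and the helper name accuracy is hypothetical.

```python
import pandas as pd

# Toy stand-in for a DataFrame returned by generate_system; values are invented.
toy_system_df = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "correct": [True, False, True, True],
    "correct_raw": [True, False, False, True],
})

def accuracy(df: pd.DataFrame, column: str = "correct") -> float:
    """Fraction of rows where the boolean column is True."""
    return float(df[column].mean())

print(accuracy(toy_system_df), accuracy(toy_system_df, "correct_raw"))
```

Comparing the two columns gives a quick sense of how much length normalization changes the score for a model.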

Other Tasks

This tutorial is only meant to demonstrate how you can get the raw system outputs for the different tasks in the Open LLM Leaderboard. We've done this by explaining how to download the data for two benchmarks, namely ARC and HellaSwag. If you are interested in doing the same for the remaining tasks, check out the following:

Benchmark	Links
MMLU
TruthfulQA
Winogrande
GSM8k

Conclusion

As we've seen, it is relatively easy to get the raw outputs behind the Open LLM Leaderboard. With these raw outputs, we can analyze model behavior in much more detail than by looking only at the average benchmark scores. Zeno can be very helpful for this kind of in-depth analysis and comparison of different models.
