Zeno's Notes on AI Evaluation | January 2024

4 min read
Alex Cabrera
PhD Candidate @ CMU
Alex Bäuerle
Researcher @ CMU

Welcome to the first edition of the Zeno's Notes newsletter! Each month, we'll discuss the community's work around Zeno, interesting research and projects on AI evaluation, and new Zeno features.

Before we dive in, we wanted to look back at the few months since we launched Zeno Hub. Our users have created over 800 projects and 1,400 slices to evaluate more than 10,000 AI systems! These analyses have been used to author over 160 reports communicating interesting findings. It's exciting to see how Zeno is being used to make AI evaluation more accessible and transparent.

🌎 Community

Highlighting work from the Zeno community.

An In-Depth Look at Gemini's Language Abilities

Researchers at CMU, including the Zeno team, conducted a deep dive into Gemini's language abilities. They compared Gemini Pro, Google's newly released LLM, with GPT-3.5-Turbo, GPT-4-Turbo, and Mixtral. Overall, they found that Gemini Pro approaches but lags behind GPT-3.5-Turbo across all English tasks, while performing better on translation into the languages it supports. For more detailed results, read the paper or explore the code on GitHub. Each section of the paper is linked to a Zeno report for further exploration!

Hugging Face is Dropping DROP

The Hugging Face Open LLM Leaderboard is a popular destination for comparing new LLMs. Hugging Face recently added three new benchmarks to the leaderboard to better represent real-world performance. After community feedback pointed to significant fluctuations in the scores for one of them, DROP, they used Zeno to uncover the source of the variance and decided to remove DROP from the leaderboard until a revised version of the benchmark is developed.

📰 Evaluation News

Interesting news from the world of AI evaluation.

CRUXEval

Researchers from MIT and Meta AI Research released a new evaluation dataset for code reasoning, understanding, and execution. Instead of having models generate code, CRUXEval asks models to predict either the output of a given function for a given input, or an input that produces a given output. The dataset, which includes 800 Python functions, complements classic code generation benchmarks such as HumanEval and MBPP. The authors compared multiple open- and closed-source models on the new benchmark and found that there is quite a bit of room for improvement.
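To make the task format concrete, here is a hypothetical item in the spirit of CRUXEval (illustrative only, not drawn from the dataset). The two task variants ask a model to predict an output from an input, or an input from an output:

```python
# Hypothetical CRUXEval-style function (illustrative only, not from the dataset).
def f(s):
    return s + s[::-1]

# Output prediction: the model sees f and the input "ab" and must
# predict the result of running it.
output = f("ab")  # -> "abba"

# Input prediction: the model sees f and the output "abba" and must
# propose any input that produces it.
candidate_input = "ab"
assert f(candidate_input) == "abba"
```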

CommonGen Leaderboard

CommonGen is a challenging benchmark that asks models to generate coherent sentences describing everyday scenarios. The researchers behind the benchmark, from USC, Allen AI, and UW, recently updated their eval repository with a new leaderboard for the task, showing that state-of-the-art models, including GPT-4, still perform significantly worse than humans. The authors argue that the task is so hard because it requires relational reasoning with background commonsense knowledge, and models need to generalize to unseen concept combinations.
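As a concrete illustration of the task format (a made-up item, not an official dataset entry), a model is given a small set of concepts and must produce one coherent sentence that uses all of them:

```python
# Made-up CommonGen-style item (format illustration only).
concepts = {"dog", "frisbee", "catch", "throw"}
generation = "A dog leaps into the air to catch a frisbee that its owner throws."

# Rough coverage check: every concept (or an inflected form of it)
# should appear in the generated sentence.
tokens = generation.lower().split()
assert all(any(tok.startswith(c) for tok in tokens) for c in concepts)
```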

✨ New in Zeno

Updates to Zeno that you'll love.

Integrations

We've been focusing on making it even easier for you to analyze your evaluation results in Zeno by integrating it with other AI evaluation frameworks. You can now directly upload your model outputs if you're using the EleutherAI LM Evaluation Harness or the Ragas framework.

  • Ragas is a library for model-graded evaluation of RAG applications. We've added a detailed tutorial on how to use Zeno to investigate your Ragas evaluation results. You can view an example of this in Zeno here.

  • The EleutherAI LM Evaluation Harness is a popular library for running LLM benchmarks. We wrote a script that lets you directly upload all of your evaluation data as a Zeno project, enabling you to compare different models across the various benchmarks the harness provides (the general upload pattern is sketched after this list). To start visualizing your LM Evaluation Harness data in Zeno, follow these instructions or take a look at our example notebook. We've already used this integration for some of our own projects, such as this evaluation of the Mamba architecture!
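For a sense of what these integrations do under the hood, here is a minimal sketch of uploading evaluation results to Zeno with the zeno-client Python package. The project name, column names, and the accuracy metric are illustrative assumptions, not necessarily what the integration scripts produce:

```python
import pandas as pd
from zeno_client import ZenoClient, ZenoMetric

# Authenticate with a Zeno Hub API key (placeholder value).
client = ZenoClient("YOUR_ZENO_API_KEY")

# Create a project. "correct" is an assumed per-row column that gets
# averaged into an accuracy metric.
project = client.create_project(
    name="eval-harness-demo",
    view="text-classification",
    metrics=[ZenoMetric(name="accuracy", type="mean", columns=["correct"])],
)

# Upload the dataset: one row per benchmark instance.
dataset = pd.DataFrame(
    {
        "id": ["0", "1"],
        "question": ["2 + 2 = ?", "What is the capital of France?"],
        "label": ["4", "Paris"],
    }
)
project.upload_dataset(dataset, id_column="id", data_column="question", label_column="label")

# Upload one system's outputs; repeat per model to compare systems side by side.
outputs = pd.DataFrame(
    {
        "id": ["0", "1"],
        "output": ["4", "Paris"],
        "correct": [1, 1],
    }
)
project.upload_system(outputs, name="my-model", id_column="id", output_column="output")
```

Once both uploads finish, the project appears on Zeno Hub, where you can slice the data and compare systems on the metrics you defined.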

Documentation

We've also been working on improving our documentation to make it easier for you to get started with Zeno. This includes use cases, tutorials, and integration guides. If you have any suggestions for what you'd like to see in our documentation, please let us know!


If you have questions about Zeno or anything we've highlighted in this newsletter, have ideas for new Zeno features or content for a future issue of Zeno's Notes, or simply want to say hi, get in touch via email or join our Discord.