Replicating Academic Benchmarks with Airtrain: MMLU

We show how Airtrain can easily replicate the MMLU benchmark results for the Llama 2 family of models.

Academic benchmarks are the community's most useful tool for ranking models on leaderboards. For example, the HuggingFace LLM leaderboard takes the average result from six popular benchmarks: HellaSwag, MMLU, ARC, TruthfulQA, Winogrande, and GSM8K.

Benchmarks are carefully curated datasets of prompts targeting specific domain knowledge areas or tasks.

For example:

  • MMLU tests general knowledge with multiple choice questions across a wide variety of topics ranging from high school chemistry to international law and moral ethics.
  • ARC is a dataset of multiple choice questions extracted from grade 3 to 9 exams.
  • HellaSwag is a challenge dataset for evaluating commonsense NLI that is especially hard for state-of-the-art models, though its questions are trivial for humans (>95% accuracy).

In this article, we will demonstrate how to replicate MMLU benchmark results with Airtrain for the Llama 2 family of models.

The MMLU dataset

MMLU (Massive Multitask Language Understanding) is a benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem-solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects make the benchmark ideal for identifying a model's blind spots.

Preparing the dataset

We can download the MMLU test dataset in CSV format from HuggingFace here.

The dataset is broken down into individual files per topic. We will collate all topics into a single file and we will convert it to JSONL format for greater robustness.

You can download the final JSONL file here, or do the conversion yourself with the below code snippet.
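One way to do the conversion is sketched below in Python. It assumes the per-topic files follow the original MMLU layout: header-less CSV files named like `high_school_chemistry_test.csv`, with six columns (question, four answer choices, correct letter). The function names are placeholders chosen for illustration, so adjust them and the assumptions if your download is laid out differently.

```python
import csv
import json
from pathlib import Path


def topic_from_filename(path):
    """Derive a topic such as 'high school chemistry' from the file name.

    Assumes names like 'high_school_chemistry_test.csv'.
    """
    stem = Path(path).stem
    if stem.endswith("_test"):
        stem = stem[: -len("_test")]
    return stem.replace("_", " ")


def rows_to_examples(topic, rows):
    """Yield one dict per row, using the schema described in this article.

    Assumes each row is (question, A, B, C, D, correct letter).
    """
    for row in rows:
        if len(row) != 6:
            continue  # skip blank or malformed rows
        question, a, b, c, d, answer = row
        yield {
            "topic": topic,
            "question": question,
            "answer_a": a,
            "answer_b": b,
            "answer_c": c,
            "answer_d": d,
            "correct_answer": answer.strip().upper(),
        }


def convert(csv_dir, out_path):
    """Collate every CSV file in csv_dir into one JSONL file; return the row count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(csv_dir).glob("*.csv")):
            with open(path, newline="", encoding="utf-8") as f:
                topic = topic_from_filename(path)
                for example in rows_to_examples(topic, csv.reader(f)):
                    # One JSON object per line is what makes the file JSONL.
                    out.write(json.dumps(example, ensure_ascii=False) + "\n")
                    count += 1
    return count
```

Calling `convert("mmlu_csv", "mmlu.jsonl")`, where `mmlu_csv` is a hypothetical directory holding the downloaded per-topic files, would write one JSON object per line to `mmlu.jsonl`.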

The final schema for each example will be as follows:

{
  "topic": "high school chemistry",
  "question": "Chlorine gas reacts most readily with:",
  "answer_a": "toluene",
  "answer_b": "ethylene",
  "answer_c": "ethanoic acid",
  "answer_d": "ethane",
  "correct_answer": "B"
}

Uploading the file

In the top menu bar, click "New job".

Then select "JSONL file upload" in the Source type dropdown. Click "Choose file" and find your mmlu.jsonl file.

Upload mmlu.jsonl to Airtrain

Configure the models

In the central panel, click the + button next to the model you want to configure.

Name your configuration, for example "Llama 2 7B". Select the 7B variant, set the temperature to 0.1, and paste the following prompt:

Here is a question on the topic of {{topic}}.

Question: {{question}}

Which of the following answers is correct?

A. {{answer_a}}
B. {{answer_b}}
C. {{answer_c}}
D. {{answer_d}}

State the letter corresponding to the correct answer.
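To preview what the model will actually see, the sketch below fills the `{{...}}` placeholders locally for one example. Airtrain performs this interpolation itself; the `render` helper here is a hypothetical stand-in written for illustration, not Airtrain's implementation.

```python
import re

PROMPT_TEMPLATE = """Here is a question on the topic of {{topic}}.

Question: {{question}}

Which of the following answers is correct?

A. {{answer_a}}
B. {{answer_b}}
C. {{answer_c}}
D. {{answer_d}}

State the letter corresponding to the correct answer."""

# Matches {{field}} and captures the field name.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, row):
    """Replace each {{field}} with the matching value from row.

    Raises KeyError if the template names a field the row lacks.
    """
    return PLACEHOLDER.sub(lambda m: str(row[m.group(1)]), template)


example = {
    "topic": "high school chemistry",
    "question": "Chlorine gas reacts most readily with:",
    "answer_a": "toluene",
    "answer_b": "ethylene",
    "answer_c": "ethanoic acid",
    "answer_d": "ethane",
    "correct_answer": "B",
}

print(render(PROMPT_TEMPLATE, example))
```

Rendering a few rows this way is a quick check that every placeholder in the prompt matches a field name in the JSONL file.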

Then configure as many other models and variants as you want, for example Llama 2 13B and 70B.

Evaluation metrics


Model performance on the MMLU benchmark is measured as a pass rate: what fraction of questions are answered correctly by the model?

To replicate this with Airtrain, we will create a Correctness property with the following description:

This score describes whether the chatbot selected the correct answer. The correct answer is {{correct_answer}}.

Here is a scoring rubric to use:
1. The chatbot's answer is not {{correct_answer}}, therefore the chatbot is incorrect.
5. The chatbot's answer is {{correct_answer}}, therefore the chatbot is correct.

Airtrain's scoring model grades inferences on a Likert scale of 1 to 5. In this case, we want to measure a binary pass/fail rate, so we will use only two scores, e.g. 1 (fail) and 5 (pass) as shown above.

We can interpolate the property description with the correct answer that is provided in the input dataset.
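With only the scores 1 and 5 in use, the pass rate is the share of inferences that received a 5. The sketch below shows that arithmetic on a plain list of scores. The list is made up for illustration and is not real Airtrain output, and `pass_rate` is a hypothetical helper name.

```python
def pass_rate(scores, pass_score=5):
    """Return the fraction of scores equal to pass_score."""
    if not scores:
        raise ValueError("no scores to aggregate")
    return sum(1 for s in scores if s == pass_score) / len(scores)


# Made-up scores for illustration only: three passes out of four.
print(pass_rate([5, 1, 5, 5]))  # 0.75
```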


Out of curiosity, we also activate the unsupervised Length metric to get a sense of which variant is more verbose.

Evaluation results

View the public results page here.


On this plot we can measure the following pass rates (score of 5) and compare them with official MMLU benchmark results listed here.

Comparing Airtrain results with official results

We can see that Airtrain's scoring model comes close to the official MMLU benchmark results.

As expected, we also note that higher correctness correlates with larger model size.


On this plot we can note that the 7B variant is more verbose than the 13B and 70B variants. 13B is the most concise variant.


In this article, we showed that replicating the MMLU academic benchmark with Airtrain takes only a few steps. Airtrain makes it simple to evaluate LLMs across large eval datasets and for arbitrary properties, including academic benchmarks.

Sign up for early access to Airtrain's free batch evaluation tool.
