MT-Bench: Comparing different LLM Judges

By default, MT-Bench uses OpenAI as the service provider with the gpt-4 model ID, which is the vanilla GPT-4 model with an 8K context window introduced back in spring 2023. However, it is possible to override the model ID via the --judge-model argument.

As of June 2024, GPT-4 series models have the following pricing (per million tokens):

Model                         Prompt    Completion
GPT-4o                        $5.00     $15.00
GPT-4-Turbo (0125-preview)    $10.00    $30.00
GPT-4 8K (0613)               $30.00    $60.00

By running MT-Bench with GPT-4 Turbo or GPT-4o (Omni) as the judge, one can potentially cut the API cost of an evaluation by up to 6x (prompt tokens are 6 times cheaper with GPT-4o, completions 4 times cheaper). But how will the score change? Let's find out 🙂

Costs

I used Phi-3 Medium with an 8K context, quantized to 8 bits (running the inference server via LM Studio). I executed answer generation 4 times. Then, for each of the four answer sets, I ran one judgment generation with each of the three judge models.

OpenAI API consumption cost per eval*:

Judge model                   Cost per eval
GPT-4o                        $0.93
GPT-4-Turbo (0125-preview)    $1.85
GPT-4 8K (0613)               $5.10

*I only collected the total token count for gpt-4-0613 (621,805 tokens across its four judgment runs). For the calculation, I assumed that each judge model had a similar consumption of roughly 580k prompt and 60k completion tokens in total.
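
As a sanity check, here is a small Python sketch (my own, not part of MT-Bench) that estimates the per-eval cost from these token counts and the June 2024 prices. It assumes the ~580k/60k tokens are the totals across the four judgment runs and lands within a few percent of the figures above (which were computed from the exact measured counts):

# Rough per-eval cost estimate from the token counts above.
# Assumption (mine): the ~580k prompt / ~60k completion tokens are the
# totals across the four judgment runs, so one eval uses about a quarter.

PRICES_PER_1M = {  # USD per 1M tokens, June 2024
    "GPT-4o": (5.00, 15.00),
    "GPT-4-Turbo (0125-preview)": (10.00, 30.00),
    "GPT-4 8K (0613)": (30.00, 60.00),
}

RUNS = 4
PROMPT_TOKENS_TOTAL = 580_000
COMPLETION_TOKENS_TOTAL = 60_000

for judge, (prompt_price, completion_price) in PRICES_PER_1M.items():
    cost = (PROMPT_TOKENS_TOTAL * prompt_price
            + COMPLETION_TOKENS_TOTAL * completion_price) / 1_000_000 / RUNS
    print(f"{judge}: ~${cost:.2f} per eval")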

Reviewing the Scores

The findings below cannot be generalized, as they are based on a small sample of results for just one target model (Phi-3). Still…

For each of the LLM judges, I calculated the mean (over the 4 runs) and the standard deviation relative to the mean. As you can see:

GPT-4o (Omni) tends to inflate the score by about 12% relative to the GPT-4 8K baseline
All judges are quite consistent, with just a 1-3% relative deviation in scores across runs
The vanilla GPT-4 shows the most consistency across turns

Mean                          1st Turn    2nd Turn    Avg
GPT-4o                        9.13125     8.2814875   8.70720325
GPT-4-Turbo (0125-preview)    8.290625    7.5270175   7.90932575
GPT-4 8K (0613)               8.41875     7.04375     7.73125

StDev / Mean                  1st Turn    2nd Turn    Avg
GPT-4o                        0.00230424  0.0262376   0.01302793
GPT-4-Turbo (0125-preview)    0.00620126  0.02336659  0.01396082
GPT-4 8K (0613)               0.01178508  0.01858418  0.01152749

GPT-4 Turbo is the closest to the GPT-4 8K baseline, while the 2nd turn sees the most deviation between the judges:

% of GPT-4 8K                 1st Turn
GPT-4o                        108.5%
GPT-4-Turbo (0125-preview)    98.5%
GPT-4 8K (0613)               100.0%

Both Omni and Turbo see the least drop in 2nd turn scores:

2nd Turn Drop
GPT-4o                        9.31%
GPT-4-Turbo (0125-preview)    9.21%
GPT-4 8K (0613)               16.33%
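
Both of these derived tables follow directly from the mean scores reported earlier; here is a short Python sketch (mine, for illustration) that reproduces them:

# Reproduce the "% of GPT-4 8K" and "2nd Turn Drop" tables from the mean scores.
means = {  # judge -> (mean 1st-turn score, mean 2nd-turn score)
    "GPT-4o": (9.13125, 8.2814875),
    "GPT-4-Turbo (0125-preview)": (8.290625, 7.5270175),
    "GPT-4 8K (0613)": (8.41875, 7.04375),
}

baseline = means["GPT-4 8K (0613)"][0]  # 1st-turn mean of the GPT-4 8K judge

for judge, (first, second) in means.items():
    pct_of_baseline = first / baseline * 100
    second_turn_drop = (first - second) / first * 100
    print(f"{judge}: {pct_of_baseline:.1f}% of baseline, "
          f"{second_turn_drop:.2f}% drop in the 2nd turn")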

Raw Scores

Model                            1st Turn   2nd Turn   Avg
GPT-4o #1                        9.14375    8.5625     8.853125
GPT-4o #2                        9.14375    8.3375     8.740625
GPT-4o #3                        9.1        8.15       8.625
GPT-4o #4                        9.1375     8.07595    8.610063
GPT-4-Turbo (0125-preview) #1    8.35       7.7        8.025
GPT-4-Turbo (0125-preview) #2    8.2875     7.64557    7.968553
GPT-4-Turbo (0125-preview) #3    8.3        7.4375     7.86875
GPT-4-Turbo (0125-preview) #4    8.225      7.325      7.775
GPT-4 8K (0613) #1               8.4875     7.2125     7.85
GPT-4 8K (0613) #2               8.5125     6.975      7.74375
GPT-4 8K (0613) #3               8.3        7.075      7.6875
GPT-4 8K (0613) #4               8.375      6.9125     7.64375
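
The Mean and StDev/Mean tables above can be reproduced from these raw runs with a few lines of Python (my own sketch; the deviation is the sample standard deviation divided by the mean):

# Reproduce the Mean and StDev/Mean tables from the raw per-run scores.
from statistics import mean, stdev

# judge -> four runs of (1st turn, 2nd turn, avg) scores, as reported above
raw = {
    "GPT-4o": [
        (9.14375, 8.5625, 8.853125), (9.14375, 8.3375, 8.740625),
        (9.1, 8.15, 8.625), (9.1375, 8.07595, 8.610063)],
    "GPT-4-Turbo (0125-preview)": [
        (8.35, 7.7, 8.025), (8.2875, 7.64557, 7.968553),
        (8.3, 7.4375, 7.86875), (8.225, 7.325, 7.775)],
    "GPT-4 8K (0613)": [
        (8.4875, 7.2125, 7.85), (8.5125, 6.975, 7.74375),
        (8.3, 7.075, 7.6875), (8.375, 6.9125, 7.64375)],
}

for judge, runs in raw.items():
    for label, column in zip(("1st Turn", "2nd Turn", "Avg"), zip(*runs)):
        m = mean(column)
        rel_sd = stdev(column) / m  # sample stdev as a fraction of the mean
        print(f"{judge:28s} {label:8s} mean={m:.5f}  stdev/mean={rel_sd:.5f}")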

About

MT-Bench is a quick (and dirty?) way to evaluate a chatbot model (a fine-tuned, instruction-following LLM). When a new open-source model is published on Hugging Face, it is not uncommon to see its MT-Bench score presented as a testament of quality. For roughly $5 worth of OpenAI API calls, you get a good ballpark estimate of how your model does, which makes it a handy tool for iterating on the fine-tuning of an assistant model.

MT-Bench is a Python program that asks the target model 80 predefined questions (doing inference via HF Transformers or an OpenAI-compatible API endpoint). The questions cover Humanities, STEM, Extraction, Roleplay, Writing, Reasoning, Math, and Coding. There are 2 turns: it asks a question and gets the answer (1st turn), then adds a follow-up question and collects the 2nd answer (2nd turn). It then iterates through all questions and asks the GPT-4 model (the legacy 8K model from spring 2023) to score both answers on a scale from 1 to 10 (hence the lowest a model can get is 1, not 0 :). The results are 3 aggregate scores: 1st turn, 2nd turn, and average.

########## First turn ##########
score
model turn
stablelm-2-brief-1_6b_2 1 3.240506

########## Second turn ##########
score
model turn
stablelm-2-brief-1_6b_3 2 2.443038

########## Average ##########
score
model
stablelm-2-brief-1_6b_3 2.822785
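
To make the flow above concrete, here is a minimal sketch of the two-turn ask-and-judge loop against an OpenAI-compatible endpoint. This is my own illustration under stated assumptions (LM Studio's default local URL, a simplified judge prompt), not the actual MT-Bench code; the real judge prompts and score parsing live in FastChat's llm_judge scripts:

# Sketch of the MT-Bench flow: two-turn answer generation followed by
# single-answer grading on a 1-10 scale. Illustration only, not the real code.
from openai import OpenAI

target = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")  # local model
judge = OpenAI()  # api.openai.com, uses OPENAI_API_KEY

def ask(client, model, messages):
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content

def eval_question(question, followup, target_model, judge_model="gpt-4"):
    # Turn 1: the initial question
    history = [{"role": "user", "content": question}]
    answer1 = ask(target, target_model, history)
    # Turn 2: a follow-up question, with the first answer kept in context
    history += [{"role": "assistant", "content": answer1},
                {"role": "user", "content": followup}]
    answer2 = ask(target, target_model, history)
    # Judge each answer separately on a 1-10 scale (simplified judge prompt)
    scores = []
    for turn, answer in enumerate((answer1, answer2), start=1):
        prompt = (f"Rate the assistant's turn-{turn} answer on a scale from 1 to 10. "
                  f"Reply with the number only.\n\nQuestion: {question}\n"
                  f"Follow-up: {followup}\nAnswer: {answer}")
        scores.append(ask(judge, judge_model, [{"role": "user", "content": prompt}]))
    return scores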

As explained in this paper, which introduced MT-Bench and investigated the utility of an LLM as an evaluator, the score shows high agreement with human preferences, i.e. the larger the MT-Bench score, the higher the model tends to rank on the LMSYS Chatbot Arena.

Another popular option for LLM evaluation is AlpacaEval, which uses the newer and cheaper GPT-4 Turbo model as a baseline. The authors of AlpacaEval provide correlation coefficients of different evals with the LMSYS Arena, showing a strong association between the LLM judges' scores and human preferences at the Arena: