eigenclass logo
MAIN  Index  Search  Changes  PageRank  Login

Elo ratings for the Benchmarks Game (aka Great Computer Language Shootout)

The geometric mean, as used by the Computer Language Benchmarks Game, gives too much importance to outliers and results in unstable rankings that do not reflect overall performance consistency well. The Elo rating system can be applied to the same results to obtain a better ranking where the influence of small speed perturbations is minimized and all-round performance is favored.

Even though the Great Computer Language Shootout is now called the Computer Language Benchmarks Game, its results are often used incorrectly to compare language implementations (and even worse, this is generalized to languages as well) performance-wise by merely comparing the scores without any further thought. Many of the problems stem from the choice of the geometric mean of the execution time compared to the fastest entry for each benchmark and implementation. More on that below.

If it's a game, why not use an actual rating system like those used in chess or go? I use the Elo rating system. A win (see below for the definition of what represents a win) is worth one point, a draw 0.5 and a loss 0, and the expected score of program A against program B is

1 over { 1 + 10 sup { (R sub B - R sub A ) / 500 } }

where R sub X represents the "strength" (current rating) of the language implementation "X".

All the possible pairings of benchmark implementations are considered, and the ratings are updated at the end of each round (this is essential to avoid dependence on the order of the pairings) with

R' sub A  = R sub A + sum from benchmarks { K ( ActualScore sub A - ExpectedScore sub A ) }

(K is a suitably small coefficient to avoid oscillations.)

By choosing the scoring function carefully, the resulting ratings admit an interesting interpretation and allow to draw more meaningful conclusions than the geometric mean.

Here follow the Elo ratings for the x64 Ubuntu (Intel® Q6600® quad-core) data as of 2008-09-01, where a program is considered the winner when it runs in less than half as much time as its opponent. (The programs used to generate it and the justification of this scoring function are given below.)

            gpp   1405
            gcc   1397
          ocaml   1336
          mlton   1317
           java   1307
            ghc   1301
        fpascal   1271
            cal   1271
          scala   1261
           gnat   1252
           nice   1246
         csharp   1199
           hipe    952
       mzscheme    928
           perl    855
         python    792
            lua    791
           pike    767
       javaxint    763
            yap    728
         groovy    687
            tcl    624
            gst    607
            php    532
     javascript    410

This ranking includes languages omitted in the original one and the relative order differs a bit. This is because this rating reflects something different, arguably more interesting.

Source code

I use a small Ruby script (scrap.rb) to screen scrap the shootout site and dump the data in JSON format that will look like

 {
   "pidigits": {
     "mzscheme": 28.27,
     "hipe": 17.04,
     ...
   },
   "fannkuch": {
     ...
   },
   ...
 }

It is then processed by this OCaml program:

open Printf
module H = ExtHashtbl.Hashtbl
module L = ExtList.List
let (|>) x f = f x

let init_rating = 1000.
let k = 5.
let rounds = 100
let win_ratio = 1.0 /. 2.0

let actual_score ta tb =
  if ta <= win_ratio *. tb then 1.
  else if tb <= win_ratio *. ta then 0.
  else 0.5

let expected_score ra rb = 1. /. (1. +. 10. ** ((rb -. ra) /. 500.))

let new_ratings ra rb ta tb =
  let d = actual_score ta tb -. expected_score ra rb in
    (ra +. k *. d, rb -. k *. d)

let rating_deltas ra rb ta tb =
  let r1, r2 = new_ratings ra rb ta tb in
    (r1 -. ra, r2 -. rb)

type json shootout_data = (string * (string * float) assoc) assoc

let get t lang = H.find_default t lang init_rating
let inc t v k delta = H.replace t k (H.find_default t k v +. delta)

let run data =
  let ratings = H.create 13 in
  let do_match deltas (lang1, t1) (lang2, t2) =
    if lang1 <> lang2 then
      let d1, d2 = rating_deltas (get ratings lang1) (get ratings lang2) t1 t2 in
        inc deltas 0. lang1 d1;
        inc deltas 0. lang2 d2 in
  let round () =
    let ds = H.create 13 in
      L.iter (fun (_, rs) -> L.iter (fun l1 -> L.iter (do_match ds l1) rs) rs) data;
      H.iter (inc ratings init_rating) ds
  in
    for i = 1 to rounds do round () done;
    H.fold (fun lang r l -> (r, lang) :: l) ratings [] |> L.sort |> L.rev |>
    L.iter (fun (rating, lang) -> printf "%16s  %5.0f\n" lang rating)

let () = match Array.length Sys.argv with
    2 -> Json_io.load_json Sys.argv.(1) |> shootout_data_of_json |> run
  | _ -> printf "elo_rating <file>\n";
         exit 0

The only remarkable thing is the use of the json-static syntax extension to load the JSON data type-safely with only one line of code which declares the type and generates the shootout_data_of_json function used to verify and load JSON data:

 type json shootout_data = (string * (string * float) assoc) assoc

Deficiencies of the geometric mean and justification of the scoring function

One of the issues with means (both geometric and arithmetic) is that a single benchmark can affect the overall result enormously. A language can be slow in general but reach a high score because it is particularly good at one task.

The rating system can address this: a program that trumps the competition for a single benchmark but loses everywhere else should not rank high. The Elo rating system naturally limits the impact of any single benchmark.

As for the scoring function, I choose this definition of "win": a program wins if it is at least two times faster than its opponent. By considering only speed differences over a factor of 2, the resulting rating is less influenced by small perturbations (such as those attributable to the hardware). The rating system values consistently good performance; in other words, a language implementation rates high not by being the fastest at a few tasks, but by not sucking at anything.

With the expected score function shown above, the ratings can be interpreted as follows: for a program written in X, there's a 90% probability of it being at least two times faster than the same program written in language Y which rates 500 points lower than X. If the difference is around 1000 points (as happens with GCC C/C++ and Javascript), the probability approaches 99%.



Missing ruby - Kevin Ballard (2008-09-01 (Mon) 05:03:10)

Ruby doesn't show up in your list of language scores.

vz 2008-09-01 (Mon) 07:27:13

The obvious reason is that ruby loses to Javascript and therefore reach top 25.

The real reason is that ruby isn't included in the quad-core bench yet: http://shootout.alioth.debian.org/u64q/benchmark.php?test=all&lang=all

mfp 2008-09-01 (Mon) 07:28:47

That's because there's no Ruby data for x64 Ubuntu Intel Q6600 quad-core. It would probably be just above yap or groovy. Here's the ranking based on the Gentoo Pentium 4 benchmarks:

           icpp   1487
            icc   1482
            gcc   1455
            gpp   1454
         lisaac   1442
        fpascal   1424
            ifc   1411
          dlang   1397
          clean   1392
            ooc   1376
          smlnj   1373
           gnat   1360
         javaxx   1351
          ocaml   1344
           java   1337
          mlton   1324
             se   1320
           sbcl   1312
            ghc   1293
     javaclient   1290
         fbasic   1282
         java14   1274
         csharp   1259
            cal   1253
          scala   1251
           nice   1237
            gcj   1234
            znn   1231
            g95   1200
         fsharp   1159
        mercury   1153
       bigforth   1134
           hipe   1039
         luajit   1023
         ikarus   1019
          psyco   1017
       mzscheme   1006
        chicken    907
             vw    904
       javaxint    895
         gforth    874
            lua    823
           pike    779
            php    761
             oz    752
         python    741
           perl    712
           yarv    664
           icon    620
           iron    617
           ruby    603
      swiprolog    591
            tcl    569
            yap    563
         groovy    553
         squeak    550
          jruby    499
            gst    496
     javascript    478
           cint    352
          rhino    350
          rebol      0
             io      0
Name:
Comment:
TextFormattingRules
Preview:

Hasse diagram - hsuh (2008-09-01 (Mon) 05:54:30)

What about something like http://tartarus.org/simon/2008-olympics-hasse/ for the Computer Language Shootout? To see if there is any language that is really better than another...

mfp 2008-09-01 (Mon) 07:34:25

There are relatively few language pairings for which one is better than the other at all benchmarks. I suspect a Hasse diagram would show essentially two large clusters with few internal arrows, with the implementations that don't suffer from the interpretation overhead pointing to those that do.

Name:
Comment:
TextFormattingRules
Preview:

x2 factor - Paolo Bonzini (2008-09-01 (Mon) 08:31:37)

A x2 factor seems very hard. What happens if you change the factor down to 1.5 or 1.2?

mfp 2008-09-01 (Mon) 09:11:51

With 1.5, nothing changes below mzscheme. The top entries become

            gpp   1536
            gcc   1520
           java   1404
          ocaml   1377
            ghc   1339
           nice   1298
          mlton   1295
           gnat   1289
          scala   1284
            cal   1283
        fpascal   1254
         csharp   1167
           hipe    946
Name:
Comment:
TextFormattingRules
Preview:

X factor - Justin Ritter (2008-09-01 (Mon) 08:52:29)

The whole x2 factor thing jsut totally blows me away.

http://www.useurl.us/17n

Name:
Comment:
TextFormattingRules
Preview:

igouy2@yahoo.com - Isaac Gouy (2008-09-01 (Mon) 09:50:32)

> The geometric mean, as used by the Computer Language Benchmarks, gives too much importance to outliers and results in unstable rankings...

You don't seem to have tried to show that in the benchmarks game it does in fact result in unstable rankings - where's your evidence?

> Even though the Great Computer Language Shootout is now called the Computer Language Benchmarks Game ...

Even though it's called the benchmarks game some people keep calling it the language shootout, somehow failing to read the really big letters we write Benchmarks Game with :-)

Just stop calling it the language shootout! That won't change how people choose to understand the data but eventually we might all get the name right.

mfp 2008-09-01 (Mon) 10:32:30

You don't seem to have tried to show that in the benchmarks game it does in fact result in unstable rankings - where's your evidence?

This is fairly obvious and caused by the very nature of the means; I don't think it needs much justification really (for instance, GHC was at the top for a while a couple days ago or so). Compare this to this and look at the scores. With the Elo rating, the inclusion or not of threadring only results in a difference of 6 points for GHC.

The very text in the "Create your own ranking" page acknowledges this implicitly: "What fun! Can you manipulate the multipliers and weights to make your favourite language the best programming language in the Benchmarks Game?". It is because the weighted geometric mean is used that the performance at a single benchmark (and the weight assigned to it) can make a very large difference. The ranking according to the Elo rating system is more stable.

Isaac Gouy 2008-09-01 (Mon) 18:00:32

> ... I don't think it needs much justification really

iirc Aristotle thought it fairly obvious that men and women had different numbers of teeth because it followed from the very nature of men and women - apparently he did not think this needed the justification of actually looking in people's mouths.

> Compare this to this and look at the scores.

You do realize that the Q6600 measurements are still being accumulated?

Do that comparison with the Gentoo Pentium 4 measurements and tell us how much (or how little) difference it makes.

mfp 2008-09-02 (Tue) 04:33:18

iirc Aristotle thought it fairly obvious that men and women had different numbers of teeth because it followed from the very nature of men and women - apparently he did not think this needed the justification of actually looking in people's mouths.

The difference is that one statement stems from a mathematical definition and the other from Aristotle's ass ;-)

You do realize that the Q6600 measurements are still being accumulated? Do that comparison with the Gentoo Pentium 4 measurements and tell us how much (or how little) difference it makes.

It's not as dramatic as with the quad Q6600, but the difference is still fairly large; note for instance how the fact that thread-ring does run with C Intel but not with C++ Intel makes the former seem 20% slower overall when that benchmark is enabled. The Elo ratings OTOH (and not only the rankings) are stable even with few data points: the difference in rating for GNU C between enabling and disabling thread-ring is around 2%... Admittedly, as the number of benchmarks approaches infinity, the geometric mean will work better (because the disparity in speed for any given benchmark is limited). Its limitations are more obvious when there are few measurements.

Isaac Gouy 2008-09-02 (Tue) 10:33:48

> the difference is still fairly large

For some definition of fairly large :-)

1.7 Haskell GHC 2.08 with thread-ring

1.8 Haskell GHC 2.19 without thread-ring

> for instance, GHC was at the top for a while a couple days ago or so

... because gcc and gpp are being /excluded/ because they have < 70% completed programs, because I haven't installed all the libraries - yet. Nothing to do with the ranking method.

> Admittedly, as the number of benchmarks approaches infinity, the geometric mean will work better ...

What about as the number of benchmarks approaches 15?

I like that you've been imaginative in responding to what you saw as a problem with the benchmarks game, and then gone ahead prototyping an alternative and exploring what it can do!

What you've done is interesting, it isn't obvious to me that it's "better".

It seems likely that the normal distribution assumption of ELO doesn't apply to the benchmarks game measurements - very slow or very fast performance of a particular program is not an aberration that will go away the next time that program is measured.

mfp 2008-09-03 (Wed) 17:23:01

1.7 Haskell GHC 2.08 with thread-ring

1.8 Haskell GHC 2.19 without thread-ring

That's not the most affected entry; I was referring to

the fact that thread-ring does run with C Intel but not with C++ Intel makes the former seem 20% 15% slower overall when that benchmark is enabled.

Some implementations are penalized, others are rewarded for not having that benchmark(!). This wouldn't be an issue if we had all the required results, but it reflects that there are problems with the geometric mean and incomplete data.

With the Elo rating system, the difference in rating for GNU gcc is below 3% in the Q6600 dataset, with fewer data points (by way of comparison, GNU gcc's geometric mean fluctuates by 30% depending on whether thread-ring is enabled or not).

What you've done is interesting, it isn't obvious to me that it's "better".

It seems likely that the normal distribution assumption of ELO doesn't apply to the benchmarks game measurements - very slow or very fast performance of a particular program is not an aberration that will go away the next time that program is measured.

Note that I'm not considering each measurement as a stochastic process sampled to produce a number of random variables, but as a single random variable: the ratings are computed from scratch each time the data is processed instead of building on the previous info. (The above program did employ multiple rounds to achieve convergence, but the results with a single one are similar and could be used if iterating proved to be a problem; it just requires a more careful choice of the K constant.)

You're right in that the Elo approach probably isn't statistically well-grounded (even if the assumptions hold, I read). It does work better than the geometric mean given few data points, though, as is the case with the Q6600 set. It's probably not worth the effort in this regard given enough data (with maybe a few more benchmarks than those in the older Gentoo runs).

Another thing I like about the Elo rating, apart from the limited influence of any single benchmark, is that it's harder to draw the flawed conclusions the geometric mean is easily subject to, like "lang X is 15% slower than Y". Instead, the rating gives you the a priori probability of a given program being substantially faster than some other depending on the implementation languages (given enough optimization effort). In other words, ratings are a bit more abstract and harder to abuse.

Isaac Gouy 2008-09-04 (Thr) 09:25:46

> It does work better than the geometric mean given few data points, though, as is the case with the Q6600 set.

In a couple of weeks the Q6600 data sets will have settled down, so maybe (as you have the program written) you could take another snapshot then.

> ... harder to draw the flawed conclusions ... harder to abuse.

I think you underestimate what talent and effort can achieve :-)

Isaac Gouy 2008-09-16 (Tue) 13:11:54

You can get the current data files using anonymous csv

cvs -d :pserver:anonymous@cvs.alioth.debian.org:/cvsroot/shootout login

cvs -d :pserver:anonymous@cvs.alioth.debian.org:/cvsroot/shootout checkout shootout/website/websites/u32q/data/u32q_bulkdata.csv.bz2

Name:
Comment:
TextFormattingRules
Preview:

new file engine search - Visitor (2008-10-30 (Thr) 16:07:47)

I saw a very good review of it at http://newfileengine.com/ - use the search and follow the link…

Name:
Comment:
TextFormattingRules
Preview:

Use this form to create a new top-level comment; for direct replies to existing comments, use the text entries you'll find below.

Name:
Subject:

TextFormattingRules
Preview:
Last modified:2008/09/01 04:15:01
Keyword(s):[blog] [frontpage] [ruby] [ocaml] [shootout] [benchmark] [rating] [ranking] [Elo]
References: