Elo ratings for the Benchmarks Game (aka Great Computer Language Shootout)
The geometric mean, as used by the Computer Language Benchmarks Game, gives too much importance to outliers and results in unstable rankings that do not reflect overall performance consistency well. The Elo rating system can be applied to the same results to obtain a better ranking where the influence of small speed perturbations is minimized and all-round performance is favored.
Even though the Great Computer Language Shootout is now called the Computer Language Benchmarks Game, its results are often used incorrectly to compare language implementations (and even worse, this is generalized to languages as well) performance-wise by merely comparing the scores without any further thought. Many of the problems stem from the choice of the geometric mean of the execution time compared to the fastest entry for each benchmark and implementation. More on that below.
If it's a game, why not use an actual rating system like those used in chess or go? I use the Elo rating system. A win (see below for the definition of what represents a win) is worth one point, a draw 0.5 and a loss 0, and the expected score of program A against program B is

where
represents the "strength" (current rating) of
the language implementation "X".
All the possible pairings of benchmark implementations are considered, and the ratings are updated at the end of each round (this is essential to avoid dependence on the order of the pairings) with

(K is a suitably small coefficient to avoid oscillations.)
By choosing the scoring function carefully, the resulting ratings admit an interesting interpretation and allow to draw more meaningful conclusions than the geometric mean.
Here follow the Elo ratings for the x64 Ubuntu (Intel® Q6600® quad-core) data as of 2008-09-01, where a program is considered the winner when it runs in less than half as much time as its opponent. (The programs used to generate it and the justification of this scoring function are given below.)
gpp 1405
gcc 1397
ocaml 1336
mlton 1317
java 1307
ghc 1301
fpascal 1271
cal 1271
scala 1261
gnat 1252
nice 1246
csharp 1199
hipe 952
mzscheme 928
perl 855
python 792
lua 791
pike 767
javaxint 763
yap 728
groovy 687
tcl 624
gst 607
php 532
javascript 410
This ranking includes languages omitted in the original one and the relative order differs a bit. This is because this rating reflects something different, arguably more interesting.
Source code
I use a small Ruby script (scrap.rb) to screen scrap the shootout site and dump the data in JSON format that will look like
{
"pidigits": {
"mzscheme": 28.27,
"hipe": 17.04,
...
},
"fannkuch": {
...
},
...
}
It is then processed by this OCaml program:
open Printf module H = ExtHashtbl.Hashtbl module L = ExtList.List let (|>) x f = f x let init_rating = 1000. let k = 5. let rounds = 100 let win_ratio = 1.0 /. 2.0 let actual_score ta tb = if ta <= win_ratio *. tb then 1. else if tb <= win_ratio *. ta then 0. else 0.5 let expected_score ra rb = 1. /. (1. +. 10. ** ((rb -. ra) /. 500.)) let new_ratings ra rb ta tb = let d = actual_score ta tb -. expected_score ra rb in (ra +. k *. d, rb -. k *. d) let rating_deltas ra rb ta tb = let r1, r2 = new_ratings ra rb ta tb in (r1 -. ra, r2 -. rb) type json shootout_data = (string * (string * float) assoc) assoc let get t lang = H.find_default t lang init_rating let inc t v k delta = H.replace t k (H.find_default t k v +. delta) let run data = let ratings = H.create 13 in let do_match deltas (lang1, t1) (lang2, t2) = if lang1 <> lang2 then let d1, d2 = rating_deltas (get ratings lang1) (get ratings lang2) t1 t2 in inc deltas 0. lang1 d1; inc deltas 0. lang2 d2 in let round () = let ds = H.create 13 in L.iter (fun (_, rs) -> L.iter (fun l1 -> L.iter (do_match ds l1) rs) rs) data; H.iter (inc ratings init_rating) ds in for i = 1 to rounds do round () done; H.fold (fun lang r l -> (r, lang) :: l) ratings [] |> L.sort |> L.rev |> L.iter (fun (rating, lang) -> printf "%16s %5.0f\n" lang rating) let () = match Array.length Sys.argv with 2 -> Json_io.load_json Sys.argv.(1) |> shootout_data_of_json |> run | _ -> printf "elo_rating <file>\n"; exit 0
The only remarkable thing is the use of the json-static syntax extension to load the JSON data type-safely with only one line of code which declares the type and generates the shootout_data_of_json function used to verify and load JSON data:
type json shootout_data = (string * (string * float) assoc) assoc
Deficiencies of the geometric mean and justification of the scoring function
One of the issues with means (both geometric and arithmetic) is that a single benchmark can affect the overall result enormously. A language can be slow in general but reach a high score because it is particularly good at one task.
The rating system can address this: a program that trumps the competition for a single benchmark but loses everywhere else should not rank high. The Elo rating system naturally limits the impact of any single benchmark.
As for the scoring function, I choose this definition of "win": a program wins if it is at least two times faster than its opponent. By considering only speed differences over a factor of 2, the resulting rating is less influenced by small perturbations (such as those attributable to the hardware). The rating system values consistently good performance; in other words, a language implementation rates high not by being the fastest at a few tasks, but by not sucking at anything.
With the expected score function shown above, the ratings can be interpreted as follows: for a program written in X, there's a 90% probability of it being at least two times faster than the same program written in language Y which rates 500 points lower than X. If the difference is around 1000 points (as happens with GCC C/C++ and Javascript), the probability approaches 99%.
Missing ruby - Kevin Ballard (2008-09-01 (Mon) 05:03:10)
Ruby doesn't show up in your list of language scores.
vz 2008-09-01 (Mon) 07:27:13
The obvious reason is that ruby loses to Javascript and therefore reach top 25.
The real reason is that ruby isn't included in the quad-core bench yet: http://shootout.alioth.debian.org/u64q/benchmark.php?test=all&lang=all
mfp 2008-09-01 (Mon) 07:28:47
That's because there's no Ruby data for x64 Ubuntu Intel Q6600 quad-core. It would probably be just above yap or groovy. Here's the ranking based on the Gentoo Pentium 4 benchmarks:
icpp 1487
icc 1482
gcc 1455
gpp 1454
lisaac 1442
fpascal 1424
ifc 1411
dlang 1397
clean 1392
ooc 1376
smlnj 1373
gnat 1360
javaxx 1351
ocaml 1344
java 1337
mlton 1324
se 1320
sbcl 1312
ghc 1293
javaclient 1290
fbasic 1282
java14 1274
csharp 1259
cal 1253
scala 1251
nice 1237
gcj 1234
znn 1231
g95 1200
fsharp 1159
mercury 1153
bigforth 1134
hipe 1039
luajit 1023
ikarus 1019
psyco 1017
mzscheme 1006
chicken 907
vw 904
javaxint 895
gforth 874
lua 823
pike 779
php 761
oz 752
python 741
perl 712
yarv 664
icon 620
iron 617
ruby 603
swiprolog 591
tcl 569
yap 563
groovy 553
squeak 550
jruby 499
gst 496
javascript 478
cint 352
rhino 350
rebol 0
io 0
Hasse diagram - hsuh (2008-09-01 (Mon) 05:54:30)
What about something like http://tartarus.org/simon/2008-olympics-hasse/ for the Computer Language Shootout? To see if there is any language that is really better than another...
mfp 2008-09-01 (Mon) 07:34:25
There are relatively few language pairings for which one is better than the other at all benchmarks. I suspect a Hasse diagram would show essentially two large clusters with few internal arrows, with the implementations that don't suffer from the interpretation overhead pointing to those that do.
x2 factor - Paolo Bonzini (2008-09-01 (Mon) 08:31:37)
A x2 factor seems very hard. What happens if you change the factor down to 1.5 or 1.2?
mfp 2008-09-01 (Mon) 09:11:51
With 1.5, nothing changes below mzscheme. The top entries become
gpp 1536
gcc 1520
java 1404
ocaml 1377
ghc 1339
nice 1298
mlton 1295
gnat 1289
scala 1284
cal 1283
fpascal 1254
csharp 1167
hipe 946
X factor - Justin Ritter (2008-09-01 (Mon) 08:52:29)
The whole x2 factor thing jsut totally blows me away.
igouy2@yahoo.com - Isaac Gouy (2008-09-01 (Mon) 09:50:32)
> The geometric mean, as used by the Computer Language Benchmarks, gives too much importance to outliers and results in unstable rankings...
You don't seem to have tried to show that in the benchmarks game it does in fact result in unstable rankings - where's your evidence?
> Even though the Great Computer Language Shootout is now called the Computer Language Benchmarks Game ...
Even though it's called the benchmarks game some people keep calling it the language shootout, somehow failing to read the really big letters we write Benchmarks Game with :-)
Just stop calling it the language shootout! That won't change how people choose to understand the data but eventually we might all get the name right.
mfp 2008-09-01 (Mon) 10:32:30
You don't seem to have tried to show that in the benchmarks game it does in fact result in unstable rankings - where's your evidence?
This is fairly obvious and caused by the very nature of the means; I don't think it needs much justification really (for instance, GHC was at the top for a while a couple days ago or so). Compare this to this and look at the scores. With the Elo rating, the inclusion or not of threadring only results in a difference of 6 points for GHC.
The very text in the "Create your own ranking" page acknowledges this implicitly: "What fun! Can you manipulate the multipliers and weights to make your favourite language the best programming language in the Benchmarks Game?". It is because the weighted geometric mean is used that the performance at a single benchmark (and the weight assigned to it) can make a very large difference. The ranking according to the Elo rating system is more stable.
Isaac Gouy 2008-09-01 (Mon) 18:00:32
> ... I don't think it needs much justification really
iirc Aristotle thought it fairly obvious that men and women had different numbers of teeth because it followed from the very nature of men and women - apparently he did not think this needed the justification of actually looking in people's mouths.
> Compare this to this and look at the scores.
You do realize that the Q6600 measurements are still being accumulated?
Do that comparison with the Gentoo Pentium 4 measurements and tell us how much (or how little) difference it makes.
mfp 2008-09-02 (Tue) 04:33:18
iirc Aristotle thought it fairly obvious that men and women had different numbers of teeth because it followed from the very nature of men and women - apparently he did not think this needed the justification of actually looking in people's mouths.
The difference is that one statement stems from a mathematical definition and the other from Aristotle's ass ;-)
You do realize that the Q6600 measurements are still being accumulated? Do that comparison with the Gentoo Pentium 4 measurements and tell us how much (or how little) difference it makes.
It's not as dramatic as with the quad Q6600, but the difference is still fairly large; note for instance how the fact that thread-ring does run with C Intel but not with C++ Intel makes the former seem 20% slower overall when that benchmark is enabled. The Elo ratings OTOH (and not only the rankings) are stable even with few data points: the difference in rating for GNU C between enabling and disabling thread-ring is around 2%... Admittedly, as the number of benchmarks approaches infinity, the geometric mean will work better (because the disparity in speed for any given benchmark is limited). Its limitations are more obvious when there are few measurements.
Isaac Gouy 2008-09-02 (Tue) 10:33:48
> the difference is still fairly large
For some definition of fairly large :-)
1.7 Haskell GHC 2.08 with thread-ring
1.8 Haskell GHC 2.19 without thread-ring
> for instance, GHC was at the top for a while a couple days ago or so
... because gcc and gpp are being /excluded/ because they have < 70% completed programs, because I haven't installed all the libraries - yet. Nothing to do with the ranking method.
> Admittedly, as the number of benchmarks approaches infinity, the geometric mean will work better ...
What about as the number of benchmarks approaches 15?
I like that you've been imaginative in responding to what you saw as a problem with the benchmarks game, and then gone ahead prototyping an alternative and exploring what it can do!
What you've done is interesting, it isn't obvious to me that it's "better".
It seems likely that the normal distribution assumption of ELO doesn't apply to the benchmarks game measurements - very slow or very fast performance of a particular program is not an aberration that will go away the next time that program is measured.
mfp 2008-09-03 (Wed) 17:23:01
1.7 Haskell GHC 2.08 with thread-ring
1.8 Haskell GHC 2.19 without thread-ring
That's not the most affected entry; I was referring to
the fact that thread-ring does run with C Intel but not with C++ Intel makes the former seem
20%15% slower overall when that benchmark is enabled.
Some implementations are penalized, others are rewarded for not having that benchmark(!). This wouldn't be an issue if we had all the required results, but it reflects that there are problems with the geometric mean and incomplete data.
With the Elo rating system, the difference in rating for GNU gcc is below 3% in the Q6600 dataset, with fewer data points (by way of comparison, GNU gcc's geometric mean fluctuates by 30% depending on whether thread-ring is enabled or not).
What you've done is interesting, it isn't obvious to me that it's "better".
It seems likely that the normal distribution assumption of ELO doesn't apply to the benchmarks game measurements - very slow or very fast performance of a particular program is not an aberration that will go away the next time that program is measured.
Note that I'm not considering each measurement as a stochastic process sampled to produce a number of random variables, but as a single random variable: the ratings are computed from scratch each time the data is processed instead of building on the previous info. (The above program did employ multiple rounds to achieve convergence, but the results with a single one are similar and could be used if iterating proved to be a problem; it just requires a more careful choice of the K constant.)
You're right in that the Elo approach probably isn't statistically well-grounded (even if the assumptions hold, I read). It does work better than the geometric mean given few data points, though, as is the case with the Q6600 set. It's probably not worth the effort in this regard given enough data (with maybe a few more benchmarks than those in the older Gentoo runs).
Another thing I like about the Elo rating, apart from the limited influence of any single benchmark, is that it's harder to draw the flawed conclusions the geometric mean is easily subject to, like "lang X is 15% slower than Y". Instead, the rating gives you the a priori probability of a given program being substantially faster than some other depending on the implementation languages (given enough optimization effort). In other words, ratings are a bit more abstract and harder to abuse.
Isaac Gouy 2008-09-04 (Thr) 09:25:46
> It does work better than the geometric mean given few data points, though, as is the case with the Q6600 set.
In a couple of weeks the Q6600 data sets will have settled down, so maybe (as you have the program written) you could take another snapshot then.
> ... harder to draw the flawed conclusions ... harder to abuse.
I think you underestimate what talent and effort can achieve :-)
Isaac Gouy 2008-09-16 (Tue) 13:11:54
You can get the current data files using anonymous csv
cvs -d :pserver:anonymous@cvs.alioth.debian.org:/cvsroot/shootout login
cvs -d :pserver:anonymous@cvs.alioth.debian.org:/cvsroot/shootout checkout shootout/website/websites/u32q/data/u32q_bulkdata.csv.bz2
new file engine search - Visitor (2008-10-30 (Thr) 16:07:47)
I saw a very good review of it at http://newfileengine.com/ - use the search and follow the link…
- 545 http://www.reddit.com/r/programming
- 403 http://www.reddit.com
- 291 http://news.ycombinator.com
- 190 http://www.reddit.com/r/programming/comments/6z1a2/eigenclass_elo_ratings_for_the_benchmarks_game
- 37 http://news.ycombinator.com/news
- 23 http://popurls.com
- 21 http://www.reddit.com/r/programming/new
- 20 http://planetruby.0x42.net
- 14 http://www.artima.com/forums/flat.jsp?forum=123&thread=237643
- 12 http://news.ycombinator.com/newest