# sort_by{rand}'s bias makes BigDecimal cry

*sort_by{ rand }*
is a common idiom to shuffle an array in Ruby. It's short, looks good and is fairly efficient
(despite being ) since Array#sort_by is done at C speed.

Unfortunately, it is biased: if the same rand() value is returned for two different elements, their relative order will be preserved.

%w[a b c d].sort_by{ 0 } # => ["a", "b", "c", "d"] i = 0 %w[a b c d].sort_by{ 10 - (i += 1) / 2 } # => ["d", "b", "c", "a"]

This means that permutations preserving the relative order of one (or more) pair of elements of the original array are a bit more probable.

How often will that happen? The Mersenne Twister PRNG used by Ruby has been proven to have a period of , so it would seem at first sight that the answer is "not before the sun turns into a nova". But Kernel#rand does not return the full state of the PRNG: you only get 53 bits of pseudo-randomness out of a call to rand(). This is becoming more interesting.

### The birthday paradox

The birthday paradox states that given 23 people in a room, the odds are over
50% that two of them will have the same birthday.
It is not actually a paradox (there's no self-contradiction), it
*merely* defies common intuition.

It turns out that the problem with sort_by{ rand } is another instance of the birthday paradox. In this case, we have a room with N people (N being the size of the array to shuffle) and there are "days in a year".

Let's try to reproduce the 23-50% result, and apply the method to the other one.

There are

ways to assign a birthday to *m* people without any repetition. Therefore,
the probability that at least two people share a birthday is

In Ruby terms, we first have to define **factorial**:

class Integer def fact0 case self when 0..1 1 else self * (self - 1).fact0 end end end

This is fairly slow and will typically fail for or so, depending on the stack size. Stirling's approximation proves helpful here: we'll use

class Integer def fact Math.sqrt(2*Math::PI*self) * Math.exp(-self) * Math.exp(self * Math.log(self)) end end

It compares rather well to the recursive factorial:

%w[fact0 fact].each do |m| t = Time.new eval <<-EOF 200.times{|i| (i+1).#{m}} EOF "#{m}: #{Time.new - t}" # => "fact0: 0.125991", "fact: 0.001527" end

Time to define A(n,m):

def A(n,m) n.fact / (n-m).fact end def C(n,m) A(n,m) / m.fact end

Given that definition, the well-known 23-50% figures should be easy to reproduce:

(21..23).each do |i| [i, "%5.3f" % (1 - A(365,i)/(365**i))] # => [21, " nan"], [22, " nan"], [23, " nan"] end

Something's wrong, and the culprit is not hard to locate:

A(365, 20) # => NaN

### BigDecimal to the rescue

Ruby's standard library includes an extension for arbitrary precision floats, BigDecimal, whose very name suggests that we can think of it as Bignum's cousin. BigDecimal objects have to be initialized with a String, but that's not much of a problem

require 'bigdecimal' require 'bigdecimal/math' BigDecimal.limit(100) class Integer BigMath = Class.new{ extend ::BigMath } def fact(prec = 10) # !> method redefined; discarding old fact BigDecimal.new((Math.sqrt(2*Math::PI*self) * Math.exp(-self)).to_s) * BigMath.exp(BigDecimal.new(self.to_s) * BigMath.log(BigDecimal.new(self.to_s), prec), prec) end end

Let's go back to the birthday problem:

(21..23).each do |i| [i, "%5.3f" % (1 - A(365,i)/(365**i))] # => [21, "0.444"], [22, "0.476"], [23, "0.507"] end

Good.

### Quantifying the bias

I'd now like to just do something like

A(2**53, 1000000) / (2**53) ** 1000000

but it's obvious that won't fly. This is where a tiny bit of mathematical insight proves much more useful than anything I could possibly find in the standard library. I can find a cheap lower bound for the probability of repetition as follows: there are possible pairs for the N picks of rand() I'll do. For each of them, the probability that the second has got the same value is . So the probability of having at least one repetition is going to be higher than

which can be approximated as

Let's code that:

include BigMath N = 2 ** 53 PICKS = [3, 4, 6, 7].map{|n| 10 ** n} Math.log(N)/Math.log(10) # => 15.954589770191 PICKS.each do |i| picks = BigDecimal.new(i.to_s) p_bias = "%7.5e" % (picks * (picks - 1) / (2 * N)) [i, p_bias] # => [1000, "5.54556e-11"], [10000, "5.55056e-09"], # [1000000, "5.55111e-05"], [10000000, "5.55111e-03"] end

So

arr.sort_by{ rand }

is going to be biased for an array with 10 million elements with a probability of around 0.55%. Just to make sure those figures are OK, here's another way to estimate that probability:

PICKS.each do |i| picks = BigDecimal.new(i.to_s) p_bias = "%7.5e" % (1.0 - exp(-picks*(picks -1) / (2*N), 100)) [i, p_bias] # => [1000, "5.54558e-11"], [10000, "5.55056e-09"], # [1000000, "5.55096e-05"], [10000000, "5.53574e-03"] end

### Back to sort_by

While there are better ways to shuffle
an array in theory (both less complex and unbiased), I'll keep using *sort_by{rand}*
for small arrays.

##### lee b. 2006-02-16 (Thr) 21:25:34

def recfact(hi, lo=2) (hi-lo>42) ? recfact(hi,(hi+lo)/2) * recfact((hi+lo)/2 - 1, lo) : (lo..hi).inject {|f,i| f*i} end

- 91 http://www.artima.com/forums/flat.jsp?forum=123&thread=139385
- 20 http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/168963
- 17 http://chneukirchen.org/anarchaia
- 13 http://www.artima.com/buzz/community.jsp?forum=123
- 8 http://anarchaia.org
- 5 http://search.msn.com/spresults.aspx?q=sort_by
- 5 http://anarchaia.org/archive/2005/12.html
- 4 http://www.anarchaia.org
- 3 http://search.msn.com/spresults.aspx?q=sort_by&first=21&count=10&FORM=POPR
- 3 http://chneukirchen.org/anarchaia/archive/2005/12/05.html

Keyword(s):[blog] [ruby] [sort_by] [rand] [bigdecimal] [birthday] [paradox]

References:[Ruby]