# Doing some n-gram analysis over Ruby's docs

The first attempts to optimize my pure-Ruby, 200LoC full-text search engine based on suffix arrays (which evolved into the in-progress FTSearch) led me to perform some n-gram analysis over Ruby's core/stdlib docs (as well as a pack of gems).

With a suffix array, locating a word in the corpus involves a binary search of complexity O(log2(N)), N being the total number of suffixes. Since the suffix array (SA) is but a large array of integer values representing offsets in the "full text" file (which correspond to the start of the suffixes), you need to read the text at the position indicated in the SA in order to compare it to the term you're looking for.
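The lookup can be sketched as follows; this is a toy in-memory model (names and layout are mine, not FTSearch's actual code), but it shows where the seeks come from: every probe of the binary search has to read the text at the offset stored in the SA.

```ruby
# Toy in-memory model of the lookup (illustrative names, not FTSearch's API).
# Each probe of the binary search reads the text at the offset stored in
# the suffix array -- on disk, that read is one seek.
def find_term(text, sa, term)
  lo, hi = 0, sa.size - 1
  while lo <= hi
    mid = (lo + hi) / 2
    suffix = text[sa[mid], term.size]   # the "seek + read" step
    case suffix <=> term
    when 0 then return sa[mid]          # offset of one occurrence
    when -1 then lo = mid + 1
    else hi = mid - 1
    end
  end
  nil
end

text = "banana"
sa = [5, 3, 1, 0, 4, 2]   # suffixes sorted: a, ana, anana, banana, na, nana
find_term(text, sa, "ana")   # => 1 (the occurrence at offset 1)
```

Each iteration halves the search interval, hence the O(log2(N)) comparisons, and each comparison is a potential seek.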

Therefore a corpus like the Linux sources (with some 20 million suffixes) would require about 25 disk seeks even if the suffix array itself were held in memory. At 10 ms per seek, that would be 250 ms when the file isn't cached...

The first idea that came to mind to minimize the number of seeks was storing the first n bytes of text right after the offset in the SA:

```
... offset(N) abcdefgh  offset(N+1) abcdefgh  offset(N+2) abcdefgi ...
                        (first 8 bytes coincide)
```

Ideally, if suffixes covered the whole "character space" uniformly, n bytes
would represent n * 8 fewer disk seeks. So n = 4, which would double the size
of the SA (offsets are 32 bit integers), would be enough to find any word in
a corpus under 4GB *without a single disk seek* when the SA is held in
memory.
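The idea can be sketched like this (a hypothetical layout, just to illustrate, not FTSearch's actual format): keep the first 4 bytes of each suffix next to its offset, compare against those in-memory bytes first, and only read the text file when the inline prefix alone cannot decide.

```ruby
# Hypothetical inline-suffix layout (FTSearch's real format differs):
# each SA entry carries its offset plus the first INLINE bytes of text.
INLINE = 4

def find_with_prefix(entries, text, term)
  seeks = 0
  lo, hi = 0, entries.size - 1
  k = [term.size, INLINE].min
  while lo <= hi
    mid = (lo + hi) / 2
    offset, prefix = entries[mid]
    cmp = prefix[0, k] <=> term[0, k]
    if cmp == 0 && term.size > INLINE
      seeks += 1                              # inline bytes tie: read the file
      cmp = text[offset, term.size] <=> term
    end
    case cmp
    when 0 then return [offset, seeks]
    when -1 then lo = mid + 1
    else hi = mid - 1
    end
  end
  [nil, seeks]
end

text = "banana"
sa   = [5, 3, 1, 0, 4, 2]   # suffixes sorted: a, ana, anana, banana, na, nana
entries = sa.map { |o| [o, text[o, INLINE].ljust(INLINE, "\0")] }
find_with_prefix(entries, text, "nan")    # => [2, 0]  no file read needed
find_with_prefix(entries, text, "banan")  # => [0, 1]  one read to break the tie
```

Terms no longer than 4 bytes are resolved entirely in memory; longer terms only hit the disk when the first 4 bytes tie.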

Of course, the distribution of suffixes is hardly uniform. We tend to repeat a limited number of words instead of writing random byte strings... I did some basic n-gram analysis over the documentation from the core/stdlib and some ~80 gems; character-wise to begin with (since this is what matters for the full-text search engine) and then word-based while I was at it.

This analysis proved that such "inline suffixes" add a large space overhead to the index with relatively little gains, so I adopted a slightly different solution.

### Counting n-grams

If you have FastRI and have used it to build a full-text index of your core/stdlib/gem docs, your ~/.fastri-fulltext/ directory will hold a "suffixes.dat" file (just a large array of 32 bit ints) and a "full_text.dat" file which holds the actual indexed documentation.

Collecting the n-grams is easy:

```ruby
MAX_SUFFIX_SIZE = 64
NGRAM_SIZES = [4, 8, 10, 12]

suffixes = []
File.open("full_text.dat", "rb") do |ft|
  File.open("suffixes.dat", "rb") do |f|
    until f.eof?
      offset = f.read(4).unpack("V")[0]
      ft.pos = offset
      suffix = ft.read(MAX_SUFFIX_SIZE).split(/\0/)[0]
      suffixes << suffix
    end
  end
end
puts "Total suffixes: #{suffixes.size}"

ngrams = Hash.new{|h,k| h[k] = Hash.new{|h2,k2| h2[k2] = 0} }
all = {}
suffixes.each do |suffix|
  all[suffix] = true
  NGRAM_SIZES.each{|n| ngrams[n][suffix[0,n]] += 1 }
end

puts <<EOF
===============================
Character-based n-gram analysis
===============================
#{all.size} unique #{MAX_SUFFIX_SIZE}-grams
A lookup would take on average #{Math.log(suffixes.size)/Math.log(2)} disk seeks.
EOF
```

With my index, I get:

```
Total suffixes: 224687
196625 unique 64-grams
A lookup would take on average 17.7775571295372 disk seeks.
```

### n-gram statistics

At this point, ngrams[n] is a hash mapping each n-gram to its number of occurrences, and we can try to obtain some interesting statistics:

- number of suffixes which start with the same n bytes (the same n-gram)
- mean, median, maximum number of suffixes per n-gram, as well as stddev
- how many disk seeks we can expect to save on average

The latter is actually the entropy of the n-gram distribution. If there's only one n-gram, knowing the first n bytes doesn't give you any usable information (the entropy is 0), and you don't save any disk seek; all you've done is increase the size of the SA with no gain whatsoever. Now, if there are two n-grams, 50% of the suffixes start with one and 50% with the other, then a binary search will take 1 disk seek less (you just saved 10 ms). If we consider an n-gram of n bytes, we can save at most 8n seeks, iff there are 2^(8n) suffixes and each starts with a different n-gram.

The gain (number of disk seeks saved) can thus be computed as

```
gain = - sum(i=1..M)  p_i * log2(p_i),    where  p_i = S_i / S
```

if there are M distinct n-grams, S_i is the number of suffixes starting with the i-th n-gram, and S = S_1 + ... + S_M is the total number of suffixes.
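As a quick sanity check of the formula (with a made-up distribution, not the real index): four n-grams covering 1/2, 1/4, 1/8 and 1/8 of the suffixes should yield 0.5*1 + 0.25*2 + 0.125*3 + 0.125*3 = 1.75 bits, i.e. 1.75 seeks saved on average:

```ruby
# Made-up n-gram counts (not from the real index): 16 suffixes split
# 8/4/2/2 across four 4-grams, i.e. probabilities 1/2, 1/4, 1/8, 1/8.
counts = { "retu" => 8, "meth" => 4, "obje" => 2, "arra" => 2 }
nsuffixes = counts.values.inject { |s, x| s + x }   # 16

gain = counts.inject(0.0) do |s, (_, c)|
  p = c.to_f / nsuffixes
  s - p * Math.log(p)
end / Math.log(2)

gain   # => 1.75 (up to float rounding)
```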

```ruby
def output_ngram_stats(ngrams, n)
  h = ngrams[n]
  nsuffixes = h.values.inject{|s,x| s+x}
  sq_avg = 1.0 * h.values.inject(0){|s,x| s + x ** 2} / h.size
  avg = 1.0 * nsuffixes / h.size
  entropy = h.inject(0){|s,(k,c)| p = 1.0 * c / nsuffixes; s - p * Math.log(p)} / Math.log(2)
  puts <<E
Number of #{n}-grams: #{h.size}
Mean:    #{avg}
Median:  #{h.values.sort[h.size / 2]}
Maximum: #{h.values.sort.last}
Stddev.: #{Math.sqrt(sq_avg - avg ** 2)}
Gain :   #{entropy} bits
E
end

NGRAM_SIZES.each{|n| output_ngram_stats(ngrams, n)}
```

I get this:

```
Number of 4-grams: 14616             Number of 10-grams: 104822
Mean:    15.3726737821565            Mean:    2.14350995020129
Median:  2                           Median:  1
Maximum: 11857                       Maximum: 469
Stddev.: 121.607374446762            Stddev.: 5.35090264226657
Gain :   10.6729310320304 bits       Gain :   15.7399473128775 bits

Number of 8-grams: 73023             Number of 12-grams: 130326
Mean:    3.0769346644208             Mean:    1.72403818117643
Median:  1                           Median:  1
Maximum: 1257                        Maximum: 465
Stddev.: 11.1136402566513            Stddev.: 3.27607549288307
Gain :   14.6709814769394 bits       Gain :   16.3731975079201 bits
```

This means that storing the first 4 bytes of text would save about 11 disk seeks (some 100 ms)*1. That's much lower than the theoretical maximum of 32 seeks (one per bit).

Storing 8 bytes in each "inline suffix" only saves another 4 seeks, so the extra 4 bytes (4 seeks) aren't nearly as useful as the first 4 (10 seeks). This is expected, as when the first few letters of two words coincide, the probability of the next few also being the same is higher. Note how the gain decreases to about 1 bit per char between 4-grams and 8-grams, which is close to the entropy of English text, estimated between 0.6 and 1.3 bits per char.

### Word-based n-grams

Using the first English stop-words list Google gave me to reject suffixes starting with a stop word, I got the top 2-, 3- and 4-grams in Ruby's documentation; no big surprises:

```ruby
WORD_NGRAM_SIZES = [2, 3, 4]
word_ngrams = Hash.new{|h,k| h[k] = Hash.new{|h2,k2| h2[k2] = 0} }

stopwords = {}
File.foreach("stop_words"){|l| stopwords[l.chomp] = true}

all.each do |suffix, _|
  words = suffix.split(/\s+/).map{|x| x.downcase}
  WORD_NGRAM_SIZES.each do |siz|
    next unless words.size >= siz
    w = words[0,siz]
    next if stopwords[w.first]
    word_ngrams[siz][w] += 1
  end
end

WORD_NGRAM_SIZES.each do |siz|
  puts "Top #{siz}-grams:"
  word_ngrams[siz].sort_by{|k,v| -v}[0...20].each do |words, count|
    puts "%6d %s" % [count, words.join(" ")]
  end
  puts
end
```

```
Top 2-grams:
   499 returns the       170 end end        125 defaults to
   281 returns a         158 used to        118 returns an
   278 true if           141 creates a      118 return the
   234 alias for         139 create a       114 method is
   231 returns true      139 value of       108 block is
   220 array of          137 list of        107 using the
   197 number of         137 does not
```

```
Top 3-grams:
   203 returns true if        34 used as the
   111 true if the            33 wed apr 09
    94 returns a new          32 returns a string
    84 returns an array       32 based on the
    76 creates a new          30 added to the
    64 create a new           30 block is given,
    59 value of the           28 allows you to
    45 returns the number     27 select * from
    44 end of the             26 h = {
    35 returns nil if         26 according to the
```

```
Top 4-grams:
    92 returns true if the            18 time.now #=> wed apr
    63 returns an array of            18 b' => 200 }
    44 returns the number of          17 100, 'b' => 200
    25 true if the named              17 returns true if stat
    22 available on all platforms.    17 returns a new array
    22 a' => 100, 'b'                 15 returns a new time
    21 boolean | true if              15 t = time.now #=>
    20 h = { 'a'                      15 specified as a hash.
    20 a', 'b', 'c' ]                 14 s = stringscanner.new('test string')
    20 returns the value of           14 block once for each
```

### Top (char-based) n-grams

Here are the top n-grams for n = 4, 8, 12:

```ruby
puts "=" * 80
puts "Top n-grams:"
NGRAM_SIZES.each{|n|
  p ngrams[n].sort_by{|k,c| -c}.map{|k,c| [k[/[^\0]*/], c]}[0..50]
}
```

```
 11857 the    1255 The     910 not     692 new     544 used
  2746 and    1187 valu    872 in t    625 data    539 is t
  2433 for    1170 will    863 cont    618 give    536 whic
  1756 of t   1159 retu    857 from    615 test    529 as a
  1528 meth   1103 are     851 call    604 all     528 arra
  1498 obje   1091 clas    843 end     600 you     517 fals
  1447 name   1040 to t    776 is a    578 inst    499 inte
  1409 Retu   1018 file    747 spec    574 if t
  1408 this   1000 stri    732 can     558 to a
  1383 with    950 true    721 opti    547 defa
  1382 that    927 This    696 bloc    545 File
```

```
  1257 Returns    320 methods    275 associat   231 This is    189 the valu
   546 will be    310 with the   268 the curr   230 exceptio   188 of this
   468 returns    305 from the   266 characte   223 and the    187 element
   452 argument   301 true if    260 an array   218 Errno::E   182 connecti
   419 specifie   301 paramete   257 objects    217 require    181 informat
   410 attribut   300 the give   253 returned   210 This met   180 elements
   388 instance   292 transact   248 director   202 a string   177 availabl
   352 Alias fo   290 document   242 number o   200 the bloc   177 variable
   329 for the    285 the name   242 the same   197 options    177 function
   326 default    284 current    236 represen   195 array of   172 of the c
```

```
   465 Returns the     99 the name of    69 value of the
   252 the current     98 the document   65 an exception
   249 Returns true    95 correspondin   65 converted to
   210 This method     89 the associat   64 Transaction:
   156 an array of     88 Returns a ne   64 containing t
   136 true if the     88 documentatio   64 false
   126 ActiveRecord    84 automaticall   61 Returns an a
   126 the number o    83 the specifie   61 is specified
   119 transaction     80 the value of   61 defaults to
   116 information     80 end            61 true
   111 returns the     79 the attribut   61 argument is
   110 name of the     77 representing   60 example.com'
   110 the default     76 Defaults to    58 the named fi
   107 the followin    73 Creates a ne   58 Julian Day N
   106 this method     72 transaction_   57 the end of t
   105 produces:       70 object.       56 000\000\000\
   100 and returns     70 Equivalent t   55 a', 'b', 'c'
```

*1 Since that corpus is so small, most of the full-text file will be cached after the first few seeks, so a lookup will normally take much less than 17 disk seeks. However, the decrease in the number of disk seeks will hold for much larger corpora if the n-gram distribution is similar.
