eigenclass logo
MAIN  Index  Search  Changes  PageRank  Login

Naïve suffix array search engine vs. Ferret vs. Lucene

Motivated by this performance comparison of Montezuma vs. Ferret and Lucene, I tried to see how my toyish full-text search engine based on suffix arrays fares against Ferret: benchmark.png

If the relative performance between Ferret and Lucene shown in the Montezuma benchmark holds, my toy engine is about 2x faster than Lucene when indexing small corpora.

Here are the scripts I used (taken from Ferret's wiki):


As expected, searching is quite slow. With a hot cache, this is what I get when searching half of linux-2.6.12's sources:

$ ruby lookup.rb test_FULLTEXT test_INDEX 
Input term: sprintf
Needed 0.002611 seconds for the first match.
Needed 1.71507 to collect 5441 hits (in 1118 docs).

I don't think there's much I can do to get below those 2.6ms for the first match, but collecting all the hits is way too slow, taking 0.3 ms per occurrence. This was fully expected because of the way the document URI for a match is obtained: it's essentially stored past the document data, so (as much as) the whole document is read before its URI is found. This was OK with the small RDoc corpus (with small articles) used by FastRI, but it's way too slow when there are many hits and documents are longer*1 Moreover, the document metadata is being loaded once per hit.

Also, when the cache is cold the first match takes longer --- maybe as much as 1/10th of a second. This is due to the O left ( log N right ) non-sequential accesses to the text during the binary search. Since suffix array algorithms implicitly assume that the suffix array will be in main memory, I can make the structure a bit richer to avoid at least some disk seeks. There are some O left ( m right ) (m == pattern length) searching algorithms, but they'll certainly involve a high constant overhead in Ruby, so for the corpora sizes I'm aiming at (things in the GB range) a simple revision of the basic O left ( log N right ) algorithm will perform better (in less code).

*1 Consistently with that hypothesis, the per-hit times are smaller for FastRI's shorter documents.