Indexing faster than Ferret with some algorithmic help (an order of magnitude faster than Lucene? can't be)
I've realized that my initial performance comparisons were flawed because the index included neither the text nor the term vectors. According to Ferret's documentation, (and a basic understanding of inverted indices) term vectors are needed for creating search result excerpts and performing phrase searches. Also, since the index based on suffix arrays has a copy of the original text, it makes sense to have it stored in Ferret's index if the comparison is to be meaningful.
Meanwhile, I've also discovered that Ferret is much slower than I thought when you actually try to do something with the results, such as getting their URIs (otherwise, all you have is an internal document ID that doesn't tell you anything). In some quick tests, it needed over 0.30 seconds to return 1165 hits when looking for "sprintf" in linux's sources after a few runs, and over 8 seconds when the cache was cold. I think both figures will be quite easy to beat, but that will come later --- I want indexing to be fast to begin with, as I'll be running the indexer often while I develop this.
I've rewritten the indexer, made it modular (e.g. it can index documents with multiple fields, using different analyzers on each), and then implemented a couple functions in C --- some 150 lines of C, compared to Ferret's >50000... This is the beginning of FTSearch (I'll soon put the darcs repository online).
Here's how FTSearch compares to Ferret right now:
I used the Reuters corpus, with 19478 documents amounting to 18MB worth of text, and some 2.7 million suffixes (space-separated "words").
The sys time is suspiciously high in Ferret's case, so I might be doing something wrong. I'm reusing the indexer found in Ferret's wiki modified to store the fields and term vectors, with these settings:
:merge_factor => 100, :use_compound_file => true, :max_buffer_memory => 0x10000000, :max_buffered_docs => 20_000
There are under 20000 documents in the corpus and Ferret stays under 256MB RSS while indexing, so it shouldn't be hitting the disk too often --- at least, I'm not asking it to on purpose.
As for the index sizes, FTSearch's takes 30372KB and Ferret's 46516KB.
Two operations were taking most of the time: finding the suffixes, and sorting them. The first one can be rewritten trivially in C in a few lines and isn't that interesting per-se. The sorting problem, on the other hand, is a classic.
I was originally doing a mere
which should have been fairly fast, as far as Ruby code goes, since it was essentially calling qsort() on the strings. But there were two problems:
- allocating one string per suffix takes too much memory (at least 20 bytes, and only if we write this carefully to benefit from Ruby's copy-on-write, possibly more)
- you can easily get faster than qsort when sorting strings
I've used Bentley & Sedgewick's multikey quicksort (a sort of blend of Quicksort and radix sort) modified to use insertion sort for small subarrays, and to limit the depth (and hence the maximum query size, currently to terms under 256 bytes). It's easily an order of magnitude faster than qsort.
Another advantage is that it needs less memory, so I could test the indexers on a larger corpus. I indexed all the .c and .h files in Linux 2.6.12's sources: 14252 files, 160MB of data, over 20 million suffixes on a machine with 512MB RAM. These are the results:
Ferret scales a bit better, (
after all sorting is O(N log N)), but this doesn't really matter as the index will be split into multiple fragments (searching will become slower, though).
I'm going to rework the lookup system, applying some basic extensions to the suffix array to avoid excessive disk seeks, and using a document table to map hits to document IDs/URIs. I expect this to be fairly fast per-se, even in pure-Ruby, but I'm currently feeling I might have a speed demon if I also use ruby-mmap and maybe rewrite a couple functions in C. Let's see.
Why don't you use Bucket Sort ( http://en.wikipedia.org/wiki/Bucket_sort )? Non comparison sorts are Theta(n) (Since bucket sort is not a comparison sort, it is not subject to the W*(n log n) lower bound).
Demo groups have used bucket sort (or radix sort) for quite some time now to avoid the nlogn complexity of regular qsort.
If you are clever enough you can do the main loop of bucket sort in one (short) line of C code without any comparisons at all. It can be extremely fast (i remember quick sort taking a full scan to finish and bucket sort taking only a scanline).
Qsort is useful when you need to have a comparison operator between the elements you are sorting. When you don't need a comparison operator qsort is a lot slower than algorithms like bucket sort.
hmm is bucket sort directly utilizable in this case (i.e. wouldn't we rather need a radix sort)?
Quoting from Bentley and Sedgewick :
The primary challenge in implementing practical radix sorts is the case when the number of distinct keys is much less than the number of bins (either because the keys are all equal, or because there are not many of them). Multikey Quicksort may be thought of as a radix sort that gracefully adapts to handle this case, at the cost of slightly more work when the bins are all full.
(BTW., it's called multikey Quicksort, but it's not at all like qsort(), no strcmp() used)
They claim that multikey Quicksort is competitive with the fastest sorts around (in particular, they say that it was sometimes faster than radix sort), so I don't know if there's much to gain here, but I'll give it a try eventually.
Anyway, I just verified that only half of the time it takes to index the linux sources is spent sorting the array, so there's probably more to be gained elsewhere.
You are a hero! And not only performance counts... who can maintain 50kLoC???
nae for the time being this is just a toy; inverted indices will scale better, but I think it can be competitive up to a few GBs worth of data.
Any interest in packaging this as a gem after you put up the repository?
Thanks for the work, it's fun following along.
If it becomes release-worthy, the std release will include
- platform-independent gem (requires C toolchain)
- win32 binary gem (cross-compiled)
since I couldn't find a better place to post bugs of fri, here goes:
produces a crash. ('' is required since it completes the *'s otherwise)
Thanks, I had forgotten to escape the regexps. It's fixed in HEAD http://eigenclass.org/repos/fastri/head
If you need the fix right away, here's the patch: http://www.eigenclass.org/sp.cgi/view/kokg9j
You can also report bugs via rubyforge (http://rubyforge.org/projects/fastri/ ), by direct email (put fastri in the subject & my procmailrc will let it through), or even by posting to ruby-talk (I'll normally notice it).
I'm using ferret now, and would love to get a speedup in retrieving the docs. indexing about 700k docs right now. Any idea when you might put the code up?
*1 incidentally, I've realized that those figures were probably taken when storing documents and term vectors
- 30 http://www.artima.com/forums/flat.jsp?forum=123&thread=186219
- 22 http://planetruby.0x42.net
- 20 http://anarchaia.org
- 18 http://rubyblogs.org/aggy/tag/ftsearch
- 12 http://www.artima.com/buzz/community.jsp?forum=123
- 10 http://www.rubycorner.com
- 9 http://www.anarchaia.org
- 9 http://forum.rubyonbr.org/forums/4/topics/340
- 9 http://www.planetrubyonrails.com
- 7 http://rubyblogs.org/aggy/tag/ferret