Outperforming Ferret at searching, 3X faster indexing, code online
The last time I blogged about FTSearch, the (simple) full-text search engine, it already indexed the Reuters corpus more than twice as fast as Ferret. Since then I have rewritten a few more methods in C for an extra 50% speed boost, making it over 3 times faster than Ferret at indexing.
Today's news is that I've implemented enough functionality to begin to compare searching performance. And FTSearch does often outperform Ferret (but read below).
FTSearch is currently powered by some 500 lines of C and 800 of Ruby, compared to over 50,000 LoC of C for Ferret. Of course, it's far from being as featureful as Ferret, but the #1 advantage of being small is that it's easier to debug, and I think it can be made more reliable than Ferret with relatively little effort (Ferret often crashed with a segfault while I was timing it; for instance, it got really angry at me when I searched the Linux sources for "void").
You can have a look at the code at http://eigenclass.org/repos/ftsearch/head
Here are some times for the Linux corpus (160MB of .c and .h files, 20 million suffixes).
$ ruby ferret-lookup.rb
Input term: sprintf
Needed 0.050867 for the search.
Needed 8.83525 to get the URIs.
Total time: 8.886119
Total matches: 1165
Showing top 10 matches:
0.100 corpus/linux/drivers/usb/host/uhci-debug.c
0.097 corpus/linux/drivers/scsi/aic7xxx_old/aic7xxx_proc.c
0.090 corpus/linux/drivers/isdn/hisax/q931.c
0.089 corpus/linux/drivers/s390/sysinfo.c
0.086 corpus/linux/drivers/pci/hotplug/cpqphp_sysfs.c
0.086 corpus/linux/drivers/pci/hotplug/shpchp_sysfs.c
0.083 corpus/linux/drivers/mca/mca-proc.c
0.077 corpus/linux/drivers/dio/dio-sysfs.c
0.077 corpus/linux/drivers/net/wan/lmc/lmc_debug.c
0.073 corpus/linux/drivers/media/radio/miropcm20-rds.c
0.072 corpus/linux/drivers/video/console/promcon.c
Input term: sprintf
Needed 0.004382 for the search.
Needed 0.322449 to get the URIs.
Total time: 0.326834
Total matches: 1165
[...]
$ ruby sample-lookup.rb
Input term: sprintf
Needed 0.085965 for the search.
Needed 0.008282 to rank 5556 hits into 1176 docs
Needed 0.001414 to get the URIs.
Total time: 0.095685
Showing top 10 matches:
44576 4913 corpus/linux/drivers/scsi/aic7xxx_old/aic7xxx_proc.c
41280 5073 corpus/linux/drivers/usb/host/uhci-debug.c
39200 4647 corpus/linux/drivers/s390/sysinfo.c
38720 4557 corpus/linux/drivers/pci/hotplug/cpqphp_sysfs.c
38704 4582 corpus/linux/drivers/pci/hotplug/shpchp_sysfs.c
34210 3640 corpus/linux/drivers/isdn/hisax/q931.c
31556 3718 corpus/linux/drivers/mca/mca-proc.c
27920 3839 corpus/linux/drivers/media/radio/miropcm20-rds.c
26925 4425 corpus/linux/drivers/net/wan/lmc/lmc_debug.c
25770 3228 corpus/linux/drivers/dio/dio-sysfs.c
25699 5293 corpus/linux/drivers/video/console/promcon.c
Input term: sprintf
Needed 0.000508 for the search.
Needed 0.007636 to rank 5556 hits into 1176 docs
Needed 0.001463 to get the URIs.
Total time: 0.009619
[...]
$ ruby sample-lookup.rb
Input term: big array
Needed 0.096039 for the search.
Needed 0.000118 to rank 3 hits into 3 docs
Needed 1.2e-05 to get the URIs.
Total time: 0.096197
Showing top 10 matches:
24752 10402 corpus/linux/include/asm-i386/mach-generic/irq_vectors_limits.h
24752 10414 corpus/linux/include/asm-i386/mach-summit/irq_vectors_limits.h
3529 685 corpus/linux/arch/i386/mm/boot_ioremap.c
Input term: big array
Needed 0.000571 for the search.
Needed 6.9e-05 to rank 3 hits into 3 docs
Needed 1.0e-05 to get the URIs.
Total time: 0.000659
[...]
$ ruby ferret-lookup.rb
Input term: "big array"
Needed 0.162661 for the search.
Needed 0.048348 to get the URIs.
Total time: 0.211011
Total matches: 3
Showing top 10 matches:
0.184 corpus/linux/include/asm-i386/mach-generic/irq_vectors_limits.h
0.184 corpus/linux/include/asm-i386/mach-summit/irq_vectors_limits.h
0.069 corpus/linux/arch/i386/mm/boot_ioremap.c
Input term: "big array"
Needed 0.001285 for the search.
Needed 0.000368 to get the URIs.
Total time: 0.001652
If you look at the times shown above, you'll notice that Ferret takes forever to fetch the URIs. FTSearch's problem is that the time required to rank the results is proportional to the number of hits, and there may be several hits per matching document (e.g. several occurrences of "sprintf" in a single .c file). This becomes a problem if you look for an overly general term, like, say, all words starting with 'w' in the Linux sources:
$ ruby sample-lookup.rb
Input term: w
Needed 0.102408 for the search.
Needed 0.579098 to rank 289186 hits into 11313 docs
Needed 0.017202 to get the URIs.
Total time: 0.698738
[...]
Input term: w
Needed 0.000537 for the search.
Needed 0.424715 to rank 289186 hits into 11313 docs
Needed 0.016367 to get the URIs.
Total time: 0.441637
As you can see, that's half a second to rank 289186 hits into 11313 matching documents. Compare it to Ferret:
$ ruby ferret-lookup.rb
Input term: w*
Needed 0.101505 for the search.
Needed 7.163373 to get the URIs.
Total time: 7.264879
[...]
Input term: w*
Needed 0.051435 for the search.
Needed 0.331753 to get the URIs.
Total time: 0.383191
Total matches: 1268
Ignore for now the fact that Ferret returns fewer matches (probably due to the stopword list). And ignore Ferret's 7 seconds (!!) with a cold cache (although that might be hard to overlook in many situations).
With a hot cache, Ferret's search time was 51ms, which is much slower than FTSearch's 0.5ms. But Ferret can give you a list of documents ranked by their scores in that time (of course, I'm assuming it wouldn't be much slower if it returned the >10000 docs it should). FTSearch has to bucket the 290000 hits into 11313 documents and then sort them, which takes 0.42s. This means that if you want e.g. the top 50 matches (along with their URIs), Ferret can have them ready for you in less than 10ms, while FTSearch would still need 0.4s, because it has to score and sort *all* matches.
Is that acceptable? The first answer is "no, it sucks!". However, there are some things to consider:
- most searches are not that general; if you look for something more than a couple of letters long, you'll get far fewer matches
- I haven't seriously optimized document scoring yet; even if 0.4s to search all the Linux sources is unacceptable, would you say the same about 0.04s?
- can we always discard the time for the first run? If not, FTSearch is still in the same order of magnitude as Ferret, and I can make it faster than that, keeping the speed advantage up to hundreds of thousands of documents and several GBs worth of data
- I can do probabilistic scoring: rank the documents in bounded time while giving guarantees like "if a document has x times more hits than the average, it'll be within the top M docs with some probability p (close to 1)"
- there are other ways to score documents besides tf-idf; proximity search, for one, might be more interesting
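To make the cost model discussed above concrete, here is a minimal sketch (hypothetical names, not FTSearch's actual code) of ranking hits into documents. The bucketing pass is linear in the number of hits M, and only then can the Q document buckets be sorted, which is why an overly general term dominates the total time:

```ruby
# Hedged sketch: why ranking cost grows with the number of hits.
# Each hit (an offset into the full-text store) must be bucketed into its
# document before anything can be sorted; doc_for is a toy offset->doc map.

def rank(hits, doc_for)
  scores = Hash.new(0)
  hits.each { |off| scores[doc_for.call(off)] += 1 }  # O(M) over all hits
  scores.sort_by { |_doc, n| -n }                     # O(Q log Q) over docs
end

doc_for = ->(off) { off / 1000 }  # pretend documents are 1000-byte slices
few  = Array.new(1_000)   { rand(1_000_000) }
many = Array.new(300_000) { rand(1_000_000) }
# rank(many, doc_for) does ~300x the bucketing work of rank(few, doc_for),
# even though both sort at most 1000 document buckets at the end.
```

In this toy model the final sort is bounded by the number of documents, so with 290000 hits spread over ~11000 documents the per-hit bucketing pass is where the time goes.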
I have to address the scoring issue (I think I can make it at least an order of magnitude faster just by changing the document map implementation), and start to work on chaining several "fragments" (parts of the corpus indexed separately) and merging indices, so that incremental indexing becomes practical. Other things in the TODO: multiple suffix arrays bound to a single full-text store, more analyzers, column-based storage of some values...
Hi Mauricio - I must say I'm enjoying reading about what you're doing here.
However, if you really want a challenging search/indexing engine to benchmark against, have a look at Zettair - it is being written by a friend of mine doing his PhD at RMIT University here in Melbourne, Australia. It is built entirely for speed and efficiency, and it is all based on some of the most cutting edge research in IR today.
I thought you might find it interesting, and perhaps pick up a few pointers (no pun intended) from there :)
Oh, I should mention, it is optimised for parsing HTML and TREC collections, but arbitrary file type handlers can be added.
Thanks for the link! Actually, I was hoping to get pointers to other search engines when I wrote this ;-)
I've read this casually, and I was struck by the phrase "because it has to sort all matches". I don't know the context, and you probably know this, but to get just the top K from a list of N one would generally filter results through a min-heap to obtain those K and then sort only those K elements. To get the next K elements one can remember the previous threshold and use a min-heap again, et cetera. The filter stage takes O(N log K) rather than the O(N log N) of a full sort.
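The selection scheme the comment describes could be sketched like this in Ruby (Ruby's standard library has no heap, so a tiny min-heap is inlined; all names are illustrative):

```ruby
# Top-K selection with a bounded min-heap: O(N log K) instead of the
# O(N log N) of sorting all N scores.

class MinHeap
  def initialize
    @a = []
  end

  def size
    @a.size
  end

  def min
    @a.first
  end

  def push(x)
    @a << x
    i = @a.size - 1
    while i > 0 && @a[(i - 1) / 2] > @a[i]
      p = (i - 1) / 2
      @a[p], @a[i] = @a[i], @a[p]
      i = p
    end
    self
  end

  def pop
    top = @a.first
    last = @a.pop
    unless @a.empty?
      @a[0] = last
      i = 0
      loop do
        l, r, s = 2 * i + 1, 2 * i + 2, i
        s = l if l < @a.size && @a[l] < @a[s]
        s = r if r < @a.size && @a[r] < @a[s]
        break if s == i
        @a[s], @a[i] = @a[i], @a[s]
        i = s
      end
    end
    top
  end
end

# Keep only the K best scores seen so far; evict the smallest when full.
def top_k(scores, k)
  heap = MinHeap.new
  scores.each do |s|
    if heap.size < k
      heap.push(s)
    elsif s > heap.min
      heap.pop
      heap.push(s)
    end
  end
  out = []
  out.unshift(heap.pop) while heap.size > 0
  out # largest first
end
```

The heap never holds more than K elements, so each of the N scores costs at most O(log K) work, which is the commenter's point.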
This is what happens, actually: I perform a binary search to obtain the suffixes matching the query term in O(log N) time. At that point you have M hits (each one, for the time being, just a 32-bit integer giving the offset within the full-text store) which have to be bucketed into Q (Q << M) documents. Documents are scored based on the number of hits (with different weights for each field; e.g. a match in the URI is more important). So for each hit (= suffix, i.e. an offset), I do
score[doc_id_for_offset(offset)] += weight[field(offset)] / field_length(offset)
and then at last the documents are sorted based on their scores.
So in addition to sorting the matches, I have to compute their scores; in practice, this takes (much) longer than the actual sort, since Q << M normally holds. If the last stage ever becomes the bottleneck, I'll either use a heap as you suggest or an O(n) sort (bucket or radix).
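The two stages described in the reply above can be sketched with a throwaway corpus (all names, the document layout, and the flat per-hit weight are hypothetical stand-ins, not FTSearch's real structures):

```ruby
# Toy sketch: binary search over a sorted suffix array, then per-hit scoring.

TEXT = "sprintf foo sprintf bar sprint"
# Suffix array: every offset, sorted by the suffix starting there.
SUFFIXES = (0...TEXT.size).sort_by { |i| TEXT[i..] }

# Two O(log N) binary searches delimit the contiguous run of suffixes
# that start with the query term.
def match_range(term)
  lo = (0...SUFFIXES.size).bsearch { |i| TEXT[SUFFIXES[i]..] >= term }
  return (0...0) unless lo
  hi = (lo...SUFFIXES.size).bsearch { |i| TEXT[SUFFIXES[i], term.size] != term }
  (lo...(hi || SUFFIXES.size))
end

hits = match_range("sprintf").map { |i| SUFFIXES[i] }

# Bucket each hit into its document, as in the formula above (field
# weights and field lengths omitted: every hit counts 1.0 here).
DOC_SPAN = 16 # pretend documents are 16-byte slices of the store
scores = Hash.new(0.0)
hits.each { |offset| scores[offset / DOC_SPAN] += 1.0 }
ranked = scores.sort_by { |_doc, score| -score }
```

Since all suffixes starting with the term sit next to each other in the sorted array, finding them costs O(log N) regardless of the corpus size; only the scoring loop grows with the number of hits.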
This sounds really exciting. I must admit, for the project I'm working on that requires full text search, I'm not *that* fussed about performance, particularly indexing performance. What I *really* need is stability, and it sounds like FTSearch is going to fit the bill there, where Ferret doesn't...
Today I stumbled upon Xapian ( http://www.xapian.org/ ). They also seem to have Ruby bindings. Maybe it's interesting? I don't know myself, because I've never investigated text indexing.
I'll have a look. This part of the feature list caught my attention:
Ranked probabilistic search - important words get more weight than unimportant words, so the most relevant documents are more likely to come near the top of the results list.
Probabilistic, hmm. Looks like they have the same problem (ranking docs is expensive) I had with FTSearch.
The performance is there, but you need access to the document at query time. This means that you either copy all the data to a place where you have it handy (as in the prototype) or you need to load the documents when you need them during the binary search. If you can live with the copying (so the storage requirement is more than twice the original data size), or you have all your documents at your fingertips, this is one of the best approaches. Otherwise, use inverted indexes.