02 Jully 2009 at 13:22 Hash tables: separate chaining vs. double hashing

In my earlier finite map benchmarks which compared several hash tables and functional maps (AVL trees and ternary trees), separate chaining (with $$\alpha = 0.47$$) beat double hashing ($$\alpha = 0.42$$) at unsuccessful searches (1% hit rate), but was slower at successful ones.

Theory predicts the following average number of probes for unsuccessful and successful lookups (expressions below): Unsuccessful searches, same load factor Successful searches, same load factor

Separate chaining looks much better here, but the graphs are misleading, because we don't actually care as much about the load factor as about memory usage, so we want to compare the collision resolution schemes when the latter is the same.

If we use a linked list for the collision chain, this represents one word of overhead per item compared to double hashing. For instance, if the load factor with separate chaining is 1, we can afford to have a table twice as big with double hashing for the same amount of memory and a load factor of 50%. In other words, $$N + n = N'$$ where $$N$$ and $$N'$$ are the sizes of the tables for separate chaining and double hashing, and $$n$$ the number of elements. The respective load factors are

\[\begin{eqnarray*} \alpha & = & \frac{n}{N}\\ \alpha' & = & \frac{n}{N'}\\ & = & \frac{n}{N+n}\\ \alpha' & = & \frac{\alpha}{1+\alpha}\end{eqnarray*} \]

The expressions for the average number of probes for unsuccessful and successful searches are (Knuth, TAOCP Vol 3, Section 6.4), for separate chaining

Read more...

02 Jully 2009 at 10:21 Math typesetting with jsMath

I've added math typesetting support to Ocsiblog, which runs this site, using jsMath: (double-click on the equations to see the sources)

\[\begin{eqnarray*} \nabla.D & = & \rho\\ \nabla.B & = & 0\\ \nabla\times E & = & -\frac{\partial B}{\partial t}\\ \nabla\times H & = & j+\frac{\partial D}{\partial t}\end{eqnarray*} \]

When jsMath is enabled, you'll see in the bottom right-hand corner a button that triggers a control panel allowing you to set several options such as the scale or the rendering mechanism (TeX fonts, image fonts or native Unicode fonts; you can pick the one that works best for you and save the configuration --- the best one is TeX fonts, if you have them, but Unicode fonts is quite decent). For best viewing, you can get the TeX fonts here (the page includes detailed install instructions for Windows, OSX and Un*x).

Here's what the above equations should look like with TeX and native Unicode fonts: TeX Unicode

19 June 2009 at 10:16 Finite maps galore: imperative code strikes back

Matías Giovannini has been implementing purely functional data structures based on Bentley and Sedgewick's ternary search trees (TST) and pitching them against OCaml's imperative hash table (Hashtbl). In the interval between the initial results and his showing his first version (called Trie_map from now on), I made a functional TST ("Ternary") which compares favorably to both Trie_map and the revised implementation ("Trie_map_mod").

I also resuscitated a hash table implementation of mine that uses open addressing with double hashing (in contrast to Hashtbl's external chaining) and exhibits vastly improved performance.

I'll show the benchmark results first, to be followed by some code and analysis.

Benchmarks

I benchmark the following finite map implementations:

  • imperative:

    • Fasthashtbl: hash table with open addressing and double hashing

    • Hashtbl: the hash table from INRIA's stdlib (external chaining)

    • Hashtbl_mod: Hashtbl with more aggressive resizing (lower load factor)

    • Hashtbl_hval: Hashtbl_mod with caching of the hash value

  • functional:

    • Trie_map: TST with coalesced constructor for nodes with and without values

    • Trie_map_mod: TST with different constructors for leaves and inner nodes with/without a value

    • Ternary: TST with separate constructor for nodes with and without values but no leaf constructor

    • Map: the Map implementation from INRIA's stdlib (AVL tree)

All the finite maps are compared by

  1. building a map with 98568 English words, which are added in (a) randomized and (b) lexicographic order

  2. measuring the minimal lookup cost with constant lookups (so that everything is in the L1 cache)

  3. timing successful lookups by looking for the 98568 words

    1. in the same (random) order the elements were added

    2. in randomized order

    3. in lexicographic order

    4. in reverse lexicographic order

    ... against the maps built in randomized and lexicographic order

  4. timing searches against the previously generated maps (randomized and lexicographic order) using

    1. a larger set of English words (217625, 45.3% hit rate)

    2. a set of Spanish words (86016 words, 1.3% hit rate)

    ... in randomized and lexicographic order

For the sake of exhaustiveness, I measure the iteration and functor overhead and detract it from the lookup timings. The total size of the structures (including the size of the strings) is also measured. All the measurements are repeated 20 times, and the best time is retained. The benchmarks are performed in a single program execution to achieve steady state (major heap resized to required size). The set of words used in the map is large enough to ensure the structure (taking 5.7MB to over 12MB) doesn't fit in the 2MB L2 cache.

The full results can be found here (also includes hash table benchmark with integer keys).

This table shows the results for a map built in randomized order (refer to the full output for those in lexicographic order). Speed is measured in ops/sec, and size in bytes ("Find (hit, orig)" means that the lookups are performed in the same random order as the insertions).

Insertion (randomized order) and lookup

Data structure Size Load factor Insertion (rand) Find (constant) Find (hit, rand) Find (hit, orig) Find (hit, sorted) Find (45.3%, rand) Find (1.3%, rand) Find (1.3% sorted)
FastHashtbl 7325544 0.376 1509558 16701222 2669067 3554728 3958575 2347248 3243197 5133702
Hashtbl 5752664 1105131 5617227 2136562 2296294 2791035 1623434 1731465 2229713
Hashtbl_mod 7325528 0.376 639994 7373058 2711193 2963744 3893034 2249186 2926475 4403349
Hashtbl_hval 8114072 0.376 675892 10457458 2769387 3056772 4153404 2435018 3436621 5542555
Ternary 11894440 179599 14739138 1484213 1556592 4576735 1567019 2419841 6466320
Trie_map 99834 11870073 985476 1024257 3022102 1085412 1797166 5763597
Trie_map_mod 12417768 164725 8476101 1284442 1340733 3246878 1417483 2151898 4811496
Map 6805440 167046 1168963 674209 671368 1026645 556788 633682 986727

Implementation notes

The source code for all the containers and the benchmarks can be obtained here.

I modified Hashtbl first by changing its resizing policy so that the load factor is the same as in Fasthashtbl (Hashtbl_mod), and second by caching the hash value in the nodes (Hashtbl_hval), which makes successful lookups faster when there are collisions (since the key comparison can be bypassed when the hash values differ) and helps a lot with unsuccessful ones.

Fasthashtbl, Hashtbl_mod and Hashtbl_hval only allow the load factors to grow to 50% before resizing the underlying array to twice its former size; the average load factor is thus 37.5%. Hashtbl, on the other hand, allows it to grow to 2.0, which, theory predicts, would require 13.4 probes per unsuccessful search, and 4.6 per successful one (Knuth, TAOCP Vol 3, Section 6.4, page 524). For a 37.5% load factor, Hashtbl, Hashtbl_mod and Hashtbl_hval will require 1.1 and 1.2, and Fasthashtbl (which uses open addressing with double hashing) will need 1.6 and 1.25 --- double hashing is about as good for successful lookups but worse at unsuccessful ones.

Trie_map_mod and Ternary are very similar: the only differences are that the latter doesn't have a separate Leaf node type, and its code has undergone a few manual optimizations explained below.

Some observations

Functional vs. imperative

Read more...

29 May 2009 at 15:16 Making mutable state faster: optimizing the caml_modify write barrier

As I explained yesterday, the caml_modify write barrier needed by OCaml's incremental + generational GC is the main reason why immutable data can often be faster than mutable state. I've played a little with it and achieved a >50% speed boost in synthetic benchmarks, which could map to up to two-digit percentage gains in actual programs (in particular, it can make Array.map, init, make and the likes over 20% faster). The mutable state part from the synthetic business entity update benchmark becomes 32% faster.

How the write barrier works

caml_modify needs to record the references from values in the major heap to the minor heap. The code is straightforward:

#define Modify(fp, val) do{                                                 \
  value _old_ = *(fp);                                                      \
  *(fp) = (val);                                                            \
  if (Is_in_heap (fp)){                                                     \
    if (caml_gc_phase == Phase_mark) caml_darken (_old_, NULL);             \
    if (Is_block (val) && Is_young (val)                                    \
        && ! (Is_block (_old_) && Is_young (_old_))){                       \
      if (caml_ref_table.ptr >= caml_ref_table.limit){                      \
        CAMLassert (caml_ref_table.ptr == caml_ref_table.limit);            \
        caml_realloc_ref_table (&caml_ref_table);                           \
      }                                                                     \
      *caml_ref_table.ptr++ = (fp);                                         \
    }                                                                       \
  }                                                                         \
}while(0)

fp is the block slot where the (suspected) pointer is to be stored. caml_modify first checks if the destination is in the major heap with the Is_in_heap macro I'll come back to in a minute. Is_block(v) is true when v is not an immediate value. There's an interesting piece of logic that tries to avoid recording the same reference twice: if the previous reference _old_ held in fp was to a block in the minor heap, we know it's been already added to the ref table and there's no need to do it again. In the worst case, when updating N times the same slot with values in the minor and the major heap alternatingly, the reference will be recorded N/2 times --- it seems very unlikely this would happen in practice, though, as the ref table is cleared on each minor GC run.

The main problem with the above routine is that Is_in_heap is fairly slow. On 64-bit architectures it involves a lookup in a sparse hash table to see if the corresponding page is being managed by OCaml (on 32 bit, a two-level array is used). The reason why this check cannot just be implemented as a negated "is it in the minor heap" test (two pointer comparisons at most) is that it is possible to have blocks which belong neither to the minor heap nor to the major one --- iow., memory areas not managed by OCaml at all, and which are not to be traversed by the GC.

Speeding it up

When we're not in the Phase_mark GC phase, the order of the checks can be altered so the expensive Is_in_heap test is only performed when the value to be stored resides in the minor heap and the old one didn't before. This way, when the same slot is being updated several times, Is_in_heap will only be used once (with the exception explained above).

The resulting code, albeit quite ugly because of the code duplication, is considerably faster. I measure the gain with two trivial programs:

Read more...

28 May 2009 at 10:49 When immutable data is faster: write barrier costs

This example shows that in some cases immutable data can be faster than mutable state. One reason is that language implementations with incremental, generational or concurrent GCs (a few combine all three properties at once, like HotSpot's; OCaml's is only generational and incremental) typically use write barriers which add overhead to each heap pointer store (read barriers are normally more expensive). In particular, generational GCs use write barriers to maintain the remembered set: a list of references from an older generation to a younger one.

The synthetic benchmark operates on "simple records intended to emulate typical business entities":

module P0 = struct
  type person = { name: string; age: int; balance: float }
end

module P1 = struct
  type person = { name: string; mutable age: int; mutable balance: float }
end

The balance field is updated repeatedly, by creating a new record in the immutable case, and by changing the field directly in the mutable one. floats are generally boxed in OCaml (the main exceptions are float arrays and records holding only floats), so a person record takes 6 words on x86-64 (7 on x86, since the double value takes 8 bytes). In the first case, each iteration allocates 48(28) bytes; in the second one, only 16(12) are allocated, but the field update incurs into the write barrier overhead: a call to the caml_modify routine, that checks whether a pointer to a value in the younger generation is being stored in a block residing in the old (major) heap. As it turns out, the extra allocation is faster than the write barrier:

Immutable record: 138644594.442718 updates/s
Mutable record:   79740984.161326 updates/s

The example can be modified to illustrate clearly the effect of the write barrier, and when the latter is needed at all.

If the balance field is changed to an int, the stored value is known not to reside in the minor heap (integers, bools, chars and constant constructors are immediate), so the write barrier is not needed, and no call to caml_modify will be generated. Moreover, there will be no allocation at all in the mutable case, since there are no boxed floats involved anymore, so we can compare the cost of allocating a 4-word record to that of updating a single field:

Immutable record: 277762346.536304 updates/s
Mutable record:   499975001.249938 updates/s

Similarly, the balance field can be "boxed manually", so to speak, to avoid allocations in the mutable case, by making it of type

type ufloat = { mutable value : float }

This takes exactly as much space as a regular boxed float would, but allows to update the value without any allocation in the mutable case:

Immutable record: 147050173.519205 updates/s
Mutable record:   277762346.536304 updates/s

Finally, if all other fields are removed, we are left with

Read more...