This example shows that in some cases immutable data can be faster than mutable state. One reason is that language implementations with incremental, generational or concurrent GCs (a few combine all three properties at once, like HotSpot's; OCaml's is only generational and incremental) typically use write barriers which add overhead to each heap pointer store (read barriers are normally more expensive). In particular, generational GCs use write barriers to maintain the remembered set: a list of references from an older generation to a younger one.
The synthetic benchmark operates on "simple records intended to emulate typical business entities":
module P0 = struct
type person = { name: string; age: int; balance: float }
end
module P1 = struct
type person = { name: string; mutable age: int; mutable balance: float }
end
The balance field is updated repeatedly, by creating a new record in the immutable case, and by changing the field directly in the mutable one. floats are generally boxed in OCaml (the main exceptions are float arrays and records holding only floats), so a person record takes 6 words on x86-64 (7 on x86, since the double value takes 8 bytes). In the first case, each iteration allocates 48(28) bytes; in the second one, only 16(12) are allocated, but the field update incurs into the write barrier overhead: a call to the caml_modify routine, that checks whether a pointer to a value in the younger generation is being stored in a block residing in the old (major) heap. As it turns out, the extra allocation is faster than the write barrier:
Immutable record: 138644594.442718 updates/s Mutable record: 79740984.161326 updates/s
The example can be modified to illustrate clearly the effect of the write barrier, and when the latter is needed at all.
If the balance field is changed to an int, the stored value is known not to reside in the minor heap (integers, bools, chars and constant constructors are immediate), so the write barrier is not needed, and no call to caml_modify will be generated. Moreover, there will be no allocation at all in the mutable case, since there are no boxed floats involved anymore, so we can compare the cost of allocating a 4-word record to that of updating a single field:
Immutable record: 277762346.536304 updates/s Mutable record: 499975001.249938 updates/s
Similarly, the balance field can be "boxed manually", so to speak, to avoid allocations in the mutable case, by making it of type
type ufloat = { mutable value : float }
This takes exactly as much space as a regular boxed float would, but allows to update the value without any allocation in the mutable case:
Immutable record: 147050173.519205 updates/s Mutable record: 277762346.536304 updates/s
Finally, if all other fields are removed, we are left with
module P0 = struct
type person = { balance: float }
end
module P1 = struct
type person = { mutable balance: float }
end
Since the records only hold values of type float, they will be unboxed, so the benchmark now measures the relative costs of allocating 16(12) bytes and updating a double value in memory:
Immutable record: 224603010.554612 updates/s Mutable record: 229258325.999858 updates/s
Allocation is fast enough to make the difference insignificant compared to the cost of the float operations (balance +. 1.0) and the loop overhead. This is the code generated on x86-64 for the mutable case:
.L110:
movlpd .L112(%rip), %xmm0
addsd (%rax), %xmm0
movlpd %xmm0, (%rax)
movq %rdi, %rsi
addq $2, %rdi
cmpq %rbx, %rsi
jne .L110
corresponding to
for i = 1 to iters do p.balance <- p.balance +. 1. done
It could clearly benefit from loop unrolling plus code hoisting and avoid memory traffic by keeping the intermediate result in the %xmm0 register, but at least doesn't involve any allocation and doesn't incur into the caml_modify overhead.
Comments
Excellent analysis, Mauricio. What I find surprising is that such simple code "violates" so easily the expected usage pattern of the generational GC: an old object shouldn't point to the young generation.
Would a float ref substitute for the single mutable float record?
How are you timing this? It seems kinda sensitive, and I have no idea how you'd instrument this.
Matías: the float in
float refis not unboxed; iow. you're left with the boxed float instead the ref's own box.tyler: the original code just evaluates several million times a function that does
so the loop, the float allocation and the actual operation add overhead. It's possible to measure the cost of
caml_modifymore directly by performing many updates per iteration and avoiding any computations/allocation. I show this in the followup to this post.