A common misconception among non-Haskellers is that Haskell, when compiled, pays
an ongoing penalty for supporting laziness by default. The idea is that in a
lazy language, every expression suspends to a heap-allocated, garbage-collected
thunk, even if the variable is used immediately.
That sounds scarily expensive.
In the real world, however, Haskell is as often as not a strict language.
Compiled Haskell undergoes strictness analysis, which separates the code that
genuinely needs to be lazy from the code that can simply be run strictly. The end
result is purely functional code, written in a lazy style, that gets compiled
down to strict assembly, with no thunks, heap allocation, or GC getting in the
way of all-out speed.
Like type inference, this means we usually don't have to manually annotate code
with its strictness (or laziness) properties; instead, the compiler figures it out
from how we use the values.
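As a small illustration (my own sketch, not code from the libraries below): the
accumulator in this left fold looks lazy, but since it's always demanded at the
end, GHC's strictness analysis will unbox it to a raw machine integer under -O,
with no annotations needed:

```haskell
-- A classic case strictness analysis handles: `acc` is never inspected
-- until the loop finishes, so naively each step would build a thunk
-- chain (acc + i) + (i + 1) + ... With -O, GHC sees acc and i are
-- always demanded and compiles this to a loop over unboxed Int#.
sumTo :: Int -> Int
sumTo n = go 0 1
  where
    go acc i
      | i > n     = acc
      | otherwise = go (acc + i) (i + 1)

main :: IO ()
main = print (sumTo 1000000)    -- prints 500000500000
```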
Here's a nice example, from a new arrays library, based on the larger
data parallel arrays project.
We'll write some code to generate 3 arrays, add them together, then bit shift
all the elements, and finally, take the sum.
In a language like C, this would mean first allocating a bunch of arrays, writing
loops to fill them, then separate loops for the rest of the pipeline.
If we were looking for speed, we'd perhaps try to manually fuse some of the loop bodies together
to avoid extra traversals, deforesting the intermediate data structures.
In Haskell, though, we tend to build up loops by composing separate combinators, as if building
up a pipeline in the shell. For list operations, these then get fused away
(using build/foldr fusion), meaning only a single traversal need be done.
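For instance (a sketch of my own): under -O, this map/filter/sum pipeline over
lists fuses into one loop, with no intermediate lists allocated:

```haskell
-- sum, map and filter are all fusible list combinators, so GHC -O
-- compiles this to a single loop over [1 .. n], never materialising
-- the intermediate lists produced by filter and map.
pipeline :: Int -> Int
pipeline n = sum (map (* 2) (filter even [1 .. n]))

main :: IO ()
main = print (pipeline 10)   -- evens 2+4+6+8+10 = 30, doubled: 60
```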
Similarly, for array programming, we can use stream fusion to
eliminate traversals over arrays, making array loops written in a compositional
style as efficient as manually fused loops -- assuming all the mystical
laziness doesn't get in the way.
The data parallel and
uvector array libraries
are implemented using stream fusion for strict arrays, so we can try this out.
Here's an encoding of this take-the-sum-of-shifts-of-sums-of-arrays pipeline, compositionally:
import Data.Array.Vector
import Data.Bits

main = print . sumU
             . mapU ((2::Int) `shift`)
             . zipWith3U plus3
                 (replicateU 100000000 (2::Int))
                 (replicateU 100000001 (3::Int))
             $ (replicateU 100000002 (5::Int))
  where
    plus3 x y z = x + y + z
The nice aspect of this is the reduction in cognitive load: you build the loop
up piecewise, and let the compiler do the (aggressive) thinking for you. As an
exercise, try writing a C program that allocates these 3 arrays, and then
uses a single loop to initialise them, add their elements, bit shift them, and
sum the result.
And how well does it work? Our program above undergoes aggressive inlining (as
it's notionally pure and lazy) by GHC, and as we've used combinators, rather
than manual loops, GHC can see how to easily combine them, yielding this single
loop at the end:
fold :: Int# -> Int# -> Int# -> Int# -> Int#
fold x y z acc =
    case x of
      100000000 -> acc
      _ -> case y of
             100000001 -> acc
             _ -> case z of
                    100000002 -> acc
                    _ -> fold (x +# 1)
                              (y +# 1)
                              (z +# 1)
                              (acc +# 2048)
Look at that constant folding! The compiler saw the arrays were initialised with
all elements 2, 3 and 5, which were then added and shifted. The compiler spotted this
and realised we're just describing a loop that adds 2 << 10, that is, 2048, at each step.
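You can check the constant GHC folded in by hand:

```haskell
import Data.Bits (shift)

-- Each element of the zipped array is 2 + 3 + 5 = 10, and the map
-- stage computes 2 `shift` 10, i.e. 2 * 2^10 = 2048 -- the constant
-- that shows up in the loop body above.
main :: IO ()
main = print ((2 :: Int) `shift` (2 + 3 + 5))   -- prints 2048
```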
The other thing it noticed was that the values were always going to be demanded. There's
no laziness in this program, so GHC removed all the lazy Int allocations and replaced them
with unboxed Int# values -- raw machine integers living in registers.
There are no `let`s left in this code -- so *nothing* is going to end up on the
heap, and the garbage collector isn't going to get a chance to run!
With strictness analysis and aggressive loop fusion, we end up with the code
we'd write in a low level, non-GC'd language like C -- we just got to write it
in a very declarative, descriptive way.
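For comparison, here's roughly what that derived loop looks like written by hand
in source Haskell, with bang patterns standing in for the strictness GHC
inferred (my own sketch, using boxed Int where the Core uses raw Int#):

```haskell
{-# LANGUAGE BangPatterns #-}

-- The same loop, hand-written: three strict counters, each with its
-- own end point, and one strict accumulator bumped by 2048 per step.
fold :: Int -> Int -> Int -> Int -> Int
fold !x !y !z !acc
  | x == 100000000 = acc
  | y == 100000001 = acc
  | z == 100000002 = acc
  | otherwise      = fold (x + 1) (y + 1) (z + 1) (acc + 2048)

main :: IO ()
main = print (fold 0 0 0 0)   -- prints 204800000000
```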
This tight loop gets passed off to the code generator, yielding the final
assembly:
fold:
cmpq $100000000, %r8
je .L11
.L5:
cmpq $100000001, %rdi
je .L11
.L8:
cmpq $100000002, %rsi
je .L11
.L10:
leaq 1(%rsi), %rsi
leaq 1(%rdi), %rdi
leaq 1(%r8), %r8
addq $2048, %r9
jmp fold
.L11:
movq %r9, %rbx
jmp *(%rbp)
Which looks pretty good: just a tight loop, no allocation, everything in registers.
$ time ./zipwith3
204800000000
./zipwith3 0.20s user 0.00s system 95% cpu 0.209 total
It fairly zips along.
This kind of very aggressive optimisation only makes sense when the compiler has already
type checked your program, knows the loops are referentially transparent, and can work
out what algorithm each loop is actually doing. And GHC does this by default -- which is pretty sweet.
Strictness analysis and loop fusion should make the new array libraries a competitive solution.