In a recent mailing
list thread Andrew Coppin complained of poor performance
with "nice, declarative" code for computing the mean of a very large
list of double precision floating point values:
import System.Environment
import Text.Printf
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)
main = do
[d] <- map read `fmap` getArgs
printf "%f\n" (mean [1 .. d])
Which, when executed, churns away for a while, but eventually runs out
of memory:
$ ghc -O2 --make A.hs
$ time ./A 1e9
A: out of memory (requested 1048576 bytes)
./A 1e9 12.74s user 1.20s system 99% cpu 13.997 total
Well, it got 13 seconds into the job doing something. At least it was
making progress.
While Andrew was citing this as an example of GHC Haskell being
unpredictable (it's not an example of that, as we'll see), there is
something here -- this little program reveals a lot about the
interaction between strictness, laziness and approaches to compiler
optimisation in Haskell.
This post is about writing reliably fast code, and then how to write
more naive code that is also reliably fast. In particular, how recursion
consistently produces excellent code, competitive with heavily
optimised C, and how to get similar results from higher order functions.
Throughout this post I'll be using GHC 6.8.2, with -O2. Sometimes I'll
switch to the C backend (-fvia-C -optc-O2). Also, I don't care about
smarter ways to implement this -- we're simply interested in
understanding the compiler transformations involved in the actual
loops.
Understanding strictness and laziness
First, let's work out why this ran out of memory. The obvious
hard constraint is that we request a list of 1e9 doubles: that is very,
very large. At 8 bytes per Double, that's 8 * 10^9 bytes if allocated
directly, or about 7.5G.
Inside a list there is yet further overhead, for the list nodes, and
their pointers. So no matter the language, we simply cannot allocate
this all in one go without burning gigabytes of memory. The
only approach that will work is to lazily generate the list
somehow. To confirm this, we can use a strict list data type, just to
see how far we get.
First, let's define a strict list data type. That is one whose head and
tail are always fully evaluated. In Haskell, strictness is declared by
putting bangs on the fields of the data structure, though it can also be
inferred based on how values are used:
data List a = Empty
| Cons !a !(List a)
This is a new data type, so we'll need some new functions on
it:
length' :: List a -> Int
length' = go 0
where
go n Empty = n
go n (Cons _ xs) = go (n+1) xs
sum' :: Num a => List a -> a
sum' xs = go 0 xs
where
go n Empty = n
go n (Cons x xs) = go (n+x) xs
enumFromTo' :: (Num a, Ord a) => a -> a -> List a
enumFromTo' x y | x > y = Empty
| otherwise = go x
where
go x = Cons x (if x == y then Empty else go (x+1))
I've written these in a direct, worker/wrapper-transformed recursive
style. It's a simple functional idiom that's guaranteed to get the job done.
So now we can compute the length and sum of these things, and generate
one given a range. Our 'mean' function stays much the same, just the
type changes from a lazy to strict list:
mean :: List Double -> Double
mean xs = sum' xs / fromIntegral (length' xs)
Testing this, we see it works fine for small examples:
$ time ./B 1000
500.5
./B 1000 0.00s user 0.00s system 0% cpu 0.004 total
But with bigger examples, it quickly exhausts memory:
$ time ./B 1e9
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize' to increase it.
./B 1e9 0.02s user 0.02s system 79% cpu 0.050 total
To see that we can't even get as far as summing the list,
we'll replace the function body with undefined, and just ask
for the list argument to be evaluated before the body, using
a bang pattern (a strictness hint to the compiler, much as
a type annotation suggests the desired type):
mean :: List Double -> Double
mean !xs = undefined
main = do
[d] <- map read `fmap` getArgs
printf "%f\n" (mean (enumFromTo' 1 d))
And still, there's no hope to allocate that thing:
$ time ./A 1e9
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize' to increase it.
./A 1e9 0.01s user 0.04s system 93% cpu 0.054 total
Strictness is not the solution here.
Traversing structures and garbage collection
So why is the naive version actually allocating the whole list anyway?
It was a lazy list; shouldn't it somehow avoid allocating the elements?
To understand why it didn't work, we can look at what:
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)
actually means. To reason about the code, one way is to just look at the
final, optimised Haskell GHC produces, prior to translation to
imperative code. This reduced Haskell is known as "core", and is the end
result of GHC's optimisations-by-transformation process, which
iteratively rewrites the original source into more and more optimised
versions.
Referring to the core is useful if you're looking for C-like performance,
as you can state precisely, to the register level, what your program
will do -- there are no optimisations to perform that would confuse the
interpretation. It is an entirely unambiguous form of Haskell, so we'll
use it to clarify any uncertainties. (For everyday programming, a
comfortable knowledge of laziness and strictness is entirely adequate,
so don't panic if you don't read core!).
To view the core, we can use ghc -O2 -ddump-simpl, or the ghc-core tool.
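For example (ghc-core is a small wrapper around GHC, available from Hackage):

$ ghc -O2 -ddump-simpl A.hs

Looking at the core for the naive program, we see that 'mean' is translated to: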
$wlen :: [Double] -> Int# -> Int#
...
$wsum'1 :: [Double] -> Double# -> Double#
...
main = ...
shows (
    let xs :: [Double]
        xs = enumFromTo1 lvl3 d_a6h
    in
    case $wsum'1 xs 0.0 of ww_s19a { __DEFAULT ->
    case $wlen xs 0 of z { __DEFAULT ->
    case ww_s19a /## (int2Double# z) of i { __DEFAULT ->
    D# i
    }}})
It looks like a lot of funky pseudo-Haskell, but it's made up
of very simple primitives with a precise meaning. First, length and sum
are specialised -- their polymorphic components replaced with concrete
types, and in this case, those types are simple, atomic ones, so Double
and Int are replaced with unlifted (or unboxed) values. That's good to
know -- unlifted values can be kept in registers.
We can also confirm that the input list is allocated lazily.
let xs :: [Double]
xs = enumFromTo1 lvl3 d_a6h
in ..
The 'let' here is the lazy allocation primitive. If you see it on a
non-unboxed type, you can be sure that thing will be lazily allocated
on the heap. Remember: core is like Haskell, but there's only 'let' and
'case' (one for laziness, one for evaluation).
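As a schematic illustration (this is a sketch, not actual compiler output):

-- 'let' allocates: xs becomes a thunk on the heap; nothing is evaluated yet
let xs = enumFromTo 1 n in f xs

-- 'case' evaluates: the scrutinee is reduced to its outermost constructor
case xs of
    []     -> ...
    y : ys -> ...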
So that's good. It means the list itself isn't costing us anything to
write. This lazy allocation also explains why we're able to make some
progress before running out of memory resources -- the list is only
allocated as we traverse it.
However, all is not perfect, as the next lines reveal:
case $wsum'1 xs 0.0 of ww_s19a { __DEFAULT ->
case $wlen xs 0 of z { __DEFAULT ->
case ww_s19a /## (int2Double# z) of i { __DEFAULT ->
D# i
}}}
Now, 'case' is the evaluation primitive. It forces evaluation to the
outermost constructor of the scrutinee -- the expression we're casing
on -- right where you see it.
In this way an ordering of operations is set up by the compiler: in
our naive program, first the sum of the list is computed, then its
length, and finally we divide the sum by the length.
That sounds fine, until we think about what happened to our lazily
allocated list. The initial 'let' does no work, when allocating.
However, the first 'sum' will traverse the entire list, computing the
sum of its elements. To do this it must evaluate the entire huge list.
Now, on its own, that is still OK -- the garbage collector will
race along behind 'sum', deallocating list nodes that aren't needed
anymore. However, there is a fatal flaw in this case. 'length' still
refers to the head of the list. So the garbage collector cannot release
the list: it is forced to keep the whole thing alive.
'sum', then, will allocate the huge list, and it won't be collected
till after 'length' has started consuming it -- which is too late.
And this is exactly what we observed. The original poster's
'unpredictability' is the entirely predictable result of traversing a
7G lazy list twice.
Note that this would be even worse if we'd been in a strict language:
the initial 'let' would have forced evaluation, so you'd get nowhere at
all.
Now we have enough information to solve the problem: make sure we
traverse the huge list only once, so we don't need to hang onto it.
Lazy lists are OK
The simple thing to do, then -- and the standard approach for the last
40 years of functional programming -- is to write a loop that computes
both the sum and the length at once:
mean :: [Double] -> Double
mean = go 0 0
where
go :: Double -> Int -> [Double] -> Double
go s l [] = s / fromIntegral l
go s l (x:xs) = go (s+x) (l+1) xs
We just store the length and sum in the function parameters, dividing
for the mean once we reach the end of the list. In this way we make only
one pass. Simple, straightforward. It should be a standard tool in
the Haskell hacker's kit.
By writing it in an obvious recursive style that walks the list as it is
created, we can be sure to run in constant space. Our code compiles down to:
$wgo :: Double# -> Int# -> [Double] -> Double#
$wgo x y z =
case z of
[] -> /## x (int2Double# y);
x_a9X : xs_a9Y ->
case x_a9X of
D# y_aED -> $wgo (+## x y_aED) (+# y 1) xs_a9Y
We see that it inspects the list, determining if it is an empty list, or
one with elements. If empty, return the division result. If non-empty,
inspect the head of the list, taking the raw double value out of its
box, then loop. On a small list:
$ time ./B 1000
500.5
./B 1000 0.00s user 0.00s system 0% cpu 0.004 total
On the 7 gigabyte list:
$ time ./B 1e9
500000000.067109
./B 1e9 68.92s user 0.19s system 99% cpu 1:09.57 total
Hooray, we've computed the result! The lazy list got us home.
But can we do better? Looking at the GC and memory statistics, we can
see how much work was going on.
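We run the program with +RTS -sstderr; the interesting lines of its
output (abridged here) are:

$ ./B 1e9 +RTS -sstderr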
24,576 bytes maximum residency (1 sample(s))
%GC time 1.4% (4.0% elapsed)
Alloc rate 2,474,068,932 bytes per MUT second
Productivity 98.6% of total user
What does this tell us? Firstly, garbage collection made up
a tiny fraction of the total time (1.4%), meaning the program was
doing productive work 98.6% of the time -- that's very good.
Also, we can see it ran in constant space: 24k bytes maximum allocated
in the heap. And the program was writing and releasing list nodes
at an alarming rate: 2G a second (stunning, I know). Good thing
allocation is cheap.
However, this does suggest that we can do better -- much better -- by
avoiding list node allocation entirely and keeping off the bus. If we
keep only the current value in a register and stay out of memory
altogether, we should see dramatic improvements.
Recursion kicks arse
So let's rewrite the loop to no longer take a list, but instead,
the start and end values of the loop as arguments:
mean :: Double -> Double -> Double
mean n m = go 0 0 n
where
go :: Double -> Int -> Double -> Double
go s l x | x > m = s / fromIntegral l
| otherwise = go (s+x) (l+1) (x+1)
main = do
[d] <- map read `fmap` getArgs
printf "%f\n" (mean 1 d)
We've moved the list's lower and upper bounds into arguments to the
function. Now there are no lists whatsoever. This is a fusion
transformation: instead of two separate loops, one generating the
list and one consuming it, we have a single loop with all
operations interleaved. This should avoid the penalty of the list
memory traffic. We've also annotated the types of the functions, so there can
be no ambiguity about what machine types are used.
Looking at the core:
$wgo_s17J :: Double# -> Int# -> Double# -> Double#
$wgo_s17J x y z =
case >## z m of
False ->
$wgo_s17J
(+## x z)
(+# y 1)
(+## z 1.0);
True -> /## x (int2Double# y)
it is remarkably simple: all parameters are unboxed, the loop is tail
recursive, we can expect zero garbage collection, and all loop
parameters can live in registers, so we'd expect excellent performance.
The >## and +## are GHC primitives for double comparison and addition.
+# is normal int addition.
Let's compile it with GHC's native code generator first (-O2
-fexcess-precision):
$ time ./C 1e9
500000000.067109
./C 1e9 3.75s user 0.00s system 99% cpu 3.764 total
Great. Very good performance! The memory statistics tell a similar rosy story:
20,480 bytes maximum residency (1 sample(s))
%GC time 0.0% (0.0% elapsed)
Alloc rate 10,975 bytes per MUT second
Productivity 100.0% of total user
So it ran in constant space, did no garbage collection, and was
allocating only a few bytes a second. Recursion is the breakfast of
champions.
And if we use the C backend to GHC: (-O2 -fexcess-precision -fvia-C
-optc-O2), things get seriously fast:
$ time ./C 1e9
500000000.067109
./C 1e9 1.77s user 0.00s system 99% cpu 1.776 total
Wow, much faster. GCC was able to really optimise the loop
GHC produced. Good job.
Let's see if we can get this kind of performance in C itself. We
can translate the recursive loop directly into a for loop, where
the function arguments become the loop variables:
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv) {
double d = atof(argv[1]);
double n;
long long a; // 64 bit machine
double b;
// go_s17J :: Double# -> Int# -> Double# -> Double#
for (n = 1, a = 0, b = 0; n <= d; b += n, n++, a++)
    ;
printf("%f\n", b / a);
return 0;
}
That was quite straightforward, and the C is nicely concise.
It is interesting to see how the loop variables are updated in place
explicitly in the C code, while the functional style implicitly reuses
stack slots (or registers). But they're really describing the
same loop. Now, compiling the C without optimisations:
$ time ./a.out 1e9
500000000.067109
./a.out 1e9 4.72s user 0.00s system 99% cpu 4.723 total
Ok. That's cool. We got the same results, and optimised Haskell beat
unoptimised C. Now with -O, gcc gets closer to GHC:
$ time ./a.out 1e9
500000000.067109
./a.out 1e9 2.10s user 0.00s system 99% cpu 2.103 total
And with -O2 it makes it in front by a nose:
$ time ./a.out 1e9
500000000.067109
./a.out 1e9 1.76s user 0.00s system 99% cpu 1.764 total
gcc -O2 wins the day (barely), just sneaking past GHC -O2, which comes in
ahead of gcc -O and unoptimised gcc.
We can look at the generated assembly (ghc -keep-tmp-files) to find
out what's really going on. GCC generates the rather nice:
.L6:
addsd %xmm1, %xmm2
incq %rax
addsd %xmm3, %xmm1
ucomisd %xmm1, %xmm0
jae .L6
.L8:
cvtsi2sdq %rax, %xmm0
movl $.LC2, %edi
movl $1, %eax
divsd %xmm0, %xmm2
Note the very tight inner loop, and final division. Meanwhile, GHC produces,
with its native code backend:
s1bn_info:
ucomisd 5(%rbx),%xmm6
ja .Lc1dz
movsd %xmm6,%xmm0
addsd .Ln1dB(%rip),%xmm0
leaq 1(%rsi),%rax
addsd %xmm6,%xmm5
movq %rax,%rsi
movsd %xmm0,%xmm6
jmp s1bn_info
.Lc1dz:
cvtsi2sdq %rsi,%xmm0
divsd %xmm0,%xmm5
Quite a bit more junk in the inner loop, which explains the slowdown
with -fasm. The native code generator needs more work to be competitive
at floating point. But with the GHC C backend:
s1bn_info:
ucomisd 5(%rbx), %xmm6
ja .L10
addsd %xmm6, %xmm5
addq $1, %rsi
addsd .LC1(%rip), %xmm6
jmp s1bn_info
.L10:
cvtsi2sdq %rsi, %xmm7
divsd %xmm7, %xmm5
Almost identical to GCC, which explains why the performance was so good!
Some lessons
Lesson 1: To write predictably fast Haskell -- the kind
that competes with C day in and day out -- use tail recursion, and ensure
all types are inferred as simple machine types, like Int, Word, Float or
Double, that have simple machine representations. The performance is
there if you want it.
Lesson 2: Laziness has an overhead -- while it allows
you to write new kinds of programs (where lists may be used as control
structures), the memory traffic that results can be a penalty
if it appears in tight inner loops. Don't rely on laziness to give you
performance in your inner loops.
Lesson 3: For heavy optimisation, the C backend to GHC
is still the way to go. Later this year a new bleeding edge native code
generator will be added to GHC, but until then, the C backend is still
an awesome weapon.
In a later post we'll examine how to get the same performance by using
higher order functions.
2008-05-12
This post is about using GHC's rewrite rules engine to implement really
really cheap type class instance resolution at compile time. Have fun,
and remember: Those who would give up essential type safety, to
purchase a little temporary liberty, deserve neither liberty nor type
safety.
Haskell
type classes are a powerful technique for bounded parametric
polymorphism. Consider the type class Eq:
class Eq a where
(==), (/=) :: a -> a -> Bool
x /= y = not (x == y)
x == y = not (x /= y)
This defines the (==) and (/=) equality operators, and gives some
default implementations (which means we need only define one of the two
to get both functions).
Now, equality only makes sense on a bounded set of types -- not all
types support it. We can add a type to the set of types that
support equality by writing an instance of Eq for it. For example, with
Floats, we can define equality in terms of primitive machine equality on
floats:
instance Eq Float where
(F# x) == (F# y) = x `eqFloat#` y
Or for unbounded integers, in terms of the GMP integer routines:
instance Eq Integer where
(==) = gmp_eqInteger
(/=) = gmp_neqInteger
So now we can write:
main = do
print (pi == exp 1)
print (42 == 431)
And everything works nicely.
We get to reuse the same symbol, (==), for equality on any types that
support it.
Now, in GHC, type classes desugar to dictionaries that pass the
instance methods around. That is,
(==) :: a -> a -> Bool
Becomes:
(==) :: EqDict a -> a -> a -> Bool
where the dictionary stores the equality function to use when testing
'a's for equality. And when the type for 'a' is known statically, GHC
usually does the right thing, and resolves it statically.
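As a sketch of the idea (the names here are invented for illustration;
GHC's real dictionaries are Core-level records, but the shape is the same):

-- a hand-rolled dictionary: one field per class method
data EqDict a = EqDict
    { dictEq  :: a -> a -> Bool
    , dictNeq :: a -> a -> Bool
    }

-- morally, 'instance Eq Bool' becomes a concrete record of functions
eqDictBool :: EqDict Bool
eqDictBool = EqDict
    { dictEq  = \x y -> case (x, y) of
                            (True,  True)  -> True
                            (False, False) -> True
                            _              -> False
    , dictNeq = \x y -> not (dictEq eqDictBool x y)
    }

-- and an 'Eq a =>' constraint becomes an explicit dictionary argument
same :: EqDict a -> a -> a -> Bool
same d = dictEq d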
What isn't widely known is that we can hack our own static dispatch
system together in GHC, using the compiler's term rewriting
capabilities -- known as rewrite rules. These are described in the
paper "Playing by
the rules: Rewriting as a practical optimization technique in GHC".
First, to turn on rewrite rule syntax, enable -fglasgow-exts:
{-# OPTIONS -fglasgow-exts #-}
Now, we need to define the type class operation we want to overload:
class MEEq a where
(=-=) :: a -> a -> Bool
We'll just declare the method's type here, with no concrete default implementation.
The next thing to do is to tell the compiler which types we are going
to have support (=-=):
instance MEEq ()
instance MEEq Bool
instance MEEq Int
Now this is a bit evil, since there's still no concrete implementation
of equality on any of these types. So if you tried to use the code at
this point it would fail.
Now, we'll write some standalone implementations of equality on these
types. This code would normally go in an instance body, but we'll just
have them float free:
eq_Bool :: Bool -> Bool -> Bool
eq_Bool True True = True
eq_Bool False False = True
eq_Bool _ _ = False
eq_Unit :: () -> () -> Bool
eq_Unit () () = True
Ok. So we could write:
main = print (True `eq_Bool` False)
But that doesn't let us do any overloading. What we need is a way to
spot (=-=) and have the compiler replace it with the appropriate
function, based on the method type.
And we can do this. GHC supports rewrite rules, which are used for a wide range of compile time optimizations, including sophisticated data structure transformations.
Normally, rewrite rules are used to match on a name, and replace the left
hand side with some new right hand side.
For example, here is the map fusion rule from GHC's base library, which
replaces two traversals of a list with one:
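{-# RULES
"map/map"  forall f g xs.  map f (map g xs) = map (f . g) xs
  #-}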
A less used feature is the ability to match on an expression's type
on either the left or right hand side of a rewrite rule (the left and
right hand sides must have the same type).
This lets us trivially encode compile-time instance resolution:
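(the rules are reconstructed here; the names match the RuleFired output below)

{-# RULES
"eq/Bool"  (=-=) = eq_Bool
"eq/Unit"  (=-=) = eq_Unit
  #-}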
So, whenever you see (=-=) at Bool or () type, replace it with the
concrete implementation. Overloading resolution, statically, for free.
The slight downside is that if the rule fails to fire, there's no
concrete implementation of (=-=) to fall back on at runtime. So you'll
get a runtime failure :-)
We can see this in action with:
main = do
print $ True =-= False
print $ () =-= ()
print $ 7 =-= (8 :: Int)
We can compile this code, and check the optimizer did the right thing
with ghc-core:
$ ghc-core A.hs
...
2 RuleFired
1 eq/Bool
1 eq/Unit
...
Excellent, the rules fired for Bool and Unit, replacing the abstract
(=-=) with its concrete implementation.
Running the resulting code:
$ ./a.out
False
True
Class.hs:10:0: No instance nor default method for class operation Main.=-=
And all is good, well, except for the Int, which wasn't resolved, since
we didn't write a rule for it.
We'd like to recover the property that real type classes have -- that
it's a type error to use a type class method at a type other than those
that support it. Thanks to Gwern Branwen for suggesting we try this.
The problem is: we want to replace any remnant calls to (=-=) with an
expression that will fail to compile. We're strongly constrained by
the rewrite rules system, though -- rewrite rules must always be
syntactically and type correct. However, and this is the deliciously
evil part, they don't need to be confluent or terminating.
So we can encode the missing instance type error as a non-terminating
rewrite rule! And then there's no risk of runtime failure -- as the
compiler won't terminate :-)
We do this very simply, with the rule:
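(reconstructed; the name "eq/fail" is illustrative, and the specific
eq/Bool and eq/Unit rules must fire in preference to it)

{-# RULES
"eq/fail"  forall x y.  x =-= y = x =-= y
  #-}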
This simply replaces itself with itself forever. Now, our type correct
programs work as expected:
main = do
print $ True =-= False
print $ () =-= ()
Results in:
$ ./A
False
True
However, if we introduce an unresolved overloading:
main = do
print $ True =-= False
print $ () =-= ()
print $ 7 =-= (8 :: Int)
We get a "type error":
$ ghc -O2 A.hs
^C
in the form of a compiler that fails to terminate. As all good Haskellers
know, this is morally sound: one bottom is as good as another, so a type
error is as good as a divergent compiler! And well-typed programs are
still not going to go wrong.
We can also encode the type error as a (compile time) linker failure, by
replacing (=-=) with an unknown foreign symbol that we import
regardless. Of course, this is very evil.
Rewrite rules are a powerful way for users to extend the optimization
suite of the compiler, and they're made possible by purity and static
type information, which enable a richer set of transformations than
would otherwise be feasible.
2008-05-04
A common misconception from non-Haskellers is that Haskell, when compiled, pays
an ongoing penalty for supporting laziness by default. The idea is that in a
lazy language, every expression would suspend to a heap-allocated,
garbage-collected thunk, even if you were to use the value immediately.
That sounds scarily expensive.
In the real world, however, Haskell is as often as not a strict language.
Compiled Haskell undergoes strictness analysis, which separates the "needs
to be lazy" code from the "just do it already" strict stuff. The end result
is purely functional code, written in a lazy style, that gets compiled down
to raw, strict assembly, without thunks, heap allocation or GC getting in
the way of all-out speed.
Like type inference, this means we usually don't have to manually annotate
code with its strictness (or laziness) properties; instead, the compiler
figures it out from how we use the values.
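For example (a sketch; exactly what gets unboxed depends on GHC version
and flags), this accumulator loop has no bangs and no seq, yet GHC works
out that 'acc' is always eventually demanded, and unboxes it into a register:

-- sum the integers from 1 to n, with a lazy-looking accumulator
sumTo :: Int -> Int
sumTo n = go 0 1
  where
    go acc i
        | i > n     = acc
        | otherwise = go (acc + i) (i + 1)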
Here's a nice example, from a new arrays library, based on the larger
data parallel
arrays project.
We'll write some code to generate 3 arrays, add them together, then bit shift
all the elements, and finally, take the sum.
In a language like C, this would mean first allocating a bunch of arrays,
writing loops to fill them, then separate loops to do the rest of the
pipeline. If we were looking for speed, we'd perhaps try to manually fuse
some of the loop bodies together, avoiding extra traversals and deforesting
the intermediate data structures.
In Haskell, though, we tend to build up loops by composing separate
combinators, as if building up a pipeline in the shell. For list operations,
these then get fused away (using build/foldr fusion), meaning only a single
traversal need be done.
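The heart of build/foldr fusion is a single rewrite rule in GHC's base
library, roughly:

{-# RULES
"fold/build"  forall k z (g :: forall b. (a -> b -> b) -> b -> b).
              foldr k z (build g) = g k z
  #-}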
Similarly, for array programming, we can use stream fusion to
eliminate traversals over arrays, making array loops written in a compositional
style as efficient as manually fused loops -- assuming all the mystical
laziness doesn't get in the way.
The data parallel and
uvector array libraries
are implemented using stream fusion for strict arrays, so we can try this out.
Here's an encoding of this sum of shifts of sums of arrays, written compositionally:
import Data.Array.Vector
import Data.Bits
main = print . sumU
. mapU ((2::Int) `shift`)
. zipWith3U plus3
(replicateU 100000000 (2::Int))
(replicateU 100000001 (3::Int)) $
(replicateU 100000002 (5::Int))
where
plus3 x y z = x + y + z
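One note on building this: fusion here is driven by rewrite rules, so the
program must be compiled with optimisations on for the combinators to fuse
(the file name is assumed):

$ ghc -O2 --make zipwith3.hs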
The nice aspect of this is the reduction in cognitive load: you build the loop
up piecewise, and let the compiler do the (aggressive) thinking for you. As an
exercise, try writing a C program that allocates these 3 arrays, and then
uses a single loop to initialise them, add their elements, bit shift them, and
sum the result.
And how well does it work? Our program above undergoes aggressive inlining
by GHC (as it's pure and lazy), and since we've used combinators rather
than manual loops, GHC can see how to combine them, yielding this single
loop at the end:
fold :: Int# -> Int# -> Int# -> Int# -> Int#
fold x y z acc =
    case z of
      100000000 -> acc
      _ ->
        case y of
          100000001 -> acc
          _ ->
            case x of
              100000002 -> acc
              _ -> fold (x +# 1)
                        (y +# 1)
                        (z +# 1)
                        (acc +# 2048)
Look at that constant folding! The compiler saw that the arrays were
initialised with all elements set to 2, 3 and 5, which were then added
and shifted. It realised we're just describing a loop that adds
2 << 10 (i.e. 2048) at each step.
The other thing it noticed was that the values were always going to be demanded. There's
no laziness in this program. So GHC removed all the lazy Int allocations and replaced them
with unboxed Int# types -- raw machine integers living in registers.
There are no `let`s left in this code -- so *nothing* is going to end up on the
heap, and the garbage collector isn't going to get a chance to run!
With strictness analysis and aggressive loop fusion, we end up with the code
we'd write in a low level, non-GC'd language like C -- we just got to write it
in a very declarative, descriptive way.
This tight loop gets passed off to the code generator, yielding the final
assembly:
fold:
cmpq $100000000, %r8
je .L11
.L5:
cmpq $100000001, %rdi
je .L11
.L8:
cmpq $100000002, %rsi
je .L11
.L10:
leaq 1(%rsi), %rsi
leaq 1(%rdi), %rdi
leaq 1(%r8), %r8
addq $2048, %r9
jmp fold
.L11:
movq %r9, %rbx
jmp *(%rbp)
Which looks pretty good. Just a tight loop, no allocation, everything in registers.
$ time ./zipwith3
204800000000
./zipwith3 0.20s user 0.00s system 95% cpu 0.209 total
It fairly zips along.
This kind of very aggressive optimisation is only possible when the compiler
has statically typed your program, can see that the loops are referentially
transparent, and can work out what algorithm each loop is actually doing.
And GHC does this by default -- which is pretty sweet.
Strictness analysis and loop fusion should make the new array libraries a competitive solution.