After the
last post about high performance, high level programming, Slava Pestov, of
Factor fame, wondered whether it was
generally true that "if you want good performance you have to write C in your
language". It's a good question to ask of a high level language.
In this post I want to show how, often, we can answer "No": by
working at a higher abstraction level we can get the same low level
performance, exploiting the fact that the compiler
knows a lot more about what our code does. We can teach the compiler to
better understand the problem domain, and in doing so, enable it to
optimise the code all the way down to the raw assembly we want.
Specifically, we'll exploit stream fusion -- a new compiler optimisation that
removes intermediate data structures produced in function pipelines to produce
very good code, while encouraging a very high level coding style. The best
of high level programming, combined with the best of raw low level throughput.
The micro-benchmark: big list mean
This post is based around a simple microbenchmark developed in the last post
of this
series: computing the mean of a list of floating point values.
It's a tiny program, but one with interesting properties: the list is enormous
(10^9 values) -- so dealing with it effectively requires laziness of some
kind, and a naive implementation is far too inefficient, so we have to be
smarter with our loops.
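For contrast, a naive version (a sketch, not code from the earlier post) might look like this:

-- Naive sketch: sum and length are two separate traversals, so the whole
-- 10^9-element list is retained in memory between them, on top of all the
-- cons-cell allocation. This is the version we must avoid.
naiveMean :: Double -> Double -> Double
naiveMean n m = sum xs / fromIntegral (length xs)
  where
    xs = [n .. m]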
The list is generated via an enumeration -- a key property the compiler
can later exploit -- and we have two reference implementations on hand. One in
Haskell, using a manually fused, tail recursive style, with a worker/wrapper
transformation applied:
import System.Environment
import Text.Printf

mean :: Double -> Double -> Double
mean n m = go 0 0 n
  where
    go :: Double -> Int -> Double -> Double
    go s l x | x > m     = s / fromIntegral l
             | otherwise = go (s+x) (l+1) (x+1)

main = do
    [d] <- map read `fmap` getArgs
    printf "%f\n" (mean 1 d)
And a straightforward translation to C, with all data structures
fused away:
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    double d = atof(argv[1]);
    double n;
    long long a;   // 64 bit machine
    double b;

    // go_s17J :: Double# -> Int# -> Double# -> Double#
    for (n = 1, a = 0, b = 0; n <= d; b += n, n++, a++)
        ;

    printf("%f\n", b / a);
    return 0;
}
The C will serve as a control -- a lower bound on runtime that we want to reach.
Both reference programs compute the same thing, in the same time, as they have
pretty much identical runtime representations:
$ gcc -O2 t.c
$ time ./a.out 1e9
500000000.067109
./a.out 1e9 1.75s user 0.00s system 99% cpu 1.757 total
and
$ ghc -O2 A.hs -optc-O2 -fvia-C --make
$ time ./A 1e9
500000000.067109
./A 1e9 1.76s user 0.00s system 100% cpu 1.760 total
In the previous post we looked at how and why the Haskell version is
able to compete directly with C here, when working at a low level
(explicit recursion). Now let's see how to fight the battle at high
altitude.
Left folds and deforestation
Clearly, writing in a directly recursive style is a reliable path to
fast, tight loops, but there's a programmer burden: we had to sit and
think about how to combine the list generator with its consumer. For
cheaper cognitive load we'd really like to keep the logically separate
loops ... literally separate, by composing a list generator with a list
consumer. Something like this:
mean n m = s / fromIntegral l
  where
    (s, l) = foldl' k (0, 0) [n .. m]
    k (s, l) a = (s+a, l+1)
The fold is the key. This is a nice decoupling of the original recursive
loop -- the generator is abstracted out again into an enumeration, and
the process of traversing the list once, accumulating two values (the
sum and the length), is made clear. This is the kind of code we want to
write. However, there are some issues at play that make this not
quite the required implementation (as it doesn't yield the same runtime
representation as the control implementations):
- The tuple accumulator is too lazy for a tail recursive loop
- The length accumulator defaults to an overly expensive, unbounded Integer
The fix is straightforward: just use a strict pair type for nested
accumulators:
data P = P !Double !Int

mean :: Double -> Double -> Double
mean n m = s / fromIntegral l
  where
    P s l = foldl' k (P 0 0) [n .. m]
    k (P s l) a = P (s+a) (l+1)
Strict products, of atomic types, have a great property: when
used like this they can be represented as sets of register variables
(compile with -funbox-strict-fields). The P data type is essentially an
abstract mapping from P's fields into two registers for use as loop
accumulators. We can be fairly certain the fold will now compile to a
loop whose accumulating parameters are passed -- and, thanks to the
"constructed product result" optimisation, returned -- in registers.
This is easy to confirm via ghc-core.
Compiling the code, we see the result below. The '#' suffix indicates
the Int and Double types are unlifted -- they won't be stored on the
heap, and will most likely be kept in registers. The return type, an
unlifted pair (#, #), is also excellent: it means the result returned
from the function will be kept in registers as well (we don't need to
allocate at all to get the fold's state in or out of the function!). So
now we're getting closer to the optimal representation. The core we
get:
go :: Double# -> Int# -> [Double] -> (# Double#, Int# #)
go a b c = case c of
    []     -> (# a, b #)
    x : xs -> case x of
        D# y -> go (+## a y) (+# b 1) xs
This will run in constant space, as the lazy list is demanded (so the
first part of the problem is solved), and the
result will be accumulated in registers (so that part is fine). But
we still pay a performance penalty here -- the list of Doubles
is still concretely, lazily constructed. GHC wasn't able to
spot that the generator of sequential values could be combined with the
fold that consumes it, avoiding the list construction entirely.
This will cost us quite a bit in memory traffic in such a tight loop:
$ time ./B 1e9
500000000.067109
./B 1e9 66.92s user 0.38s system 99% cpu 1:07.45 total
Right, so we got an answer, but dropping out to the heap to get the next
list node is an expensive bottleneck in such a tight loop, as expected.
The deeper problem here is that the compiler isn't able to combine left
folds with list generators into a single loop. GHC can do
this for right folds, as described in the original paper for this
optimisation, A
Short Cut to Deforestation . That is, if you compose certain
functions written as right folds GHC will automatically remove
intermediate lists between them, yielding a single loop that does the
work of both, without any list being constructed. This will work nicely
for pipelines of the following functions: head, filter, iterate, repeat,
take, and, or, any, all, concat, foldr, concatMap, zip, zipWith, ++,
map, and list comprehensions.
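For reference, here's a minimal sketch of what that short cut looks like, following the paper's definitions rather than GHC's actual base library code (the module and names here are illustrative):

{-# LANGUAGE RankNTypes #-}
-- A minimal sketch of short-cut (foldr/build) fusion.
module ShortCut where

import Prelude hiding (foldr, map)

foldr :: (a -> b -> b) -> b -> [a] -> b
foldr k z = go
  where
    go []     = z
    go (x:xs) = k x (go xs)

-- A "good producer": builds a list abstracted over cons and nil.
build :: (forall b. (a -> b -> b) -> b -> b) -> [a]
build g = g (:) []

-- The short cut: a foldr applied directly to a build skips the list.
{-# RULES
"foldr/build" forall k z (g :: forall b. (a -> b -> b) -> b -> b).
    foldr k z (build g) = g k z
  #-}

-- map written as both a good producer and a good consumer, so chains of
-- such functions fuse into one loop with no intermediate lists.
map :: (a -> b) -> [a] -> [b]
map f xs = build (\c n -> foldr (\x ys -> c (f x) ys) n xs)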
But it won't work for some key left folds such as foldl, foldl', foldl1,
length, minimum, maximum, sum and product. Another downside is that the
enumFromTo list generator only fuses for Int, Char and Integer types --
an oversight in the current implementation.
What we need is a different deforestation strategy -- one that teaches
the compiler how to see through the left fold.
Stream fusion
An alternative (and new) deforestation optimisation for sequence types,
such as lists or arrays, is stream
fusion, which overcomes the foldr bias of GHC's old method.
As GHC allows custom optimisations to be written by users in libraries
this is cheap to try out -- the new optimisation comes bundled in the
libraries of various new data structures: the
fusible
list package, and a new fusible
arrays library. It is also available for distributed
parallel arrays, as part of the data
parallel Haskell project, from which the non-parallel fusible arrays
are derived.
There are two main enabling technologies that have brought about this
abundance of data structure deforestation optimisations in Haskell:
user-specified term rewriting rules, and ubiquitous, pure higher order
functions. If you don't have the former the burden of adding new
optimisations by hacking the compiler is too high for mortals, and if
you don't have the latter -- HOFs with guaranteed purity -- there
simply won't be enough fusion opportunities to make the transformation
worthwhile. Happily, GHC Haskell has both -- in particular, it can be
more aggressive as it knows the loops will have no side effects when we
radically rearrange the code. Fun, fun, fun.
So, taking the fusible arrays library, we can rewrite our
fold example precisely as:
import System.Environment
import Data.Array.Vector
import Text.Printf

data P = P !Double !Int

mean :: Double -> Double -> Double
mean n m = s / fromIntegral l
  where
    P s l = foldlU k (P 0 0) (enumFromToFracU n m)
    k (P s l) a = P (s+a) (l+1)

main = do
    [d] <- map read `fmap` getArgs
    printf "%f\n" (mean 1 d)
This is practically identical to the naive list version: we simply replaced
foldl' with foldlU, and [n .. m] with an explicit enumeration (as list
enumeration syntax isn't available for arrays). GHC compiles the fold down to:
RuleFired
    1 streamU/unstreamU

go :: Double# -> Int# -> Double# -> (# Double#, Int# #)
go = \ a b c ->
    case >## c limit of
        False -> go (+## a c) (+# b 1) (+## c 1.0)
        True  -> (# a, b #)
Look how the list generator and consumer have been combined
into a single loop with 3 register variables, corresponding
to the sum, the length, and the next element to generate
in the list. The rule firing indicates that an intermediate array
was removed from the program. Our "array program" allocates no arrays -- as
the compiler is able to see through the problem all the way from array
generation to consumption! The final assembly looks really good (ghc
-funbox-strict-fields -O2 -fvia-C -optc-O2):
s1rD_info:
    ucomisd 5(%rbx), %xmm6
    ja      .L87
    addsd   %xmm6, %xmm5
    addq    $1, %rsi
    addsd   .LC2(%rip), %xmm6
    jmp     s1rD_info
And the key test: the assembly from the high level approach is
practically identical to the hand fused, worker/wrapper applied low
level implementation:
s1bm_info:
    ucomisd 8(%rbp), %xmm6
    ja      .L73
    addsd   %xmm6, %xmm5
    addq    $1, %rbx
    addsd   .LC0(%rip), %xmm6
    jmp     s1bm_info
And to the original C implementation:
.L5:
    addsd   %xmm1, %xmm3
    addq    $1, %rax
    addsd   %xmm2, %xmm1
    ucomisd %xmm1, %xmm0
    jae     .L5
As a result the performance should be excellent:
$ time ./UFold 1e9
500000000.067109
./UFold 1e9 1.77s user 0.01s system 100% cpu 1.778 total
If we look at the runtime statistics (+RTS -sstderr) we see:
16,232 bytes allocated in the heap
%GC time 0.0% (0.0% elapsed)
So 16k allocated in total, and the garbage collector was never invoked.
There was simply no abstraction penalty in this program! In fact, the
abstraction level made possible the required optimisations.
How it works
Here's a quick review of how this works. For the full treatment,
see the paper.
The first thing to do is to represent arrays as abstract unfolds over a
sequence type, with some hidden state component and a final length:
-- the state type 's' is existentially hidden inside the stream
-- (using the ExistentialQuantification extension)
data Stream a = forall s. Stream (s -> Step s a) !s Int

data Step s a = Done
              | Yield !a !s
              | Skip  !s
This type lets us represent traversals and producers of lists as simple
non-recursive stepper functions -- abstract loop bodies -- which the compiler
already knows how to optimise really well. A stream is driven by a
non-recursive stepper function that, at each step, either signals that the
stream is done, or yields an element along with the state required to get at
later elements, or skips an element (which lets us implement control flow
between steppers by switching states around).
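For example, here is how a map over streams might look (a sketch, not necessarily the library's own definition) -- note that the wrapping stepper is itself non-recursive:

-- Hypothetical stream map: wrap the old stepper in a new, still
-- non-recursive one that applies f to each yielded element.
mapS :: (a -> b) -> Stream a -> Stream b
mapS f (Stream next s n) = Stream next' s n
  where
    next' s' = case next s' of
        Done      -> Done
        Skip    t -> Skip t
        Yield x t -> Yield (f x) t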
We'll then need a way to turn arrays into streams:
streamU :: UA a => UArr a -> Stream a
streamU !arr = Stream next 0 n
  where
    n = lengthU arr
    next i | i == n    = Done
           | otherwise = Yield (arr `indexU` i) (i+1)
And a way to turn streams back into (unlifted, ST-encapsulated) arrays:
unstreamU :: UA a => Stream a -> UArr a
unstreamU st@(Stream next s n) = newDynU n (\marr -> unstreamMU marr st)

unstreamMU :: UA a => MUArr a s -> Stream a -> ST s Int
unstreamMU marr (Stream next s n) = fill s 0
  where
    fill s i = case next s of
        Done       -> return i
        Skip  s'   -> fill s' i
        Yield x s' -> writeMU marr i x >> fill s' (i+1)
This fills a chunk of memory by unfolding the yielded elements of the
stream. Now, the key optimisation to teach the compiler:
{-# RULES
"streamU/unstreamU" forall s.
streamU (unstreamU s) = s
#-}
And that's it. GHC now knows how to remove any occurrence of an array
construction followed by its immediate consumption.
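For instance, any other array combinator written in the same streamU/unstreamU sandwich style will fuse with its neighbours for free. A sketch, assuming the hypothetical mapS from above (the library's real definitions may differ in detail):

-- Hypothetical: map over unboxed arrays, via streams.
mapU :: (UA a, UA b) => (a -> b) -> UArr a -> UArr b
mapU f = unstreamU . mapS f . streamU

-- Composing two maps:
--   mapU f (mapU g arr)
--     = unstreamU (mapS f (streamU (unstreamU (mapS g (streamU arr)))))
--     = unstreamU (mapS f (mapS g (streamU arr)))   -- streamU/unstreamU fires
-- so only the final array is ever built.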
We can write an enumerator and foldl for streams:
enumFromToFracS :: Double -> Double -> Stream Double
enumFromToFracS n m = Stream next n (truncate (m - n))
  where
    lim = m + 1/2
    next s | s > lim   = Done
           | otherwise = Yield s (s+1)
foldS :: (b -> a -> b) -> b -> Stream a -> b
foldS f z (Stream next s _) = fold z s
  where
    fold !z s = case next s of
        Yield x !s' -> fold (f z x) s'
        Skip  !s'   -> fold z s'
        Done        -> z
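As a quick usage check (a hypothetical helper, not part of the library), a stream-level sum is just a partial application of foldS:

-- Hypothetical: sum the elements of a stream.
sumS :: Num a => Stream a -> a
sumS = foldS (+) 0

-- e.g. sumS (enumFromToFracS 1 10) == 55.0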
And we're almost done. Concrete implementations of foldlU and the array
enumerators are now written by converting arrays to and from streams, and
folding or enumerating those instead (enumFromToFracU is defined analogously
to the integral enumFromToU shown here):
enumFromToU :: (UA a, Integral a) => a -> a -> UArr a
enumFromToU start end = unstreamU (enumFromToS start end)

foldlU :: UA a => (b -> a -> b) -> b -> UArr a -> b
foldlU f z = foldS f z . streamU
So, if we write a program that composes foldlU and enumFromToU, such as
our big mean problem:
= foldlU k (P 0 0) (enumFromToFracU n m)
The compiler will immediately inline the definitions to:
= foldS k (P 0 0) . streamU . unstreamU $ enumFromToFracS n m
The fusion rule kicks in at this point, and we've killed off the
intermediate array:
= foldS k (P 0 0) $ enumFromToFracS n m
Now we have a single loop, the foldS, which takes a non-recursive list
generator as an argument. GHC then squooshes away all the intermediate
Step constructors, leaving a final loop with just the list generator
index, and the pair of state values for the sum and length:
go :: Double# -> Int# -> Double# -> (# Double#, Int# #)
go = \ a b c ->
    case >## c limit of
        False -> go (+## a c) (+# b 1) (+## c 1.0)
        True  -> (# a, b #)
Exactly the low level code we wanted -- but written for us by the compiler.
And the final performance tells the story:
$ time ./UFold 1e9
500000000.067109
./UFold 1e9 1.77s user 0.01s system 100% cpu 1.778 total
A more complicated version
We can try a more complicated variant, numerically stable mean:
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    double d = atof(argv[1]);
    double n;
    long long a;   // 64 bit machine
    double b;

    // go :: Double# -> Int# -> Double# -> Double#
    for (n = 1, a = 1, b = 0; n <= d; b += (n - b) / a, n++, a++)
        ;

    printf("%f\n", b);
    return 0;
}
Implemented in Haskell as a single fold, which computes the stable
average each time around:
import Text.Printf
import System.Environment
import Data.Array.Vector

mean :: UArr Double -> Double
mean = fstT . foldlU k (T 0 1)
  where
    k (T b a) n = T b' (a+1)
      where b' = b + (n - b) / fromIntegral a

data T = T !Double !Int

fstT (T a _) = a

main = do
    [d] <- map read `fmap` getArgs
    printf "%f\n" (mean (enumFromToFracU 1 d))
We get to work at a higher level than the C code, which has the list generation
and consumption manually fused away. The end result is the same, though:
$ ghc-core F.hs -O2 -optc-O2 -fvia-C -funbox-strict-fields
$ time ./F 1e9
500000000.500000
./F 1e9 19.28s user 0.00s system 99% cpu 19.279 total
$ gcc -O2 u.c
$ time ./u 1e9
500000000.500000
./u 1e9 19.09s user 0.00s system 100% cpu 19.081 total
So the fast inner loops are fine, and we get to use a high level
language for all the surrounding glue. That's just super.
So, to answer Slava's question: yes, we often can get the performance of
C, while working at a higher altitude. And to do so we really want to
exploit all the additional information we have at hand when working more
abstractly. And the user doesn't have to know this stuff -- the library
already does all the thinking. Knowledge is (optimisation) power.
Thinking at a higher level enables the compiler to do more powerful
things, leaving less work for the programmer. It's all win.