In an oddly-titled post earlier
today I'd had too much coffee, and we looked at how compiled Haskell code
smokes out interpreted code (for various reasons, mostly to do with being
compiled and not interpreted). However, the real point of the article (other
than to burn things with my flame thrower) was to start to explore the new
parallelism annotations in Haskell.
So let's continue that, with some more explorations of how far we can get with
parallel annotations, and whether Haskell can compete for C, given enough
cores. (And I'd just like to note that Spencer Janssen (of the xmonads) contributed most of the code and ideas
for this article :).
We'll stick to the the naive
fibonacci implementation, but switch to a 4 core machine, and see how far we
can scale up the Haskell code as we add cores, without resorting to manual
parallel programming.
So, the naive Haskell, but we'll compute to a reasonable size of N:
fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = fib (n1) + fib (n2)
main = forM_ [0..45] $ \i ->
printf "n=%d => %d\n" i (fib i)
All very pretty. Now, if we run it on a 4 core, 3Ghz linux machine, will it use
those resources?
$ ghc-6.8.1 -O2 Naive.hs -o naive --make
$ time ./naive
n=0 => 0
n=1 => 1
n=2 => 1
...
n=44 => 701408733
n=45 => 1134903170
./naive 39.03s user 0.00s system 99% cpu 39.108 total
99%. So the answer is no: GHC doesn't magically parallelise the
naive code (nor memoise it, guys...). Remember that number: 39s. That's our
goal to beat by using the extra cores.
However, we can give the compiler some hints about how best to parallelise
the code, using the lovely `par` annotation (from Control.Parallel) (originally from this paper).
From the manual:
The expression (x `par` y) sparks the evaluation of x (to weak head normal
form) and returns y. Sparks are queued for execution in FIFO order, but are
not executed immediately. If the runtime detects that there is an idle CPU,
then it may convert a spark into a real thread, and run the new thread on the
idle CPU. In this way the available parallelism is spread amongst the real
CPUs.
So let's naively annotate this. Just split the tree into two parts, and run the
first branch in parallel with the other, hoping that it finishes about the same
time as the second, so there's no waiting:
import Control.Parallel
import Control.Monad
import Text.Printf
fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = r `par` (l `pseq` l+r)
where
l = fib (n1)
r = fib (n2)
main = forM_ [0 .. 45] $ \i ->
printf "n=%d => %d\n" i (fib i)
Let's compile this with some reasonable flags:
$ ghc-6.8.1 NaivePar.hs -O2 -o np --make -threaded
And we can toss two cores at it to start with:
./np +RTS -N2 138.69s user 1.18s system 190% cpu 72.48 total
Ok, how about 3, or 4 cores?
./np +RTS -N3 160.39s user 1.98s system 261% cpu 62.15 total
./np +RTS -N4 167.61s user 2.26s system 311% cpu 54.53 total
Hmm, interesting! While the cpus are getting utilised, we're not making much
progress towards our naive single core goal. What is going on?
The problem, of course, is that we're wasting time registering thread sparks
for very small expressions (anything under about N=35 or so). We should really
not use `par` for those little jobs, since the cost of registering a thread
spark outweighs the cost of just evaluating it here and now.
So what we can do is use the `par` version when N is larger, and
drop back to straight line code for smaller jobs. That should
do the trick.
import Control.Parallel
import Control.Monad
import Text.Printf
cutoff = 35
fib' :: Int -> Integer
fib' 0 = 0
fib' 1 = 1
fib' n = fib' (n1) + fib' (n2)
fib :: Int -> Integer
fib n | n < cutoff = fib' n
| otherwise = r `par` (l `pseq` l + r)
where
l = fib (n1)
r = fib (n2)
main = forM_ [0..45] $ \i ->
printf "n=%d => %d\n" i (fib i)
So that's the same algorithm, just split into two loops, once of which creates
thread sparks. Now, we can try this with a couple of cores:
$ ghc-6.8.1 Par.hs -O2 -o real-par --make -threaded
$ time ./real-par +RTS -N2
n=0 => 0
n=1 => 1
n=2 => 1
...
n=44 => 701408733
n=45 => 1134903170
./real-par +RTS -N2 71.98s user 0.49s system 191% cpu 37.866 total
Ok. Cool, with two cores, and the `par` overhead, we're actually beating
one core now. How about 3?
./real-par +RTS -N3 75.03s user 0.82s system 262% cpu 28.854 total
Excellent. And how about the lot?
./real-par +RTS -N4 76.81s user 0.75s system 351% cpu 22.059 total
Haskell FTW! So that's scaling up enough for now, and, considering the
effort involved to parallelise it, I'm more than happy with that result.
This is, as far as I'm aware, the lightest weight parallelism mechanism in any
mainstream language. And the magical thing is that we parallelised our code
without ever worrying about synchronisation, communication, race conditions,
dead locks, live locks. semaphores, mutexes...
The other interesting thing to think about: at what point do we beat the same
algorithm in C, and how hard would it be to parallelise the algorithm in C with
pthreads... I'm not going to attempt the latter, but we can check the former:
#include
#include
long long fib(long long n) {
if (n < 2) {
return 1;
}
return fib(n - 2) + fib(n - 1);
}
int main(int argc, char ** argv) {
long long n = 0;
for (n = 0; n <= 45; n++) {
printf("Fib(%lld): %lld\n", n, fib(n));
}
return 0;
}
Compiled with:
$ gcc -O3 par.c -o par-c
And running it:
$ time ./par-c
Fib(0): 1
Fib(1): 1
...
Fib(43): 701408733
Fib(44): 1134903170
./par-c 32.91s user 0.00s system 99% cpu 32.960 total
32s. Wow! So with an off-the-shelf Linux box, you can write simple (but parallel)
Haskell will outperform gcc's best efforts by a good margin -- today!
Multicore programming just got a lot easier.
/home ::
/haskell ::
permalink ::
rss