In an oddly-titled post earlier
today I'd had too much coffee, and we looked at how compiled Haskell code
smokes out interpreted code (for various reasons, mostly to do with being
compiled and not interpreted). However, the real point of the article (other
than to burn things with my flame thrower) was to start to explore the new
parallelism annotations in Haskell.
So let's continue that, with some more explorations of how far we can get with
parallel annotations, and whether Haskell can compete for C, given enough
cores. (And I'd just like to note that Spencer Janssen (of the xmonads) contributed most of the code and ideas
for this article :).
We'll stick to the the naive
fibonacci implementation, but switch to a 4 core machine, and see how far we
can scale up the Haskell code as we add cores, without resorting to manual
parallel programming.
So, the naive Haskell, but we'll compute to a reasonable size of N:
fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = fib (n1) + fib (n2)
main = forM_ [0..45] $ \i ->
printf "n=%d => %d\n" i (fib i)
All very pretty. Now, if we run it on a 4 core, 3Ghz linux machine, will it use
those resources?
$ ghc-6.8.1 -O2 Naive.hs -o naive --make
$ time ./naive
n=0 => 0
n=1 => 1
n=2 => 1
...
n=44 => 701408733
n=45 => 1134903170
./naive 39.03s user 0.00s system 99% cpu 39.108 total
99%. So the answer is no: GHC doesn't magically parallelise the
naive code (nor memoise it, guys...). Remember that number: 39s. That's our
goal to beat by using the extra cores.
However, we can give the compiler some hints about how best to parallelise
the code, using the lovely `par` annotation (from Control.Parallel) (originally from this paper).
From the manual:
The expression (x `par` y) sparks the evaluation of x (to weak head normal
form) and returns y. Sparks are queued for execution in FIFO order, but are
not executed immediately. If the runtime detects that there is an idle CPU,
then it may convert a spark into a real thread, and run the new thread on the
idle CPU. In this way the available parallelism is spread amongst the real
CPUs.
So let's naively annotate this. Just split the tree into two parts, and run the
first branch in parallel with the other, hoping that it finishes about the same
time as the second, so there's no waiting:
import Control.Parallel
import Control.Monad
import Text.Printf
fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = r `par` (l `pseq` l+r)
where
l = fib (n1)
r = fib (n2)
main = forM_ [0 .. 45] $ \i ->
printf "n=%d => %d\n" i (fib i)
Let's compile this with some reasonable flags:
$ ghc-6.8.1 NaivePar.hs -O2 -o np --make -threaded
And we can toss two cores at it to start with:
./np +RTS -N2 138.69s user 1.18s system 190% cpu 72.48 total
Ok, how about 3, or 4 cores?
./np +RTS -N3 160.39s user 1.98s system 261% cpu 62.15 total
./np +RTS -N4 167.61s user 2.26s system 311% cpu 54.53 total
Hmm, interesting! While the cpus are getting utilised, we're not making much
progress towards our naive single core goal. What is going on?
The problem, of course, is that we're wasting time registering thread sparks
for very small expressions (anything under about N=35 or so). We should really
not use `par` for those little jobs, since the cost of registering a thread
spark outweighs the cost of just evaluating it here and now.
So what we can do is use the `par` version when N is larger, and
drop back to straight line code for smaller jobs. That should
do the trick.
import Control.Parallel
import Control.Monad
import Text.Printf
cutoff = 35
fib' :: Int -> Integer
fib' 0 = 0
fib' 1 = 1
fib' n = fib' (n1) + fib' (n2)
fib :: Int -> Integer
fib n | n < cutoff = fib' n
| otherwise = r `par` (l `pseq` l + r)
where
l = fib (n1)
r = fib (n2)
main = forM_ [0..45] $ \i ->
printf "n=%d => %d\n" i (fib i)
So that's the same algorithm, just split into two loops, once of which creates
thread sparks. Now, we can try this with a couple of cores:
$ ghc-6.8.1 Par.hs -O2 -o real-par --make -threaded
$ time ./real-par +RTS -N2
n=0 => 0
n=1 => 1
n=2 => 1
...
n=44 => 701408733
n=45 => 1134903170
./real-par +RTS -N2 71.98s user 0.49s system 191% cpu 37.866 total
Ok. Cool, with two cores, and the `par` overhead, we're actually beating
one core now. How about 3?
./real-par +RTS -N3 75.03s user 0.82s system 262% cpu 28.854 total
Excellent. And how about the lot?
./real-par +RTS -N4 76.81s user 0.75s system 351% cpu 22.059 total
Haskell FTW! So that's scaling up enough for now, and, considering the
effort involved to parallelise it, I'm more than happy with that result.
This is, as far as I'm aware, the lightest weight parallelism mechanism in any
mainstream language. And the magical thing is that we parallelised our code
without ever worrying about synchronisation, communication, race conditions,
dead locks, live locks. semaphores, mutexes...
The other interesting thing to think about: at what point do we beat the same
algorithm in C, and how hard would it be to parallelise the algorithm in C with
pthreads... I'm not going to attempt the latter, but we can check the former:
#include
#include
long long fib(long long n) {
if (n < 2) {
return 1;
}
return fib(n - 2) + fib(n - 1);
}
int main(int argc, char ** argv) {
long long n = 0;
for (n = 0; n <= 45; n++) {
printf("Fib(%lld): %lld\n", n, fib(n));
}
return 0;
}
Compiled with:
$ gcc -O3 par.c -o par-c
And running it:
$ time ./par-c
Fib(0): 1
Fib(1): 1
...
Fib(43): 701408733
Fib(44): 1134903170
./par-c 32.91s user 0.00s system 99% cpu 32.960 total
32s. Wow! So with an off-the-shelf Linux box, you can write simple (but parallel)
Haskell will outperform gcc's best efforts by a good margin -- today!
Multicore programming just got a lot easier.
/home ::
/haskell ::
permalink ::
rss
Antonio
Cangiano writes about how Ruby 1.9 has improved in various ways, so
that the naive fibonacci algorithm
is finally faster in Ruby than in Python.
So I wondered how well Haskell would do at something like
this, and whether we could squeeze any more performance out using some
cheap parallelism. Looking at the code,
The Ruby:
def fib(n)
if n == 0 || n == 1
n
else
fib(n-1) + fib(n-2)
end
end
36.times do |i|
puts "n=#{i} => #{fib(i)}"
end
The Python:
def fib(n):
if n == 0 or n == 1:
return n
else:
return fib(n-1) + fib(n-2)
for i in range(36):
print "n=%d => %d" % (i, fib(i))
The Haskell:
fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = fib (n1) + fib (n2)
main = forM_ [0..35] $ \i ->
printf "n=%d => %d\n" i (fib i)
And, since I've got two cores in my laptop, the Haskell annotated for
speculative parallelism:
import Control.Parallel
fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = l `pseq` r `pseq` l+r
where
l = fib (n1)
r = fib (n2)
main = forM_ [0..35] $ \i ->
printf "n=%d => %d\n" i (fib i)
One thing that stands out to me is that Haskell looks a fair bit like the
Python, and they're all about the same "development effort". The other startling
thing is how cheap it is to turn the single processor Haskell code into one with
hints on how to evaluate it in parallel. The algorithm doesn't change,
we just add parallelism hints that the compiler then uses to parallelise
the code. Importantly, and unlike say, Erlang or Concurrent Haskell, we
don't have to do manual thread creation, synchronisation or
communication -- the compiler does all that for us!
However, the important differences emerge when we run the code: only one
implementation here does type erasure (meaning no wasteful runtime type
checks), produces native code, always
optimises tail calls to gotos, and can do parallelism annotations...
So parallel Haskell is around 60x Python here, and 150x faster than
Ruby. And there was basically no difference in development effort
required. Which high level language would you choose?
Over on Good Math, Bad Math, MarkCC writes about Erlang and Haskell:
Haskell systems don't, in general, parallelize well. They're particularly bad
for the kind of very coarse thread-based concurrency that we need to program
for on multi-core computers, or distributed systems.
I suspect Mark hasn't written much multicore concurrent Haskell code, as he resorts
to the usual "monads are hard" argument, rather than offering example code.
However, here in the real world, concurrent programming in Haskell is pretty
much identical to Erlang, though with nicer syntax, in a more modern language.
Importantly, there are no monadic headaches involved (monads don't even enter
the picture). And the results often out-perform
Erlang (and most other languages), due to native code compilation with type
erasure, and better data structure support, provided by Haskell libraries
(particular for strings) simply not available in Erlang.
So let's look at the basic tools at our disposal here are:
Starting first with a look at the Erlang code. Mark starts with an introduction to syntax:
-module(map).
-export(map/2, two/1).
map(F, []) -> [];
map(F,[Car|Cdr]) ->
NewCar = apply(F, [Car]),
NewCdr = map(F,Cdr),
[NewCar|NewCdr].
two(X) -> 2*X.
Which looks a bit like an antique Haskell, with funny numbers instead of a type
system (in the export list). Rewriting in Haskell:
module Map (map, two) where
map _ [] = []
map f (x:xs) = f x : map f xs
two = (2*)
Ok, so simpler and cleaner (I mean, really, using `apply' for function
application!?)
Now, let's look at the Erlang for forking a simple process, that receives
messages in a loop:
-module(echo).
-export([start/0, loop/0]).
start() ->
spawn(echo, loop, []).
loop() ->
receive
{From, Message} ->
From ! Message,
loop()
end.
This translates directly to Haskell's forkIO and MVars for communication:
We just launch the main thread here, which forks a second Haskell thread, which
in turn sits receiving messages and printing them:
main = do
box <- newEmptyMVar
forkIO (echo box)
echo c = forever $ do
msg <- takeMVar c
print msg
'spawn' maps to 'forkIO', and Erlang's builtin message passing maps to a
communication mechanism of your choice in Haskell. We chose MVars here, but you
could also have gone with
To run such a threaded Haskell program across multiple cores (this coarse
gained, multicore concurrency Mark talks about), we need to nothing more than
compile in the SMP runtime:
$ ghc -threaded -O Concurrent.hs
Now that's compiled to real native code, on an smp parallel runtime! We can
run this over as many cores as we have, and GHC will schedule threads across
the cores:
$ ./a.out +RTS -N2
There are no monadic headaches. Concurrent and parallel multicore programming
in Haskell is simple, efficient and easy!
Since its so easy, and has such little impact on the structure of your Haskell
programs, you can start speculatively supporting multicore hardware: your
Haskell program will utilise as many real cores as it needs, without needing to
recompile and modify the code. Just change the value of `N', and throw around
forkIO liberally, much as you would in Erlang.
So let's do something useful with this, how about a little program that
computes primes and fibonacci numbers? We'll just fork processes to compute
prime numbers and fibonacci numbers, and have the main thread lazily print
results as they're found:
import Control.Concurrent.Chan
import Control.Concurrent
main = do
primes <- run primes
fibs <- run fibonacci
mapM_ print $ zip primes fibs
where
run f = do
c <- newChan
l <- getChanContents c
forkIO (writeList2Chan c f)
return l
primes = sieve [2..]
where
sieve (p : xs) = p : sieve [x | x <- xs, x `mod` p > 0]
fibonacci = 0 : 1 : zipWith (+) fibonacci (tail fibonacci)
Very importantly, we see our computational processes are just pure Haskell code,
and our main function is some IO skin, as usual. This is almost identical to code
you would write ignoring concurrency: nothing scary is needed to write parallel
Haskell!
Compiling this with the parallel runtime:
$ ghc -threaded -O Server.hs
And running it on across 2 cores:
$ time ./Z +RTS -N2
(2,0)
(3,1)
(5,1)
(7,2)
(11,3)
(13,5)
(17,8)
(19,13)
(23,21)
(29,34)
(31,55)
(37,89)
(41,144)
(43,233)
(47,377)
(53,610)
(59,987)
(61,1597)
(67,2584)
^C interrupted
Easy! The key points to take away:
- parallel and concurrent code is easy in Haskell, much as in Erlang
- multi-core concurrency is well supported, out of the box
- concurrent Haskell is damn fast
And most importantly, there are no monadic headaches!
/home ::
/haskell ::
permalink ::
rss
2007-11-24
One of the true gems of the new GHC 6.8.1 release (besides the
ubiquitous 20% speedup in your compiled code) is that we finally have a
robust solution to tracking down unknown exceptions in Haskell code!
Using the shiny new GHCi debugger, courtesy of Simon Marlow, José
Iborra, Bernie Pope and Andy Gill it is possible to obtain a perfect
backtrace to source location, and display the source, of your failing
code -- the kind you might find if you're a bit sloppy when reasoning
about totality:
$ ./a.out
*** Exception: Prelude.head: empty list
Pretty useless error message. Tracking this down has been a hard
problem using GHC in the past. You had to either:
-
Convince the profiler to give you a stack trace (i.e. the little known
+RTS -xc option)
(buried in the "interested souls" section of the GHC manual
-
Replace calls to head with calls that will fail with a better error
message (maybe using a library such as loch)
-
Get your code into Haskell 98 shape, and use Catch totality
checker to statically analyse your code for partial functions, or maybe
use Hat to trace the code.
All three solutions are quite obscure, the first one works but is tedious,
the second is used by no one, and the third only works in theory.
Yet the problem of partial functions and unknown exceptions persists.
Until now.
Imagine you develop a lovely Haskell program called HsColour, and deep down inside is a
call to 'head':
head :: [a] -> a
head (x:_) = x
head [] = error "Prelude.head: empty list"
that relies on some invariant in your code you forgot to
write a QuickCheck property for. As long as the invariant holds, the
call to head is safe. When run, your program happily marks up Haskell
code as html:
$ runhaskell HsColour.hs -css < source.hs
data Stream a = forall s. Unlifted s =>
Stream !(s -> Step a s)
!s
map :: (a -> b) -> Stream a -> Stream b
map f (Stream next0 s0) = Stream next s0
where
next !s = case next0 s of
Done -> Done
Skip s' -> Skip s'
Yield x s' -> Yield (f x) s'
Lovely stuff. Years go by, and someone joins your project, and starts
refactoring. In the process, they break this crucial invariant. Now your
programs says:
$ runhaskell HsColour.hs -css < source.hs
HsColour.hs: Prelude.head: empty list
Where do you begin to debug this? Not much information. Now, some people, at
this point would mumble something about monads being too hard, and rewrite the
project in Python, but there's no need. Fire up ghci 6.8.1, and set a
breakpoint on exceptions, and set up some command line args for your program:
$ ghci HsColour.hs
*Main> :set -fbreak-on-exception
*Main> :set args "source.hs"
Now run your program with tracing on, and GHCi will stop your program at
the call to error:
*Main> :trace main
Stopped at (exception thrown)
Ok, good. We had an exception... Let's just back up a bit and see where we are.
Watch now as we travel backwards in time through our program, using the
(bizarre, I know) ":back" command:
[(exception thrown)] *Main> :back
Logged breakpoint at Language/Haskell/HsColour/Classify.hs:(19,0)-(31,46)
_result :: [String]
This tells us that immediately before hitting error, we were in the file
Language/Haskell/HsColour/Classify.hs, at line 19. We're in pretty good shape
now. Let's see where exactly:
[-1: Language/Haskell/HsColour/Classify.hs:(19,0)-(31,46)] *Main> :list
18 chunk :: String -> [String]
vv
19 chunk [] = head []
20 chunk ('\r':s) = chunk s -- get rid of DOS newline stuff
21 chunk ('\n':s) = "\n": chunk s
^^
Ok, that's awesome! Our bogus call to head, deep down in the program is somewhere
on those two lines: can you spot it? :) Fixing that line, and we're back in
business. The days of head scratching at head [] failures are over. Now go and
write some QuickCheck properties to ensure this never happens again.
The GHCi debugger can do a lot more than just travel back in time and find your
bugs for you, and it is just one point in the world of declarative debugging.
For more reading have a look at:
Haskell has a debugger, it's there if you need it.
/home ::
/haskell ::
permalink ::
rss
2007-11-03