A comparison of different approaches to deterministic parallel programming in Haskell
Introduction
When I started out with Parallel Haskell I saw that there were many different ways of doing it (the Eval Monad, the Par Monad, Accelerate, Repa) and, as is often the case with Haskell, a whole lot of new terminology came along with them. With this tutorial I hope to help prospective students of Parallel Haskell understand a part of it that I consider key to understanding the rest. So if you wonder what the heck the difference is between conservative parallelism and speculative evaluation, why we need the Eval Monad when we already had a good model with par and seq, what the difference between the Par Monad and the Eval Monad is in practice, or how the spark pool works and, above all, WHAT IT IS, continue reading…
This tutorial assumes some prior knowledge of functional programming in Haskell (especially function composition), of what Monads are, and of lazy evaluation (i.e. the difference between Normal Form and Weak Head Normal Form). I will mostly focus on par/seq/pseq, Strategies, the Eval Monad, and then the Par Monad. I will also intentionally defer some important aspects of parallel computation, like the NFData type class and the force function (even if I touch upon them briefly). Getting into them at the start would only add confusion and make the reader miss the forest for the trees. I would recommend starting with this tutorial and then moving on to those topics afterwards, to have a separation of concerns ;-).
Deterministic, parallel, what?
We distinguish between parallelism and concurrency. A parallel program is simply a computation that is split into different parts, where each part is computed in parallel on a different CPU, as when rendering an image. A concurrent program is a program that is split into independent threads that can do the same thing or different things (not necessarily in parallel) and can communicate with each other; a server taking requests and sending replies, for instance.
A parallel program is by definition deterministic, since its result should always be the same regardless of whether you run it today or tomorrow. A concurrent program is non-deterministic by nature, since the time taken by the individual computations on the separate threads can't be known beforehand. In this tutorial we focus only on deterministic parallelism, for which Haskell is especially well suited.
A Brief history lesson
par and seq were introduced into Haskell before the Monad had come along. This is the main reason why there is an Eval Monad today with a similar rpar and rseq. When monads came along, the old way was found to be insufficient. But before we look at the new model we will start by looking at the old model's components, par and seq. These are the most low-level parts of parallel Haskell; the other parallel models build on top of them. We start with par.
par :: a -> b -> b
par simply sparks its first argument and returns its second argument. This function only tells the runtime that the first argument can be evaluated in parallel with something else, nothing more. It does not say with what, or when, or that it must be evaluated in parallel at all. The spark could still be garbage collected and the expression evaluated in the normal sequential way. This is why we also have seq.
seq :: a -> b -> b
seq helps us with the evaluation order. It says that b can only be evaluated once a has been evaluated. Its use can be illustrated with a heavy computation, like calculating a Fibonacci series. The crux is that this evaluation only goes to WHNF (Weak Head Normal Form), so the work done will in most cases be less than you think. Here is an example.
maximum' [x] = x
maximum' (x:xs) = max x (maximum' xs)
minimum' [x] = x
minimum' (x:xs) = min x (minimum' xs)
homogenList [] = error "Lack of elements"
homogenList [x,y] = x == y
homogenList xs = (min `par` max) `seq` (min == max)
where min = minimum' xs
max = maximum' xs
This is a silly example, but it still illustrates the point. In the example, par evaluates to max and seq evaluates to min == max, so the whole expression evaluates to a boolean. Now this definition of homogenList must surely mean that min and max are evaluated in parallel before their results are compared for equality, RIGHT?
Well, think again…
Remember that par does not make any guarantees about when to run its first argument; it only tells the runtime system that it is possible to evaluate it in parallel if there are idle processors ready to work. So while one processor starts evaluating max, another will probably start evaluating min in parallel in this case, but there are still no guarantees. Two good rules of thumb when thinking of using par are:
- The value of 'a' in par a b is likely to be required later, but not immediately.
- 'a' is not a trivial computation, so the cost of sparking it does not overshadow the benefits obtained by running it in parallel.
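To make the two rules concrete, here is a sketch of the classic parallel Fibonacci. So that the snippet compiles with nothing but base, par and pseq are given sequential stand-in definitions (an assumption of this sketch); with the parallel package installed you would instead import Control.Parallel and delete them.

```haskell
-- Sequential stand-ins so this sketch needs only base. The real par
-- (from Control.Parallel) sparks its first argument; the real pseq
-- guarantees its first argument is evaluated before the second.
par, pseq :: a -> b -> b
par  _ b = b
pseq a b = a `seq` b

-- Both rules of thumb hold: pfib (n - 1) is needed later (not
-- immediately) and is a non-trivial computation worth sparking.
pfib :: Int -> Int
pfib n
  | n < 2     = n
  | otherwise = x `par` (y `pseq` (x + y))
  where
    x = pfib (n - 1)
    y = pfib (n - 2)
```

Even with the sequential stand-ins the answer is the same, which is the whole point of deterministic parallelism: pfib 20 is 6765 whether or not any spark is ever converted.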
Some definitions you must understand
Control-oriented parallelism
forceList :: [a] -> ()
forceList [] = ()
forceList (x:xs) = x `seq` forceList xs
pQuicksort [] = []
pQuicksort [x] = [x]
pQuicksort (x:xs) = (forceList losort) `par`
(forceList hisort) `par`
losort ++ (x:hisort)
where
losort = pQuicksort [y | y <- xs, y < x]
hisort = pQuicksort [y | y <- xs, y >= x]
pQuicksort is an example of (divide-and-conquer) control-oriented parallelism, where parts of the function are evaluated in parallel (x itself is not evaluated in parallel). The use of forceList may be slightly confusing, but you can think of it as forcing its list argument to be fully evaluated to Normal Form. This is also what the function force does, by the way. Forcing the evaluation of parts of a data structure is often necessary when doing parallel programming in Haskell, because if you don't, that lazy bastard Haskell will only evaluate expressions to WHNF in parallel => not much parallelism. Also remember that you don't want to do too much forcing, since then you lose the beautiful aspect of laziness: not doing unnecessary work. But we won't speak of that any more, promise, and I think this is all you need to know for now. Let's stick to the point.
Data-oriented parallelism
Data-oriented parallelism is an alternative approach where the elements of a data structure are evaluated in parallel. It is about doing the same computation to every element of the data structure, so the computations can be done independently of each other.
parMap :: (a -> b) -> [a] -> [b]
parMap f [] = []
parMap f (x:xs) = fx `par` fxs `seq` (fx:fxs)
where
fx = f x
fxs = parMap f xs
fx is sparked before recursing down the list (fxs), and the first constructor of the result list is only returned after every element has been sparked.
3 important aspects
Parallelism control, which specifies what threads should be created, and in what order, using par and pseq.
Evaluation degree, which specifies how much evaluation each thread should perform, i.e. how lazy you let it be. In the examples above, forcing functions were used to describe the evaluation degree.
Thread granularity: it is important to spark only those expressions where the cost of evaluation exceeds the overhead of creating a thread. This is usually controlled with a threshold argument that tells the function when it should stop sparking new threads.
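A sketch of threshold-controlled granularity (with a sequential stand-in for par, an assumption made so it compiles with base alone; the real par lives in Control.Parallel): the Fibonacci function switches to a plain sequential version below a cut-off.

```haskell
-- Sequential stand-in; the real par (Control.Parallel) sparks its
-- first argument for possible parallel evaluation.
par :: a -> b -> b
par _ b = b

-- Plain sequential version, used below the threshold.
sfib :: Int -> Int
sfib n | n < 2     = n
       | otherwise = sfib (n - 1) + sfib (n - 2)

-- Spark only while the remaining work is big enough to pay for the
-- overhead of creating a spark; at or below the threshold t, go
-- sequential instead of flooding the spark pool with tiny thunks.
pfib :: Int -> Int -> Int
pfib t n
  | n <= t    = sfib n
  | otherwise = x `par` (y `seq` (x + y))
  where
    x = pfib t (n - 1)
    y = pfib t (n - 2)
```

The threshold t is exactly the knob the text describes: tune it by experiment until sparks are big enough to be worth converting.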
Conservative parallelism and Speculative evaluation
Conservative parallelism describes parallelized programs whose evaluation degree is the same as in the lazy counterpart. Speculative evaluation is when the evaluation degree may usefully be more strict than in the lazy function. So the more strictness is allowed compared to the lazy counterpart, the more speculative the program is (the more of the program is evaluated eagerly). By the way, that's why we also have pseq and not only seq: pseq is semantically identical to seq but is not strict in both its arguments, only in its first argument, which guarantees that the first argument is evaluated before the second.
Parallelizing a program with strategies
par, seq and pseq are the main building blocks of the original parallel model. On top of these we have the Strategies abstraction, which helps us separate the algorithm that solves our problem from the strategy for parallelizing it (no pun intended). Strategies are composable, meaning that smaller strategies can be combined with other strategies to form new, bigger ones.
type Strategy a = a -> ()
So a strategy is simply a function from a -> ()
and all it can do is evaluate parts of 'a' and return unit. To apply a strategy we normally use the using
function.
using :: a -> Strategy a -> a
using x s = s x `pseq` x
Notice what the function does: it takes the argument x, applies the strategy to it, and returns x. Whatever the strategy does could happen in parallel or not; the using function doesn't care, only that the strategy is applied before x is returned (hence the pseq). Here is a quick example of how to use it.
parList :: Strategy a -> Strategy [a]
parList strat [] = ()
parList strat (x:xs) = strat x `par` (parList strat xs)
parMap :: Strategy b -> (a -> b) -> [a] -> [b]
parMap strat f xs = map f xs `using` parList strat
This shows how strategies can be composed (parList takes a strategy and returns a strategy for lists), and it also shows how strategies separate the algorithm from the evaluation strategy (on the left of using we have the algorithm, on the right the strategy).
Memory management and the Sparkpool
So what does it actually mean that the expression 'a' in par a b is sparked? What does the implementation look like?
Well, in practice it means that a pointer is created in the spark pool that points to the expression in your program (on the heap), so the runtime system knows what the programmer wants to run in parallel. A spark is a kind of lightweight thread (a thread spawned at the application level with the help of an API, and therefore not an OS thread). The spark pool consists of all the expressions that idle processors can take and evaluate in parallel. If a processor's own queue of sparks is empty and it becomes idle, it can "steal" a spark from another processor's queue. So whenever an idle processor starts evaluating a spark, it only needs the pointer to get going. Very simple, isn't it?
The garbage collector can treat sparks as either WEAK or ROOT. WEAK means that the garbage collector will throw away sparks that are not referenced by some other pointer in the heap. If we apply a WEAK policy, then no parallelism will occur in the case of, for instance, parList, since all the elements of the list are referenced only from the spark pool for the purpose of being evaluated in parallel. ROOT means that the garbage collector will always consider sparks live, and because not all sparked expressions will be needed, the garbage collector will retain unnecessarily many sparks, which leads to a space leak.
Fizzled sparks are sparks that already refer to values and can no longer be evaluated in parallel. This happens when, at the time the value of an expression was needed, no idle processor was available and the expression had to be evaluated in the normal sequential order. The garbage collector should throw away all fizzled sparks, since they are unnecessary, but because the sparked expressions are wrapped in applications of the strategy ('using' strat x) and are therefore only partly evaluated, they almost never fizzle (refer to a value instead of an expression). A non-compositional variant like parListWHNF, which applies the strategy directly to the data structure, does make fizzling possible, in contrast to the compositional parList.
Under the ROOT policy, speculative parallelism becomes a space leak, whereas under the WEAK policy it will be reclaimed by the garbage collector if not used; therefore we can only use the WEAK policy. So what we want at this point is to somehow reclaim the unnecessary sparks without reclaiming the useful ones.
So, in conclusion: when a spark is created, there is a pointer in the spark pool pointing to the sparked expression, but nothing in the heap (your program) that points to that same expression. In your heap you only have the pointer to the expression you passed to the strategy. And this is the main problem. We somehow need the program to keep a pointer to the same expression that the spark in the spark pool points to, so that the garbage collector can do its normal work on the space leaks: the unnecessary unevaluated expressions and the fizzled sparks.
Eval Monad to the rescue
The root of the problem with the old strategies was that they returned the unit type (). By just delegating the expression to the spark pool, there was no way of reaching the expression from the heap. Therefore the authors of the new model created the Eval Monad, which makes it possible to return the new expression together with the sparked computation, so that the sparked expression is now reachable from both the spark pool and the heap. Sounds pretty sweet, huh? Well, yes it is…
This means that we can actually do speculative parallelism, since a speculative spark will only be retained as long as it is reachable from the heap; thereafter it will be garbage collected even if it is still pointed to by the spark pool. Ok, enough talk, this is mama's brand new bag!
data Eval a = Done a
instance Monad Eval where
return x = Done x
Done x >>= k = k x
runEval :: Eval a -> a
runEval (Done a) = a
Eval is our new evaluation Monad. Another important difference is that the new strategies are now the basic building blocks.
rpar :: Strategy a
rpar x = x `par` return x
rseq :: Strategy a
rseq x = x `pseq` return x
And this is how we use them
homogenList [] = error "Lack of elements"
homogenList [x,y] = x == y
homogenList xs = runEval $ do
min <- rpar (minimum' xs)
max <- rseq (maximum' xs)
return (min == max)
Here we see the stuff we were discussing before. min and max are now sparked expressions that we can reach from the heap, which we could not before, and consequently they can be garbage collected properly. min and max will not be garbage collected as long as our program is using those variables. Nice to know, huh? Yes it is… Yes it is.
Also, before, we did not know when, or with what, a spark would be run in parallel, but now the evaluation order of the expressions is specified with respect to each other, which brings some control back to the programmer. The using function that separates the strategy from the algorithm now looks like this.
using :: a -> Strategy a -> a
x `using` s = runEval (s x)
parList and parMap that we looked at before are now defined in the following manner.
parList :: Strategy a -> Strategy [a]
parList s = evalList (rpar . runEval . s)
evalList :: Strategy a -> Strategy [a]
evalList s [] = return []
evalList s (x:xs) = do
x' <- s x
xs' <- evalList s xs
return (x':xs')
parMap strat f xs = map f xs `using` parList strat
All this sounds very promising; I think I'll buy it… Wait, unfortunately there are still some problems with this. Remember the determinism stuff I mentioned before? Well, unfortunately this new model breaks those guarantees of determinism, which COULD be kind of important, depending of course on whether the program is responsible for the cooling of a nuclear reactor or the cooling of a sweaty monkey's butt. Here is what I mean.
a `par` b == b
is still true.
x `using` strat == x
is now FALSE. In the original strategies model it was true, but now 'strat' could return a different value. Look back at the definition of using and think about it for a while. In that monadic computation a different value could be returned. That was also the reason for creating the Eval Monad: the authors wanted a new value that contained the sparked expression. But now there is no way of controlling whether that value is the same as before.
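To see the broken guarantee concretely, here is a self-contained sketch (base only) using the Eval monad exactly as defined above, plus a deliberately bad strategy (badStrat is my own illustrative name). Nothing in the type Strategy a = a -> Eval a stops a strategy from inventing a value:

```haskell
-- The Eval monad from the text (Functor/Applicative instances added,
-- since modern GHC requires them for a Monad instance).
data Eval a = Done a

runEval :: Eval a -> a
runEval (Done a) = a

instance Functor Eval where
  fmap f (Done a) = Done (f a)

instance Applicative Eval where
  pure = Done
  Done f <*> Done a = Done (f a)

instance Monad Eval where
  Done a >>= k = k a

type Strategy a = a -> Eval a

using :: a -> Strategy a -> a
x `using` s = runEval (s x)

-- A perfectly well-typed strategy that returns a different value,
-- so (x `using` badStrat) == x no longer holds.
badStrat :: Strategy Int
badStrat _ = pure 42
```

Here (10 `using` badStrat) is 42, not 10: with the full API, determinism holds only by convention, which is exactly why a restricted safe subset is needed.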
Wait, don't worry. There is a remedy for this. But it will cost you, just a little bit. The cost is that you won't be able to construct your own strategies; you will still be able to use the existing ones, though. This is done by restricting the user to a subset of the API that is safe to use, so that the guarantee of determinism is upheld. Most uses of strategies only need this safe API to do their thing. That doesn't sound so bad, does it? Consider the gains.
Par Monad?
WTF! A NEW Monad for parallelism? What was wrong with the one I just saw? I guess you think that by now you are missing some big point. But don't worry, you aren't. The reason for the Par Monad is to gain some advantages that the Eval Monad misses, since the latter is geared towards lazy data structures.
The Par Monad could be said to handle fully evaluated bulk data, where the programmer knows laziness won't help. The Eval Monad and its strategies consume lazy data structures, so the programmer needs to know where to force evaluation and where to leave it non-strict. With the list type this is often not so hard, since we know that the head is usually evaluated by pattern matching or similar, while the tail is off sipping piña coladas. It becomes harder to guess, though, the more complex the data structure becomes.
The Par Monad is a high-level abstraction that helps us with this.
newtype Par a
instance Applicative Par
instance Monad Par
runPar :: Par a -> a
The Monad does not do any IO or anything crazy like that; it only runs the parallel parts of the computation and returns the result. As with the Eval Monad's strategies, we are abstracted away from the details, so that the computations are guaranteed to be deterministic. To create parallelism we use fork.
fork :: Par () -> Par ()
We use the same terminology as in concurrent programming: we say that the computation calling fork is the parent and the argument to fork is the child. To get results from the child to the parent we use an I-structure called an IVar.
data IVar a
new :: Par (IVar a)
get :: IVar a -> Par a
put :: NFData a => IVar a -> a -> Par ()
put takes an IVar and writes a value 'a' to it. This can only be done once. get takes the IVar and returns the value once put has delivered it. Remember that this is done in parallel, so this models what are called futures and promises. Do you see that NFData constraint on put, and wonder why it's there? You maybe noticed it before? Well, it is a type class saying that the argument 'a' can be forced to Normal Form (NF, get it?). The reason for having it is that put is strict and evaluates its argument to Normal Form. If you don't want this you can use its less strict little brother put_, which evaluates to WHNF. Here is an example of how it works.
homogenList [] = error "Lack of elements"
homogenList [x,y] = x == y
homogenList xs = runPar $ do
minVar <- new
maxVar <- new
fork (put minVar (minimum' xs))
fork (put maxVar (maximum' xs))
max <- get maxVar
min <- get minVar
return (min == max)
fork and IVar together form a dataflow network.
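The dataflow style becomes clearer with the spawn combinator, which the real monad-par library also provides (there with an NFData constraint). To keep this sketch runnable with base alone, the Par/IVar interface below is a sequential mock (an assumption of this sketch: fork runs the child immediately and runPar lives in IO); with the monad-par package you would import Control.Monad.Par and keep only spawn and parMapM.

```haskell
import Data.IORef

-- Sequential mock of the Par monad interface. These are stand-ins for
-- the real Control.Monad.Par API, which runs children in parallel and
-- has a pure runPar.
newtype Par a  = Par { unPar :: IO a }
newtype IVar a = IVar (IORef (Maybe a))

instance Functor Par where
  fmap f (Par m) = Par (fmap f m)
instance Applicative Par where
  pure = Par . pure
  Par f <*> Par a = Par (f <*> a)
instance Monad Par where
  Par m >>= k = Par (m >>= unPar . k)

new :: Par (IVar a)
new = Par (IVar <$> newIORef Nothing)

put :: IVar a -> a -> Par ()   -- the real put also forces to Normal Form
put (IVar r) x = Par (writeIORef r (Just x))

get :: IVar a -> Par a
get (IVar r) = Par (maybe (error "empty IVar") id <$> readIORef r)

fork :: Par () -> Par ()       -- mock: run the child right away
fork child = child

runPar :: Par a -> IO a        -- the real runPar is pure
runPar = unPar

-- The dataflow pattern itself: spawn forks a child that will fill an
-- IVar, and hands back the IVar as a future.
spawn :: Par a -> Par (IVar a)
spawn p = do
  v <- new
  fork (p >>= put v)
  return v

-- Fork one child per element, then collect all the futures.
parMapM :: (a -> Par b) -> [a] -> Par [b]
parMapM f xs = mapM (spawn . f) xs >>= mapM get
```

spawn is exactly the edge of the dataflow graph the text describes: the child node produces, the IVar carries the value, and get in the parent consumes it.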
Some other comparisons with Strategies
Some "behind the scenes" looks tell us that the runPar function is a little more expensive than runEval, which is free of charge. So when using the Par Monad you should try to put it around all the places that need parallelism and avoid multiple calls to runPar. Because of this higher overhead you should also control granularity more carefully with the Par Monad, by putting a threshold on when forking should stop. Where to put this threshold is usually found by experimentation with ThreadScope. The Eval Monad has better diagnostics in ThreadScope: there are graphs that show different aspects of sparks (creation rate, conversion rate, and so on). The Par Monad is not currently integrated with ThreadScope.
Things not mentioned but worth looking up
High performance, regular, multi-dimensional, shape polymorphic parallel arrays: Repa
GPU programming: Accelerate
I hope you enjoyed the tutorial and that some of the confusion has been dispelled.
References
Paper: Algorithm + Strategy = Parallelism P. Trinder, K. Hammond, H.-W. Loidl, and S. Peyton Jones
Paper: Seq no more: Better Strategies for Parallel Haskell S. Marlow, P. Maier, H.-W. Loidl, M. Aswad, and P. Trinder
Paper: A Monad for Deterministic Parallelism S. Marlow, R. Newton, and S. Peyton Jones
Paper: Parallel and Concurrent Programming in Haskell S. Marlow
hackage: pseq/par
Paper: Deterministic parallel programming with Haskell D. Coutts and A. Löh
Book: Parallel and Concurrent Programming in Haskell S. Marlow