We are interested both in techniques for maintaining the graph, and in proving bounds for how much needs to be updated for different algorithms.
Our initial work in this area is described in the following paper.
Presents the first real-time multiprocessor garbage collection algorithm with provable bounds on time and space. The algorithm requires at most 2(R(1 + 2/k) + N + 5PD) memory locations, where P is the number of processors, R is the maximum reachable space during a computation, N is the maximum number of reachable objects, D is the maximum depth of any data structure, and k is a parameter specifying how many locations are copied each time a location is allocated. Furthermore, the client processes are never stopped for more than time proportional to k non-blocking machine instructions.
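To illustrate the role of k, here is a minimal single-processor sketch in Python of coupling collector work to allocation, in the style of Baker's incremental copying collectors; the class and method names are hypothetical, and the actual algorithm in the paper is a multiprocessor design, not this toy.

    # Toy sketch only: each allocation performs at most k locations of
    # copying work, so no client pause exceeds time proportional to k.
    class IncrementalCopyingGC:
        def __init__(self, k):
            self.k = k              # locations copied per allocation
            self.from_space = []    # live locations still to be copied
            self.to_space = []      # locations already copied

        def allocate(self, obj):
            # Do a bounded amount of collection work before allocating,
            # spreading the copying of the reachable data over many
            # allocations instead of one long pause.
            for _ in range(min(self.k, len(self.from_space))):
                self.to_space.append(self.from_space.pop())
            self.to_space.append(obj)
            return obj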
Our research involves finding theoretically space- and time-efficient scheduling algorithms for multithreaded programs, including programs with nested parallelism and synchronization variables. In addition to proving the space and time bounds of the resulting parallel schedules, we demonstrate that they are efficient in practice. We have implemented a runtime system that uses our algorithm to schedule threads in computations with nested parallelism. The following papers describe our results.
Describes a parallel algorithm for scheduling based on the notion of a parallel depth-first schedule. It proves that any computation that does W work (total number of operations), has D depth (longest chain of dependences, or critical path), and requires S1 space if run sequentially can be scheduled to run on P processors in O(W/P + D) time and S1 + O(P D) total space. We show how to generate such a schedule for any computation off-line, and give an algorithm for nested-parallel computations that can generate an efficient on-line schedule. The previous best known space bound was S1 P (a multiplicative factor of P over the sequential space), based on results by Blumofe and Leiserson. Ours are therefore the first bounds that require only an additive term, rather than a multiplicative factor, over the sequential space.
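As a concrete illustration, below is a hedged Python sketch of computing such a schedule off-line for a DAG given as child lists (the input format and function names are our own). The greedy rule is the parallel depth-first schedule: at each step, run the P ready nodes that come earliest in the sequential depth-first (1-DFS) order.

    def pdfs_schedule(children, indegree, root, P):
        # Assign each node a priority: its position in the 1-DFS order.
        priority, order = {}, []
        def dfs(u):
            priority[u] = len(order); order.append(u)
            for v in children[u]:
                if v not in priority:
                    dfs(v)
        dfs(root)

        indeg = dict(indegree)        # unscheduled parents per node
        ready, schedule = [root], []
        while ready:
            ready.sort(key=lambda u: priority[u])
            step, ready = ready[:P], ready[P:]   # P highest priorities
            schedule.append(step)
            for u in step:
                for v in children[u]:
                    indeg[v] -= 1
                    if indeg[v] == 0:
                        ready.append(v)
        return schedule

    # A diamond DAG on two processors: [[0], [1, 2], [3]].
    print(pdfs_schedule({0: [1, 2], 1: [3], 2: [3], 3: []},
                        {0: 0, 1: 1, 2: 1, 3: 2}, 0, 2))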
Describes an asynchronous version of the scheduling algorithm and the first implementation of our scheduling ideas. This version is more practical than the above-mentioned on-line scheduler since it allows processors to run mostly asynchronously and does not require any preemption (threads can run unimpeded until they decide to terminate). Experimental results are given for a variety of applications. The experiments show improved space usage over previous schedulers, and approximately equal performance.
Describes an on-line scheduling algorithm for languages with synchronization variables (e.g., futures, I-structures, events). Presents the first space bounds for such languages; previous results only applied to nested-parallel or strict languages. The technique uses the parallel depth-first schedule as described in our original paper, but requires balanced trees to properly maintain the thread priorities on-line.
Applies the ideas from the previous paper to the NESL language by specifying a provably-efficient language implementation. The paper describes a "profiling" semantics for NESL, an operational semantics extended with cost measures. The profiling semantics tracks work and depth by returning a DAG, and tracks sequential space by threading a store in a fixed order. The paper then proves the relationship of these cost measures to running time on various machine models.
Describes a new divide-and-conquer parallel algorithm for 2-D Delaunay triangulation, and its implementation in NESL and MPI. The algorithm has O(n log n) work and O(log^3 n) depth. The main idea of the algorithm is to do all the partitioning work on the way down the recursion rather than on the way up (as done by Shamos and Hoey's algorithm, the original O(n log n) time sequential algorithm). Experimental results show a 32-fold speedup on 64 processors (on a Cray T3D). The speedup is relative to the fastest sequential implementation of 2-D Delaunay we could find.
Describes parallel algorithms for union, intersection and difference on ordered sets using random balanced binary trees (treaps), and presents experimental results for the algorithms on the SGI Power Challenge and Sun Enterprise server. For two sets of size n and m (m <= n) the algorithms run in expected O(m log(n/m)) work and O(log n) depth (parallel time) on an EREW PRAM with scan operations. The experiments show speedups of 6.3 to 6.8 on 8 processors of the SGI Power Challenge.
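The Python sketch below shows the divide-and-conquer structure of treap union (the helper names are ours, and it runs sequentially); the two recursive calls are independent, which is where the parallelism comes from.

    import random

    class Node:
        def __init__(self, key, left=None, right=None):
            self.key, self.left, self.right = key, left, right
            self.prio = random.random()   # random priority balances the tree in expectation

    def split(t, key):
        # Returns (keys < key, keys > key); a node equal to key is dropped.
        if t is None:
            return None, None
        if key < t.key:
            l, r = split(t.left, key)
            t.left = r
            return l, t
        elif key > t.key:
            l, r = split(t.right, key)
            t.right = l
            return t, r
        else:
            return t.left, t.right

    def union(t1, t2):
        if t1 is None: return t2
        if t2 is None: return t1
        if t1.prio < t2.prio:
            t1, t2 = t2, t1               # the higher-priority root stays on top
        l, r = split(t2, t1.key)
        # The two recursive calls below are independent and can run in
        # parallel, one processor descending into each side.
        t1.left = union(t1.left, l)
        t1.right = union(t1.right, r)
        return t1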
Perhaps the most important feature of NESL is the model it supplies for analyzing costs. In particular it defines costs in terms of work (total number of operations) and depth (longest sequence of dependences, or critical path) and defines rules for composing these costs across expressions. In general, when making calls sequentially both the work and the depth are summed, and when making calls in parallel the work is summed but the maximum is taken over the depth. This idea of using work and depth as cost measures has been adopted by other languages, such as the Cilk programming language.
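A minimal sketch of these composition rules, representing costs as (work, depth) pairs (the combinator names seq and par are ours):

    def seq(*costs):
        # Sequential composition: both work and depth add up.
        return (sum(w for w, d in costs), sum(d for w, d in costs))

    def par(*costs):
        # Parallel composition: work adds up, depth is the longest branch.
        return (sum(w for w, d in costs), max(d for w, d in costs))

    print(seq((100, 10), (60, 20)))   # (160, 30): one call after the other
    print(par((100, 10), (60, 20)))   # (160, 20): the two calls side by side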
In addition to the NESL home page, the following papers describe our work on NESL.
Describes and motivates the two main ideas behind NESL: nested data parallelism, and the language based performance model. It goes through several examples of parallel algorithms written and analyzed in NESL, including quicksort, primes, sparse matrix vector multiply, FFT, quickmedian, and a convex-hull algorithm.
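As a flavor of such an analysis, here is a hedged Python sketch of quicksort instrumented with the work/depth rules, assuming unit cost per element for partitioning and charging the two recursive calls as if run in parallel (the encoding is ours, not NESL code):

    def qsort(xs):
        # Returns (sorted list, work, depth).
        if len(xs) <= 1:
            return xs, 1, 1
        p = xs[0]
        less    = [x for x in xs if x < p]
        equal   = [x for x in xs if x == p]
        greater = [x for x in xs if x > p]
        l, wl, dl = qsort(less)
        g, wg, dg = qsort(greater)
        # Partitioning costs O(n) work at constant depth (a parallel filter);
        # the recursive calls add their work but only the max of their depths,
        # so the total depth is just that of the recursion tree.
        return l + equal + g, wl + wg + len(xs), max(dl, dg) + 1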
Outlines the implementation of NESL and gives performance numbers on a variety of parallel machines.
Applies the ideas from the previous paper to the NESL language by specifying a provably-efficient language implementation. The paper describes a "profiling" semantics for NESL, an operational semantics extended with cost measures. The profiling semantics tracks work and depth by returning a DAG, and tracks sequential space by threading a store in a fixed order. The paper then proves the relationship of these cost measures to running time on various machine models.
The language definition, including a complete list of functions. It does not contain the operational semantics (see below).
Describes how to set up the NESL environment and how to use the various features in NESL (such as tracing, profiling and using remote machines).
Describes a profiling semantics for the lambda-calculus assuming that in an expression e1 e2 the two subexpressions e1 and e2 can be evaluated in parallel, but that both must fully evaluate before the result of e1 can be applied to the result of e2. This is called call-by-value parallelism. The semantics tracks work and depth as the two cost measures. The paper shows several results, including the following.
Extends the previous work to languages in which, in an expression e1 e2, the function resulting from evaluating e1 can start running before the evaluation of e2 is complete. This is the parallelism available with futures and with lenient languages. Such parallelism can reduce the depth relative to call-by-value parallelism, but implementing it requires synchronizing on individual variables. The paper describes an implementation that guarantees that any expression that evaluates with W work and D depth will run on P processors in O(W/P + D log P) time on a CRCW PRAM.
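To make the call-by-value variant concrete, here is a hedged Python sketch of a profiling evaluator, assuming unit cost per evaluation step and a tuple encoding of terms (both simplifications are ours, not the papers' formulation). The two subexpressions of an application are charged parallel cost, but the body only runs after both finish:

    def eval_(e, env):
        # Terms: ('var', x), ('lam', x, body), ('app', e1, e2).
        # Returns (value, work, depth).
        tag = e[0]
        if tag == 'var':
            return env[e[1]], 1, 1
        if tag == 'lam':
            return ('closure', e[1], e[2], env), 1, 1
        if tag == 'app':
            f, w1, d1 = eval_(e[1], env)       # conceptually in parallel
            a, w2, d2 = eval_(e[2], env)       # with the line above
            _, x, body, cenv = f
            v, w3, d3 = eval_(body, {**cenv, x: a})
            # Work adds across all three; e1 and e2 overlap in depth.
            return v, w1 + w2 + w3 + 1, max(d1, d2) + d3 + 1

    identity = ('lam', 'x', ('var', 'x'))
    print(eval_(('app', identity, identity), {}))   # work 4, depth 3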
Formally defines a cost model for NESL using a profiling semantics. This model defines work (W), depth (D), and sequential space (S1). It then proves a mapping from this model onto various machine models. For example it shows that on a CREW PRAM a computation will run in O(W/P + D log P) time and S1 + O(D P) space.
Pipelining has been used in the design of many PRAM algorithms to reduce their asymptotic running time, but properly maintaining the pipeline is very complicated for the programmer. This paper shows how futures (a parallel language construct) can be used to implement pipelining without requiring the user to code it explicitly, allowing for much simpler code and more asynchronous execution. It describes and analyzes three algorithms using this technique: a parallel merging algorithm on trees, a parallel version of insertion into and deletion from randomized balanced trees (treaps), and insertion into 2-3 trees.
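As a hedged illustration of the construct (not the paper's tree algorithms), the Python sketch below builds a stream as a chain of future-valued tails, so the consumer can process early elements while the producer is still computing later ones; the function names are our own.

    from concurrent.futures import ThreadPoolExecutor

    pool = ThreadPoolExecutor()

    def produce(i, n):
        if i == n:
            return None
        # The tail is a future: handed to the consumer before it is computed,
        # which is what lets the two stages of the pipeline overlap.
        return (i * i, pool.submit(produce, i + 1, n))

    def consume(cell, total=0):
        while cell is not None:
            head, tail = cell
            total += head
            cell = tail.result()   # blocks only if the producer is behind
        return total

    print(consume(produce(0, 5)))   # 0 + 1 + 4 + 9 + 16 = 30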
This work considers issues of memory performance in shared memory multiprocessors for algorithms with irregular memory accesses (e.g., pointer-based algorithms). In particular it develops a model to account for the effect of contention and delay at the memory banks. The work was motivated by observed discrepancies between predicted and actual performance in a number of irregular algorithms implemented for the Cray C90 when the memory contention at a particular location was high. We show experimentally that our model accounts for the performance of such algorithms much better than previous models do.
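The sketch below is not the paper's model, only a simple illustration of the effect being modeled, assuming accesses to the same bank serialize: the time for a round of memory accesses is governed by the most heavily loaded bank, so an algorithm that hammers one location can run far slower than an operation count alone predicts.

    from collections import Counter

    def round_time(addresses, num_banks):
        # Each bank serves its requests one at a time, so a round takes
        # as long as the busiest bank.
        load = Counter(a % num_banks for a in addresses)
        return max(load.values())

    print(round_time(list(range(64)), 16))   # 4: accesses spread evenly
    print(round_time([0] * 64, 16))          # 64: all hit one bank, fully serialized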