February 15, 2011

Fine-grained parallelism in Determinator

At OSDI '10, Bryan Ford and his group presented Determinator, a proof-of-concept operating system.[1] They give evidence that determinism can be enforced at the operating system level. They introduce a race-free model of shared state: state is copied at fork and merged back at an explicit join. When a new thread is forked, it receives its own copy of the parent's memory, implemented with copy-on-write. At each explicit join, the child thread's changes are merged back into the parent's space, and any conflict that occurs is reported as an error.
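The merge-at-join semantics can be illustrated with a small sketch. This is a hypothetical simplification (bytes standing in for pages, `mergeAtJoin` is our own name, not Determinator's API): parent and child each hold a private copy taken at fork, and at join the child's changes are folded back in, with a write-write conflict reported as an error instead of a silent race.

```go
package main

import "fmt"

// mergeAtJoin merges a child's private copy back into the parent,
// given the snapshot both copies started from at fork time.
// A byte both sides changed since the snapshot is a write-write
// conflict and is reported as an error.
func mergeAtJoin(snapshot, parent, child []byte) ([]byte, error) {
	merged := make([]byte, len(snapshot))
	for i := range snapshot {
		parentChanged := parent[i] != snapshot[i]
		childChanged := child[i] != snapshot[i]
		switch {
		case parentChanged && childChanged:
			return nil, fmt.Errorf("write-write conflict at offset %d", i)
		case childChanged:
			merged[i] = child[i]
		default:
			merged[i] = parent[i]
		}
	}
	return merged, nil
}

func main() {
	snap := []byte{0, 0, 0, 0}
	parent := []byte{1, 0, 0, 0} // parent wrote offset 0 after the fork
	child := []byte{0, 0, 2, 0}  // child wrote offset 2 after the fork
	merged, err := mergeAtJoin(snap, parent, child)
	fmt.Println(merged, err) // disjoint writes merge cleanly

	conflict := []byte{9, 0, 0, 0} // child also wrote offset 0
	_, err = mergeAtJoin(snap, parent, conflict)
	fmt.Println(err) // conflict surfaces as an error, not a race
}
```

The point of the sketch is that determinism does not come from scheduling: whichever order the threads ran in, the join either produces the same merged state or fails loudly.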

Today's operating systems do not provide any guarantees about the determinism of parallel execution. Yet debugging, fault tolerance, and security would all benefit from a deterministic operating system: program failures due to race conditions would essentially vanish. Current approaches achieve determinism only at high cost, and many of them rely on a deterministic scheduler.

In the original evaluation Determinator performs well, except for fine-grained parallelism. In particular, its LU-decomposition performance is low for this reason. The issue mainly arises from the fact that no shared memory is available by default, and where it is used, the cost of maintaining determinism increases. The authors point to future hardware enhancements. In the following sections we discuss how a change within the architecture could tackle these issues.

Our key idea is that the system could benefit from additional parallel programming patterns for these workloads. Patterns other than fork/join could communicate and reuse the forked threads more efficiently.

The idea is to integrate additional parallel programming patterns directly into the architecture of the system. Instead of forking a single thread, we suggest forking several threads that are connected according to a programming pattern. Determinism is enforced by the operating system, which supports the pattern and knows its structure. We show how this could be achieved for the pipeline pattern; [2, 4] describe two ways in which LU decomposition can be turned into a pipelined execution.

Integrating these patterns into the operating system offers several advantages, foremost among them usability for the programmer. Determinism would be enforced by the underlying system without any additional effort by the programmer. Furthermore, the operating system could collect statistics about programs and turn them, together with its own optimizations, into performance gains during execution.

A pipeline consists of an ordered sequence of compute units. The input of each compute unit is linked to its predecessor's output, usually via a shared double-ended queue; recent research has produced a deque algorithm that avoids both blocking and mutual exclusion.[3] In the current fork/join programming model, each compute unit would be a separate space with a memory region shared with its predecessor and successor. Inputs for the first compute unit are provided by an external service, which splits the input into chunks and hands them to the first compute unit. The overall output is delivered once each compute unit has processed the chunks one after the other and the last compute unit has delivered its chunk. The more compute units are used, the more parallelism can be achieved.
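The structure described above can be sketched as follows. This is a minimal illustration, not Determinator's implementation: goroutines stand in for the separate spaces, channels stand in for the shared queues between neighbours, and the anonymous producer in `main` plays the role of the external splitter.

```go
package main

import "fmt"

// stage wraps one compute unit: it reads chunks from its
// predecessor's output channel, applies f, and forwards the
// result to its successor.
func stage(f func(int) int, in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for x := range in {
			out <- f(x)
		}
	}()
	return out
}

func main() {
	// External splitter: feeds input chunks to the first unit.
	src := make(chan int)
	go func() {
		defer close(src)
		for i := 1; i <= 4; i++ {
			src <- i
		}
	}()

	// Two compute units in sequence: double, then increment.
	doubled := stage(func(x int) int { return 2 * x }, src)
	out := stage(func(x int) int { return x + 1 }, doubled)

	for y := range out {
		fmt.Println(y) // 3, 5, 7, 9 -- same order as the input
	}
}
```

Because each chunk flows through the units in the order the splitter supplied it, the final output order is fixed regardless of how the stages are scheduled, which is exactly the determinism property the text relies on.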

A pipeline itself is deterministic by definition, as long as the input is supplied in the same order; a deterministic split mechanism, which is the usual case, guarantees this. We introduce non-deterministic behavior by running multiple instances of each compute unit (a compute group). Each input chunk must then be identified by a position in a total order, and the compute group computes the chunks in that order, since its input queue does not enforce the ordering by itself. To do so, the compute group keeps a shared counter that is updated with an atomic instruction such as the processor's test-and-set. Hence the instances of a compute group have to share a small memory region as well.
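A compute group can be sketched like this (again a simplified illustration, with `computeGroup` as our own name): several instances of the same compute unit claim chunks through a shared atomic counter, and each result goes into the slot given by the chunk's position in the total order, so the output is independent of which instance ran which chunk. For brevity the sketch uses atomic fetch-and-add rather than a test-and-set loop; both serve the same purpose of serializing access to the shared counter.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// computeGroup runs `instances` copies of the compute unit f over the
// totally ordered chunks. The shared counter `next` hands out the
// index of the next unclaimed chunk; the result slot is fixed by that
// index, restoring a deterministic output order.
func computeGroup(chunks []int, instances int, f func(int) int) []int {
	results := make([]int, len(chunks))
	var next int64 // shared counter, updated atomically
	var wg sync.WaitGroup
	for i := 0; i < instances; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				idx := atomic.AddInt64(&next, 1) - 1
				if idx >= int64(len(chunks)) {
					return
				}
				results[idx] = f(chunks[idx]) // slot fixed by the total order
			}
		}()
	}
	wg.Wait()
	return results
}

func main() {
	out := computeGroup([]int{1, 2, 3, 4, 5}, 3, func(x int) int { return x * x })
	fmt.Println(out) // [1 4 9 16 25], regardless of scheduling
}
```

The shared counter is the only piece of mutable state the instances share, which matches the observation in the text that a compute group needs only a small shared memory region.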

An advantage of integrating additional parallel programming patterns is that each pattern has different characteristics. We have shown how this applies to the pipeline pattern; it might also apply to patterns such as master/worker or work stealing.

The characteristics of the pipeline pattern would allow finer-grained parallelism than the fork/join pattern. It also allows the forked pipeline to be reused later without forking and synchronizing the threads again. Moreover, the pipeline pattern helps to process streams of input data, workloads that currently require extensive communication between the child and its parent.

[1] A. Aviram, S.-C. Weng, S. Hu, and B. Ford. Efficient system-enforced deterministic parallelism. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI '10. USENIX Association, 2010.
[2] F. Desprez, B. Tourancheau, and J. J. Dongarra. Performance complexity of LU factorization with efficient pipelining and overlap on a multiprocessor. Technical report, University of Tennessee, Oak Ridge National Laboratory, LIP, 1994.
[3] D. Hendler, Y. Lev, M. Moir, and N. Shavit. A dynamic-sized nonblocking work stealing deque. Technical report, Tel-Aviv University, Brown University, Sun Microsystems Laboratories, 2005.
[4] M. Mosleh, S. Setayeshi, and M. Kheyrandish. Presenting a systematic method for LU decomposition of a matrix with linear systolic arrays. In Proceedings of the 2008 International Conference on Advanced Computer Theory and Engineering.
