BCS: Buffered CoScheduling
Buffered CoScheduling (BCS) is a new approach to the design of the communication layer for large-scale parallel computers. The innovative contribution of BCS is that it imposes a specific degree of determinism on the scheduling of communication in parallel applications.
BCS-MPI is an implementation of the BCS methodology. BCS-MPI orchestrates all communication activities at fixed intervals (timeslices) of a few hundred microseconds. In each interval communication is strictly scheduled: only the messages that have been globally scheduled and can be delivered within that interval are injected into the network. Messages that require more than one interval are split into segments and scheduled over multiple intervals. The important consequence of this approach is that at the end of each interval the network is empty of messages. This guarantees that at certain known times during program execution there are no messages in transit; it is in this sense that determinism is imposed.
For a checkpointing and rollback-recovery mechanism this vastly simplifies the state that must be captured: the network is empty and all pending messages (or portions of messages) are known, so the remaining state is the set of memory images of the application's processes and the checkpointed data itself.
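The timeslice scheduling described above can be illustrated with a minimal sketch. It is not the BCS-MPI implementation: it assumes a single shared link with a fixed per-timeslice byte budget, and the names Message, schedule_timeslices and slice_capacity are hypothetical. It only shows how messages larger than one timeslice are split into segments so that nothing remains in transit when a timeslice ends.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Message:
    msg_id: str
    size: int  # bytes that still have to be delivered


def schedule_timeslices(messages, slice_capacity):
    """Assign messages to timeslices, splitting those that do not fit.

    Assumption (not from the source): at most `slice_capacity` bytes can be
    delivered in one timeslice. Returns a list of timeslices, each a list of
    (msg_id, nbytes) segments whose total never exceeds the budget, so the
    modelled network is empty at every timeslice boundary.
    """
    if slice_capacity <= 0:
        raise ValueError("slice_capacity must be positive")

    pending = deque((m.msg_id, m.size) for m in messages)
    slices = []
    while pending:
        budget = slice_capacity
        current = []
        while pending and budget > 0:
            msg_id, remaining = pending.popleft()
            chunk = min(remaining, budget)
            current.append((msg_id, chunk))
            budget -= chunk
            if remaining > chunk:
                # The rest of this message waits for the next timeslice.
                pending.appendleft((msg_id, remaining - chunk))
        slices.append(current)
    return slices


if __name__ == "__main__":
    msgs = [Message("A", 300), Message("B", 500), Message("C", 100)]
    for i, segments in enumerate(schedule_timeslices(msgs, 400)):
        print(f"timeslice {i}: {segments}")
```

In this sketch the pending queue at a timeslice boundary plays the role of the "known pending portions of messages" that a checkpoint would record; the real system schedules globally across all nodes rather than on one link.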
GLOBAL SYNCHRONIZATION PROTOCOL OF BCS-MPI
The implementation of this mechanism is based on a set of communication primitives (the BCS core) that are tightly coupled with the primitives provided at the hardware level by the network.
The performance and feasibility of this mechanism have been evaluated and validated on a preliminary prototype implemented at user level (i.e., not in the OS), with most of the code running on the NICs. An implementation at the system level, e.g. as a Linux kernel module, is expected to be faster.
Despite the constrained communication of BCS-MPI, preliminary results for synthetic benchmarks show that the application's performance loss is less than 7.5 percent with a computation granularity of 10 ms on 62 processors.
Moreover, the slowdown decreases significantly as the computation granularity increases, because the delay introduced in communication is amortized over a longer computation time.
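The amortization argument can be made concrete with a toy model. This is an illustration only, not a model taken from the BCS-MPI evaluation: it assumes that every computation step is followed by communication that adds a fixed extra delay, and the function name relative_slowdown and the delay value used below are hypothetical.

```python
def relative_slowdown(granularity_ms, delay_ms):
    """Fractional slowdown of one compute/communicate step.

    Toy assumption: a step that would take `granularity_ms` of computation
    is lengthened by a fixed communication delay of `delay_ms`.
    """
    if granularity_ms <= 0:
        raise ValueError("granularity_ms must be positive")
    return delay_ms / granularity_ms


if __name__ == "__main__":
    hypothetical_delay_ms = 0.5  # illustrative value, not a measurement
    for granularity_ms in (10, 20, 50, 100):
        slowdown = relative_slowdown(granularity_ms, hypothetical_delay_ms)
        print(f"granularity {granularity_ms:>3} ms: slowdown {slowdown:.1%}")
```

Under this assumption the slowdown is inversely proportional to the granularity, which matches the qualitative trend described above; it makes no claim about the measured values.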
Evaluation of the scalability of this mechanism has shown that the slowdown is almost independent of the number of processors in the system. This result provides strong evidence that BCS-MPI will scale to very large systems.
The figures below illustrate these performance results for BCS-MPI.