Demand for increasingly-higher computing capability is driving a similar growth in compute cluster sizes, soon to be reaching tens of thousands of processors. This growth is not matched however by system software, which has remained largely unchanged from the advent of clusters. The failure of system software to scale and develop in the same rate as the underlying hardware constrains the productivity of these machines by severely limiting their utilization, reliability, and responsiveness. The traditional approach to system software, namely, the use of loosely-coupled independent daemons on each node, is inadequate for the management of large-scale clusters, a problem which is inherently tightly-coupled and requires a high degree of synchronization. One model for large-scale system software is Buffered Coscheduling (BCS), wherein synchronization and scalability are obtained by means of global scheduling of all system activities and collective network operations. BCS represents a new methodology for the design of system software as a single, parallel program using traditional parallel constructs. As such, system software can be made orders of magnitude more scalable, simple, and easy to debug than the existing distributed solutions. The most important aspect of the BCS model and the overlying system software is the buffering and scheduling of all communication, resulting in highly controllable and deterministic system behavior. This chapter describes in detail the implementation of BCS-MPI, an MPI library designed after this model, and shows that the benefits of determinism need not come at a significant performance cost. Furthermore, BCS-MPI comes with a sophisticated monitoring and debugging subsystem that simplifies the analysis of system and application performance, and is covered in detail in this chapter.