Many parallel algorithms require efficient support for reduction collectives. Over the years, researchers have developed optimal reduction algorithms that take into account system size, data size, and the complexity of the reduction operations. However, all of these algorithms assume that the reduction processing takes place on the host CPU. Modern Network Interface Cards (NICs) include programmable processors with substantial memory and thus introduce a new variable into the equation. This raises an interesting question: can we take advantage of modern NICs to implement fast reduction operations? In this paper, we take on this challenge in the context of large-scale clusters. Through experiments on the 960-node, 1920-processor ASCI Linux Cluster (ALC) at Lawrence Livermore National Laboratory, we show that, for the common case, NIC-based reductions achieve lower latency and greater consistency than host-based algorithms, and that these benefits scale as the system grows. In the largest configuration tested (1812 processors), our NIC-based algorithm can sum a single-element vector in 73 microseconds with 32-bit integers and in 118 microseconds with 64-bit floating-point numbers. These results represent improvements of 121% and 39%, respectively, over the production-level MPI library.
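
For readers unfamiliar with reduction collectives, the following minimal sketch simulates a generic binomial-tree sum reduction within a single Python process. It is an illustration only: the abstract does not specify which tree shape or protocol the NIC-based algorithm uses, so the function name `binomial_tree_reduce`, the tree pattern, and the in-memory "ranks" are all assumptions made here, and no network or NIC is involved.

```python
import operator
from typing import Callable, List, Tuple


def binomial_tree_reduce(
    values: List[int], op: Callable[[int, int], int] = operator.add
) -> Tuple[int, int]:
    """Simulate a binomial-tree reduction of one value per rank onto rank 0.

    Hypothetical helper for illustration; not the paper's NIC-based algorithm.
    Returns (reduced_value, number_of_communication_rounds).
    """
    if not values:
        raise ValueError("need at least one rank")

    partial = list(values)  # partial[r] is the running result held by rank r
    n = len(partial)
    rounds = 0
    stride = 1
    while stride < n:
        # In each round, every rank that is a multiple of 2*stride
        # combines its value with the one held by rank + stride.
        for rank in range(0, n, 2 * stride):
            partner = rank + stride
            if partner < n:
                partial[rank] = op(partial[rank], partial[partner])
        stride *= 2
        rounds += 1
    return partial[0], rounds


if __name__ == "__main__":
    # One single-element contribution per processor, as in the largest
    # configuration mentioned above (1812 processors).
    total, rounds = binomial_tree_reduce([1] * 1812)
    print(total, rounds)
```

Under this assumed pattern, the number of communication rounds grows logarithmically with the number of ranks (11 rounds for 1812 ranks), which is why the per-round latency, whether incurred on the host or on the NIC, dominates the cost of small reductions.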