Cartoon illustration of communication between simulation elements in an experiment using the Conduit software.

The parallel and distributed processing capacity of high-performance computing (HPC) clusters continues to grow rapidly and enable profound scientific and industrial innovations. These advances in hardware capacity and economy afford great opportunity, but also pose a serious challenge: developing approaches to effectively harness it.

Software and hardware that relax guarantees of correctness and determinism (a so-called “best-effort model”) have been shown to improve speed. This work distills best-effort communication from the larger issue of best-effort computing. Specifically, we investigate the implications of relaxing synchronization and message delivery requirements. Such a best-effort approach meets the challenges of heterogeneous, varying (i.e., due to power management), and generally lower communication bandwidth (relative to compute) expected on future HPC hardware. Notably, such a model presents the possibility of runtime adaptation to effectively utilize available resources given the particular ratio of compute and communication capability at any one moment in any one rack.
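To make the idea of relaxed synchronization and delivery concrete, here is a minimal, single-threaded Python sketch. It is not Conduit's API: the `BestEffortChannel` class, its method names, and the policy of dropping new messages when the buffer is full are all assumptions chosen for illustration. The sender never waits for the receiver, and the receiver simply takes the newest message available whenever it polls.

```python
from collections import deque


class BestEffortChannel:
    """Hypothetical bounded channel: sends never block, overflow is dropped."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()
        self.attempted = 0  # messages the sender tried to send
        self.dropped = 0    # messages discarded because the buffer was full

    def try_send(self, message):
        """Enqueue if there is room; otherwise drop the message. Never waits."""
        self.attempted += 1
        if len(self.buffer) >= self.capacity:
            self.dropped += 1
            return False
        self.buffer.append(message)
        return True

    def receive_latest(self):
        """Drain the buffer and return the newest message, or None if empty."""
        latest = None
        while self.buffer:
            latest = self.buffer.popleft()
        return latest


def simulate(sender_steps, receiver_period, capacity):
    """Sender emits one message per step; receiver polls every
    `receiver_period` steps, modeling a slower or busier peer."""
    channel = BestEffortChannel(capacity)
    seen = []
    for step in range(sender_steps):
        channel.try_send(step)  # the sender proceeds regardless of the outcome
        if (step + 1) % receiver_period == 0:
            seen.append(channel.receive_latest())
    return channel, seen


channel, seen = simulate(sender_steps=12, receiver_period=4, capacity=2)
print("received:", seen)
print("dropped:", channel.dropped, "of", channel.attempted)
```

In this toy run the sender completes all twelve steps without ever stalling, at the cost of half its messages being discarded. A different overflow policy (for example, discarding the oldest buffered message instead) would trade the same loss for fresher data at the receiver.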

Complex biological organisms exhibit characteristic best-effort properties: trillions of cells interact asynchronously while overcoming all but the most extreme failures in a noisy world. As such, bio-inspired algorithms present strong potential to benefit from best-effort communication strategies.

Much exciting work on best-effort computing has incorporated bespoke experimental hardware. However, existing software libraries for traditional HPC hardware do not typically explicitly expose a convenient best-effort communication interface for such work. This work introduces the Conduit library, which facilitates best-effort communication between parallel and distributed processes on existing, commercially-available hardware.

Publications & Software
2022 Best-Effort Communication Improves Performance and Scales Robustly on Conventional Hardware
arXiv
Authors: Matthew Andres Moreno, Charles Ofria
Date: November 23rd, 2022
DOI: 10.48550/ARXIV.2211.10897
Venue: arXiv
Abstract

Here, we test the performance and scalability of fully-asynchronous, best-effort communication on existing, commercially-available HPC hardware.

A first set of experiments tested whether best-effort communication strategies can benefit performance compared to the traditional perfect communication model. At high CPU counts, best-effort communication improved both the number of computational steps executed per unit time and the solution quality achieved within a fixed-duration run window.

Under the best-effort model, characterizing the distribution of quality of service across processing components and over time is critical to understanding the actual computation being performed. Additionally, a complete picture of scalability under the best-effort model requires analysis of how such quality of service fares at scale. To answer these questions, we designed and measured a suite of quality of service metrics: simulation update period, message latency, message delivery failure rate, and message delivery coagulation. Under a lower communication-intensivity benchmark parameterization, we found that median values for all quality of service metrics were stable when scaling from 64 to 256 processes. Under maximal communication intensivity, we found only minor – and, in most cases, nil – degradation in median quality of service.

In an additional set of experiments, we tested the effect of an apparently faulty compute node on performance and quality of service. Despite extreme quality of service degradation among that node and its clique, median performance and quality of service remained stable.

BibTeX
@misc{moreno2022best,
doi = {10.48550/ARXIV.2211.10897},
url = {https://arxiv.org/abs/2211.10897},
author = {Moreno, Matthew Andres and Ofria, Charles},
keywords = {Distributed, Parallel, and Cluster Computing (cs.DC), FOS: Computer and information sciences},
title = {Best-Effort Communication Improves Performance and Scales Robustly on Conventional Hardware},
publisher = {arXiv},
year = {2022},
}

Citation

Moreno, M. A., & Ofria, C. (2022). Best-Effort Communication Improves Performance and Scales Robustly on Conventional Hardware. arXiv preprint arXiv:2211.10897.

Supporting Materials

2021 Conduit: A C++ Library for Best-effort High Performance Computing
ACM Workshop on Parallel and Distributed Evolutionary Inspired Methods
Authors: Matthew Andres Moreno, Santiago Rodriguez Papa, Charles Ofria
Date: May 21st, 2021
DOI: 10.1145/3449726.3463205
Venue: ACM Workshop on Parallel and Distributed Evolutionary Inspired Methods
Abstract

Developing software to effectively take advantage of growth in parallel and distributed processing capacity poses significant challenges. Traditional programming techniques allow a user to assume that execution, message passing, and memory are always kept synchronized. However, maintaining this consistency becomes increasingly costly at scale. One proposed strategy is “best-effort computing”, which relaxes synchronization and hardware reliability requirements, accepting nondeterminism in exchange for efficiency. Although many programming languages and frameworks aim to facilitate software development for high performance applications, existing tools do not directly provide a prepackaged best-effort interface. The Conduit C++ Library aims to provide such an interface for convenient implementation of software that uses best-effort inter-thread and inter-process communication. Here, we describe the motivation, objectives, design, and implementation of the library. Benchmarks on a communication-intensive graph coloring problem and a compute-intensive digital evolution simulation show that Conduit’s best-effort model can improve scaling efficiency and solution quality, particularly in a distributed, multi-node context.

BibTeX
@inproceedings{moreno2021conduit,
author = {Moreno, Matthew Andres and Papa, Santiago Rodriguez and Ofria, Charles},
title = {Conduit: A C++ Library for Best-Effort High Performance Computing},
year = {2021},
isbn = {9781450383516},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3449726.3463205},
doi = {10.1145/3449726.3463205},
booktitle = {Proceedings of the Genetic and Evolutionary Computation Conference Companion},
pages = {1795--1800},
numpages = {6},
keywords = {high performance computing, best-effort computing},
location = {Lille, France},
series = {GECCO '21}
}

Citation

Matthew Andres Moreno, Santiago Rodriguez Papa, and Charles Ofria. 2021. Conduit: a C++ library for best-effort high performance computing. In Proceedings of the Genetic and Evolutionary Computation Conference Companion (GECCO ’21). Association for Computing Machinery, New York, NY, USA, 1795–1800. https://doi.org/10.1145/3449726.3463205

Supporting Materials

2021 Conduit: A C++ Library for Best-effort High Performance Computing
The 6th International Workshop on Modeling and Simulation of and by Parallel and Distributed Systems (MSPDS 2020)
Authors: Matthew Andres Moreno, Santiago Rodriguez Papa, Charles Ofria
Date: March 12th, 2021
Venue: The 6th International Workshop on Modeling and Simulation of and by Parallel and Distributed Systems (MSPDS 2020)
Abstract

Developing software to effectively take advantage of growth in parallel and distributed processing capacity poses significant challenges. Best-effort computing models, which relax synchronization requirements, have been proposed as a strategy to overcome the challenges of harnessing high performance computing at extreme scale. Although many programming languages and frameworks aim to facilitate software development for high performance applications, existing prevalent tools do not expose an explicit best-effort interface. The Conduit C++ Library aims to provide a convenient interface for best-effort inter-thread and inter-process communication. Here, we describe the motivation, objectives, design, and implementation of the library.

BibTeX
@inproceedings{moreno2021conduit_hpcs,
author = {Moreno, Matthew Andres and Papa, Santiago Rodriguez and Ofria, Charles},
title = {Conduit: A C++ Library for Best-Effort High Performance Computing},
year = {2021},
booktitle = {The 6th International Workshop on Modeling and Simulation of and by Parallel and Distributed Systems (MSPDS 2020)},
numpages = {2},
keywords = {high performance computing, best-effort computing},
location = {Barcelona, Spain},
series = {HPCS 2021}
}


Citation

Matthew Andres Moreno, Santiago Rodriguez Papa and Charles Ofria. 2021. Conduit: A C++ Library for Best-Effort High Performance Computing. MSPDS 2020: The 6th International Workshop on Modeling and Simulation of and by Parallel and Distributed Systems.

Supporting Materials

2020 conduit