Harnessing Best-Effort Computing to Enable Dynamic Artificial Life Simulations at Scale

Cartoon illustration of communication between simulation elements in experiment with Conduit software.

The parallel and distributed processing capacity of high-performance computing (HPC) clusters continues to grow rapidly and enable profound scientific and industrial innovations. These advances in hardware capacity and economy afford great opportunity, but also pose a serious challenge: developing approaches to effectively harness it.

Software and hardware that relaxes guarantees of correctness and determinism — a so-called ``best-effort model’’ — have been shown to improve speed. This work distills best-effort communication from the larger issue of best-effort computing. Specifically, we investigate the implications of relaxing synchronization and message delivery requirements. Such a best-effort approach meets the challenges of heterogenous, varying (i.e., due to power management), and generally lower communication bandwidth (relative to compute) expected on future HPC hardware. Notably, such a model presents the possibility of runtime adaptation to effectively utilize available resources given the particular ratio of compute and communication capability at any one moment in any one rack.

Complex biological organisms exhibit characteristic best-effort properties: trillions of cells interact asynchronously while overcoming all but the most extreme failures in a noisy world. As such, bio-inspired algorithms present strong potential to benefit from best-effort communication strategies.

Much exciting work on best-effort computing has incorporated bespoke experimental hardware. However, existing software libraries for traditional HPC hardware do not typically explicitly expose a convenient best-effort communication interface for such work. This work introduces the Conduit library, which facilitates best-effort communication between parallel and distributed processes on existing, commercially-available hardware.

Publications & Software

2024 Trackable Island-model Genetic Algorithms at Wafer Scale
The Genetic and Evolutionary Computation Conference

Download
View at Publisher

Authors	Matthew Andres Moreno, Connor Yang, Emily Dolson, Luis Zaman
Date	May 6th, 2024
DOI	10.1145/3638530.3664090
Venue	The Genetic and Evolutionary Computation Conference

Abstract

Emerging ML/AI hardware accelerators, like the 850,000 processor Cerebras Wafer-Scale Engine (WSE), hold great promise to scale up the capabilities of evolutionary computation. However, challenges remain in maintaining visibility into underlying evolutionary processes while efficiently utilizing these platforms’ large processor counts. Here, we focus on the problem of extracting phylogenetic information from digital evolution on the WSE platform. We present a tracking-enabled asynchronous island-based genetic algorithm (GA) framework for WSE hardware. Emulated and on-hardware GA benchmarks with a simple tracking-enabled agent model clock upwards of 1 million generations a minute for population sizes reaching 16 million. This pace enables quadrillions of evaluations a day. We validate phylogenetic reconstructions from these trials and demonstrate their suitability for inference of underlying evolutionary conditions. In particular, we demonstrate extraction of clear phylometric signals that differentiate wafer-scale runs with adaptive dynamics enabled versus disabled. Together, these benchmark and validation trials reflect strong potential for highly scalable evolutionary computation that is both efficient and observable. Kernel code implementing the island-model GA supports drop-in customization to support any fixed-length genome content and fitness criteria, allowing it to be leveraged to advance research interests across the community.

BibTeX

⎘ copy to clipboard

@inproceedings{moreno2024trackable_gecco,
  author = {Matthew Andres Moreno and Connor Yang and Emily Dolson and Luis Zaman},
  title = {Trackable Island-model Genetic Algorithms at Wafer Scale},
  pages = {101-102},
  isbn = {9798400704956},
  year = {2024},
  publisher= {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3638530.3664090},
  doi = {10.1145/3638530.3664090},
  booktitle= {Proceedings of the Genetic and Evolutionary Computation Conference Companion},
  numpages = {2},
  location = {Melbourne, VIC, Australia},
  series = {GECCO '24}
}

Citation

⎘ copy to clipboard

Matthew Andres Moreno, Connor Yang, Emily Dolson, and Luis Zaman. 2024. Trackable Island-model Genetic Algorithms at Wafer Scale. In Proceedings of the Companion Conference on Genetic and Evolutionary Computation (GECCO ‘24 Companion). Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3638530.3664090

Supporting Materials

2024 Trackable Agent-based Evolution Models at Wafer Scale
The 2024 Conference on Artificial Life

Download
View at Publisher

Authors	Matthew Andres Moreno, Connor Yang, Emily Dolson, Luis Zaman
Date	April 16th, 2024
DOI	10.1162/isal_a_00830
Venue	The 2024 Conference on Artificial Life

Abstract

Continuing improvements in computing hardware are poised to transform capabilities for in silico modeling of cross-scale phenomena underlying major open questions in evolutionary biology and artificial life, such as transitions in individuality, eco-evolutionary dynamics, and rare evolutionary events. Emerging ML/AI-oriented hardware accelerators, like the 850,000 processor Cerebras Wafer Scale Engine (WSE), hold particular promise. However, practical challenges remain in conducting informative evolution experiments that efficiently utilize these platforms’ large processor counts. Here, we focus on the problem of extracting phylogenetic information from agent-based evolution on the WSE platform. This goal drove significant refinements to decentralized in silico phylogenetic tracking, reported here. These improvements yield order-of-magnitude performance improvements. We also present an asynchronous island-based genetic algorithm (GA) framework for WSE hardware. Emulated and on-hardware GA benchmarks with a simple tracking-enabled agent model clock upwards of 1 million generations a minute for population sizes reaching 16 million agents. We validate phylogenetic reconstructions from these trials and demonstrate their suitability for inference of underlying evolutionary conditions. In particular, we demonstrate extraction, from wafer-scale simulation, of clear phylometric signals that differentiate runs with adaptive dynamics enabled versus disabled. Together, these benchmark and validation trials reflect strong potential for highly scalable agent-based evolution simulation that is both efficient and observable. Developed capabilities will bring entirely new classes of previously intractable research questions within reach, benefiting further explorations within the evolutionary biology and artificial life communities across a variety of emerging high-performance computing platforms.

BibTeX

⎘ copy to clipboard

@inproceedings{moreno2024trackable,
  author = {Matthew Andres Moreno and Connor Yang and Emily Dolson and Luis Zaman},
  title = {Trackable Agent-based Evolution Models at Wafer Scale},
  booktitle = {The 2024 Conference on Artificial Life},
  collection = {ALIFE 2024},
  publisher = {MIT Press},
  year = {2024},
  month = {07},
  doi={10.1162/isal_a_00830},
  url={https://doi.org/10.1162/isal_a_00830},
  numpages={12},
  pages={87-98},
}

Citation

⎘ copy to clipboard

Moreno, M. A., Yang, C., Dolson, E., & Zaman, L. (2024). Trackable Agent-based Evolution Models at Wafer Scale. In The 2024 Conference on Artificial Life. MIT Press. https://doi.org/10.1162/isal_a_00830

Supporting Materials

2022 Best-Effort Communication Improves Performance and Scales Robustly on Conventional Hardware
arXiv

Download
View at Publisher

Authors	Matthew Andres Moreno, Charles Ofria
Date	November 23rd, 2022
DOI	10.48550/arXiv.2211.10897
Venue	arXiv

Abstract

Here, we test the performance and scalability of fully-asynchronous, best-effort communication on existing, commercially-available HPC hardware.

A first set of experiments tested whether best-effort communication strategies can benefit performance compared to the traditional perfect communication model. At high CPU counts, best-effort communication improved both the number of computational steps executed per unit time and the solution quality achieved within a fixed-duration run window.

Under the best-effort model, characterizing the distribution of quality of service across processing components and over time is critical to understanding the actual computation being performed. Additionally, a complete picture of scalability under the best-effort model requires analysis of how such quality of service fares at scale. To answer these questions, we designed and measured a suite of quality of service metrics: simulation update period, message latency, message delivery failure rate, and message delivery coagulation. Under a lower communication-intensivity benchmark parameterization, we found that median values for all quality of service metrics were stable when scaling from 64 to 256 process. Under maximal communication intensivity, we found only minor – and, in most cases, nil – degradation in median quality of service.

In an additional set of experiments, we tested the effect of an apparently faulty compute node on performance and quality of service. Despite extreme quality of service degradation among that node and its clique, median performance and quality of service remained stable.

BibTeX

⎘ copy to clipboard

@misc{moreno2022best,
  doi = {10.48550/ARXIV.2211.10897},
  url = {https://arxiv.org/abs/2211.10897},
  author = {Moreno, Matthew Andres and Ofria, Charles},
  keywords = {Distributed, Parallel, and Cluster Computing (cs.DC), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {Best-Effort Communication Improves Performance and Scales Robustly on Conventional Hardware},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

Citation

⎘ copy to clipboard

Moreno, M. A., & Ofria, C. (2022). Best-Effort Communication Improves Performance and Scales Robustly on Conventional Hardware. arXiv preprint arXiv:2211.10897.

Supporting Materials

2021 Conduit: A C++ Library for Best-effort High Performance Computing
ACM Workshop on Parallel and Distributed Evolutionary Inspired Methods

Download
View at Publisher

Authors	Matthew Andres Moreno, Santiago Rodriguez Papa, Charles Ofria
Date	May 21st, 2021
DOI	10.1145/3449726.3463205
Venue	ACM Workshop on Parallel and Distributed Evolutionary Inspired Methods

Abstract

Developing software to effectively take advantage of growth in parallel and distributed processing capacity poses significant challenges. Traditional programming techniques allow a user to assume that execution, message passing, and memory are always kept synchronized. However, maintaining this consistency becomes increasingly costly at scale. One proposed strategy is “best-effort computing”, which relaxes synchronization and hardware reliability requirements, accepting nondeterminism in exchange for efficiency. Although many programming languages and frameworks aim to facilitate software development for high performance applications, existing tools do not directly provide a prepackaged best-effort interface. The Conduit C++ Library aims to provide such an interface for convenient implementation of software that uses best-effort inter-thread and inter-process communication. Here, we describe the motivation, objectives, design, and implementation of the library. Benchmarks on a communication-intensive graph coloring problem and a compute-intensive digital evolution simulation show that Conduit’s best-effort model can improve scaling efficiency and solution quality, particularly in a distributed, multi-node context.

BibTeX

⎘ copy to clipboard

@inproceedings{moreno2021conduit,
  author = {Moreno, Matthew Andres and Rodriguez Papa, Santiago and Ofria, Charles},
  title = {Conduit: A C++ Library for Best-Effort High Performance Computing},
  year = {2021},
  isbn = {9781450383516},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3449726.3463205},
  doi = {10.1145/3449726.3463205},
  booktitle = {Proceedings of the Genetic and Evolutionary Computation Conference Companion},
  pages = {1795–1800},
  numpages = {6},
  keywords = {high performance computing, best-effort computing},
  location = {Lille, France},
  series = {GECCO '21}
}

Citation

⎘ copy to clipboard

Matthew Andres Moreno, Santiago {Rodriguez Papa}, and Charles Ofria. 2021. Conduit: a C++ library for best-effort high performance computing. In Proceedings of the Genetic and Evolutionary Computation Conference Companion (GECCO ‘21). Association for Computing Machinery, New York, NY, USA, 1795–1800. https://doi.org/10.1145/3449726.3463205

Supporting Materials

2021 Conduit: A C++ Library for Best-effort High Performance Computing
The 6th International Workshop on Modeling and Simulation of and by Parallel and Distributed Systems (MSPDS 2020)

Download
View at Publisher

Authors	Matthew Andres Moreno, Santiago Rodriguez Papa, Charles Ofria
Date	March 12th, 2021
Venue	The 6th International Workshop on Modeling and Simulation of and by Parallel and Distributed Systems (MSPDS 2020)

Abstract

Developing software to effectively take advantage of growth in parallel and distributed processing capacity poses significant challenges. Best-effort computing models, which relax synchronization requirements, have been proposed as a strategy to overcome challenges harness high performance computing at extreme scale. Although many programming languages and frameworks aim to facilitate software development for high performance applications, existing prevalent tools do not expose an explicit best-effort interface. The Conduit C++ Library aims to provide a convenient interface for best-effort inter-thread and inter-process communication. Here, we describe the motivation, objectives, design, and implementation of the library.

BibTeX

⎘ copy to clipboard

@inproceedings{moreno2021conduit_hpcs,
  author = {Moreno, Matthew Andres and Rodriguez Papa, Santiago and Ofria, Charles},
  title = {Conduit: A C++ Library for Best-Effort High Performance Computing},
  year = {2021},
  booktitle = {The 6th International Workshop on Modeling and Simulation of and by Parallel and Distributed Systems (MSPDS 2020)},
  numpages = {2},
  keywords = {high performance computing, best-effort computing},
  location = {Barcelona, Sapin},
  series = {HPCS 2021}
}

Citation

⎘ copy to clipboard

Matthew Andres Moreno, Santiago Rodriguez Papa and Charles Ofria. 2021. Conduit: A C++ Library for Best-Effort High Performance Computing. MSPDS 2020: The 6th International Workshop on Modeling and Simulation of and by Parallel and Distributed Systems.

Supporting Materials

2020 conduit
header-only C++ library

View on GitHub

Authors	Matthew Andres Moreno, Santiago Rodriguez Papa, Charles Ofria
Date	January 1st, 2020
Venue	header-only C++ library

C++ library that wraps intra-thread, inter-thread, and inter-process communication in a uniform, modular, object-oriented interface, with a focus on asynchronous high-performance computing applications.

BibTeX

⎘ copy to clipboard

@inproceedings{moreno2021conduit,
  author = {Moreno, Matthew Andres and Rodriguez Papa, Santiago and Ofria, Charles},
  title = {Conduit: A C++ Library for Best-Effort High Performance Computing},
  year = {2021},
  isbn = {9781450383516},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3449726.3463205},
  doi = {10.1145/3449726.3463205},
  booktitle = {Proceedings of the Genetic and Evolutionary Computation Conference Companion},
  pages = {1795–1800},
  numpages = {6},
  keywords = {high performance computing, best-effort computing},
  location = {Lille, France},
  series = {GECCO '21}
}

Citation

⎘ copy to clipboard

Matthew Andres Moreno, Santiago Rodriguez Papa, and Charles Ofria. 2021. Conduit: a C++ library for best-effort high performance computing. In Proceedings of the Genetic and Evolutionary Computation Conference Companion (GECCO ‘21). Association for Computing Machinery, New York, NY, USA, 1795–1800. https://doi.org/10.1145/3449726.3463205

Supporting Materials