For high-performance computing (HPC) system designers and users, meeting the myriad challenges of next-generation exascale supercomputing systems requires rethinking their approach to application and…
A novel fault-tolerant parallel matrix multiplication algorithm, called 3D Coded SUMMA, has been developed that achieves higher failure tolerance than replication-based schemes for the same amount of…
For the problem of computing the connected components of a graph, this work considers the design of algorithms that are resilient to transient hardware faults, like bit flips. More specifically, it…
Log processing by Spark and Cassandra-based ANalytics (LogSCAN) is a newly developed analytical platform that provides flexible and scalable data gathering, transformation and computation. One major…
Reliability, availability, and serviceability (RAS) events are recorded from almost all components in a high-performance computing (HPC) system and therefore provide useful insights into the…
Reliability, availability and serviceability (RAS) logs of HPC resources, when closely investigated in spatial and temporal dimensions, can provide invaluable information regarding system status,…
High performance computing (HPC) applications are affected by multiple types of errors occurring in HPC systems, which hinder their ability to make forward progress and their correctness. The…
Today's High Performance Computing (HPC) systems are able to deliver performance on the order of petaflops due to fast computing devices, interconnects, and back-end storage systems. HPC systems contain…
In this work, we discover that workload characteristics, certain GPGPU cards, temperature and power consumption can be predictive of, or associated with, GPGPU errors, but it is non-…
Analyses of large-scale HPC systems have demonstrated the need to implement process failure resilience in long-running applications, which are susceptible to multiple failures during their execution.…
Resilience is one of the key challenges in maintaining high efficiency of future extreme-scale supercomputers. Researchers and system practitioners rely on field-data studies to understand…
High performance computing (HPC) applications are affected by multiple types of errors occurring on HPC systems, which hinder their ability to make forward progress and their correctness. The…