Christian Engelmann

Highlights

Architecture of the Plexus resilient runtime system, interfacing with programming model runtimes, libraries, system monitoring and job and resource management.

For high-performance computing (HPC) system designers and users, meeting the myriad challenges of next-generation exascale supercomputing systems requires rethinking their approach to application and…

3D Coded SUMMA

A novel fault-tolerant parallel matrix multiplication algorithm, called 3D Coded SUMMA, has been developed that achieves higher failure tolerance than replication-based schemes for the same amount of…

Self-stabilizing Connected Components

For the problem of computing the connected components of a graph, this work considers the design of algorithms that are resilient to transient hardware faults, like bit flips. More specifically, it…

LogSCAN

Log processing by Spark and Cassandra-based ANalytics (LogSCAN) is a newly developed analytical platform that provides flexible and scalable data gathering, transformation and computation. One major…

Slowdowns for different jobs with recorded reliability events, normalized with respect to other similar runs where no events are recorded

Reliability, availability, and serviceability (RAS) events are recorded from almost all components in a high-performance computing (HPC) system and therefore provide useful insights into the…

Job execution time vs processor HW and SW events

Reliability, availability and serviceability (RAS) logs of HPC resources, when closely investigated in spatial and temporal dimensions, can provide invaluable information regarding system status,…

Performance Efficient Multiresilience

High performance computing (HPC) applications are affected by multiple types of errors occurring in HPC systems, which hinder their ability to make forward progress and compromise their correctness. The…

Frequency of different types of lane degrades over time

Today's high-performance computing (HPC) systems are able to deliver performance on the order of petaflops thanks to fast computing devices, interconnects, and back-end storage systems. HPC systems contain…

Distribution of temperatures

In this work, we discover that workload characteristics, certain GPGPU cards, temperature and power consumption could have predictive or associative capabilities with GPGPU errors, but it is non-…

shrink image

Analyses of large-scale HPC systems have demonstrated the need to implement process failure resilience in long-running applications, which are susceptible to multiple failures during their execution.…

Failures_LSS

Resilience is one of the key challenges in maintaining high efficiency of future extreme scale supercomputers. Researchers and system practitioners rely on field-data studies to understand…

Pattern composition for detection, containment and mitigation for soft and process failure resilience in GMRES solver

High performance computing (HPC) applications are affected by multiple types of errors occurring on HPC systems, which hinder their ability to make forward progress and compromise their correctness. The…