Christian Engelmann

Highlights

Architecture of the Plexus resilient runtime system, interfacing with programming model runtimes, libraries, system monitoring and job and resource management.

PLEXUS: A Pattern-Oriented Runtime System Architecture for Resilient Extreme-Scale High-Performance Computing Systems

For high-performance computing (HPC) system designers and users, meeting the myriad challenges of next-generation exascale supercomputing systems requires rethinking their approach to application and…

3D Coded SUMMA: Communication-Efficient and Robust Parallel Matrix Multiplication

A novel fault-tolerant parallel matrix multiplication algorithm, called 3D Coded SUMMA, has been developed that achieves higher failure tolerance than replication-based schemes for the same amount of…

Self-stabilizing Connected Components

For the problem of computing the connected components of a graph, this work considers the design of algorithms that are resilient to transient hardware faults, like bit flips. More specifically, it…

A Comprehensive Informative Metric for Analyzing HPC System Status using the LogSCAN Platform

Log processing by Spark and Cassandra-based ANalytics (LogSCAN) is a newly developed analytical platform that provides flexible and scalable data gathering, transformation and computation. One major…

Slowdowns for different jobs with recorded reliability events, normalized with respect to other similar runs where no events are recorded

Analyzing the Impact of System Reliability Events on Applications in the Titan Supercomputer

Reliability, availability, and serviceability (RAS) events are recorded from almost all components in a high-performance computing (HPC) system and therefore provide useful insights into the…

Job execution time vs processor HW and SW events

A Big Data Analytics Framework for HPC Log Data: Three Case Studies Using the Titan Supercomputer Log

Reliability, availability and serviceability (RAS) logs of HPC resources, when closely investigated in spatial and temporal dimensions, can provide invaluable information regarding system status,…

Performance Efficient Multiresilience using Checkpoint Recovery in Iterative Algorithms

High performance computing (HPC) applications are affected by multiple types of errors occurring in HPC systems, which hinder both their ability to make forward progress and their correctness. The…

Frequency of different types of lane degrades over time

Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System

Today's High Performance Computing (HPC) systems are able to deliver performance on the order of petaflops due to fast computing devices, interconnect, and back-end storage systems. HPC systems contain…

Machine Learning Models for GPU Error Prediction in a Large Scale HPC System

In this work, we discover that workload characteristics, certain GPGPU cards, temperature and power consumption could have predictive or associative capabilities with GPGPU errors, but it is non-…

Shrink or Substitute: Handling Process Failures in HPC Systems using In-situ Recovery

Analyses of large-scale HPC systems have demonstrated the need to implement process failure resilience in long-running applications, which are susceptible to multiple failures during their execution.…

Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications

Resilience is one of the key challenges in maintaining high efficiency of future extreme scale supercomputers. Researchers and system practitioners rely on field-data studies to understand…

Pattern composition for detection, containment and mitigation for soft and process failure resilience in GMRES solver

Pattern-based Modeling of Multiresilience Solutions for High-Performance Computing

High performance computing (HPC) applications are affected by multiple types of errors occurring on HPC systems, which hinder both their ability to make forward progress and their correctness. The…
