Christian Engelmann

Highlights

LogSCAN

Log processing by Spark and Cassandra-based ANalytics (LogSCAN) is a newly developed analytical platform that provides flexible and scalable data gathering, transformation and computation. One major…

Slowdowns for different jobs with recorded reliability events, normalized with respect to other similar runs where no events are recorded

Reliability, availability, and serviceability (RAS) events are recorded from almost all components in a high-performance computing (HPC) system and therefore provide useful insights into the…

Job execution time vs processor HW and SW events

Reliability, availability and serviceability (RAS) logs of HPC resources, when closely investigated in spatial and temporal dimensions, can provide invaluable information regarding system status,…

Performance Efficient Multiresilience

High performance computing (HPC) applications are affected by multiple types of errors occurring in HPC systems, which hinder both their ability to make forward progress and their correctness. The…

Frequency of different types of lane degrades over time

Today's High Performance Computing (HPC) systems are able to deliver performance on the order of petaflops due to fast computing devices, interconnects, and back-end storage systems. HPC systems contain…

Distribution of temperatures

In this work, we discover that workload characteristics, certain GPGPU cards, temperature and power consumption could have predictive or associative capabilities with GPGPU errors, but it is non-…

Analyses of large-scale HPC systems have demonstrated the need to implement process failure resilience in long-running applications, which are susceptible to multiple failures during their execution.…

Failures_LSS

Resilience is one of the key challenges in maintaining high efficiency of future extreme scale supercomputers. Researchers and system practitioners rely on field-data studies to understand…

Pattern composition for detection, containment and mitigation for soft and process failure resilience in GMRES solver

High performance computing (HPC) applications are affected by multiple types of errors occurring on HPC systems, which hinder both their ability to make forward progress and their correctness. The…