Rizwan Ashraf

Highlights

Slowdowns for different jobs with recorded reliability events, normalized with respect to other similar runs where no events are recorded

Reliability, availability, and serviceability (RAS) events are recorded from almost all components in a high-performance computing (HPC) system and therefore provide useful insights into the…

Job execution time vs processor HW and SW events

Reliability, availability and serviceability (RAS) logs of HPC resources, when closely investigated in spatial and temporal dimensions, can provide invaluable information regarding system status,…

Performance Efficient Multiresilience

High performance computing (HPC) applications are affected by multiple types of errors occurring in HPC systems, which hinder their ability to make forward progress and compromise their correctness. The…

Analyses of large-scale HPC systems have demonstrated the need to implement process failure resilience in long-running applications, which are susceptible to multiple failures during their execution.…

Pattern composition for detection, containment and mitigation for soft and process failure resilience in GMRES solver

High performance computing (HPC) applications are affected by multiple types of errors occurring on HPC systems, which hinder their ability to make forward progress and compromise their correctness. The…