Highlight

A Comprehensive Informative Metric for Analyzing HPC System Status using the LogSCAN Platform

Total event count in the system log data (bottom), SIE with the Nodal Map layout (middle), and SIE with the Source Type layout (top), each plotted over time. The Nodal Map and Source Type layouts are two different record vs. feature views of the log data, and each is sensitive to different changes in the system’s health. For example, the overall event count may stay roughly constant over a period of time while the event types or locations change drastically; the SIE metric makes such changes easy to detect.

Achievement

We created the System Information Entropy (SIE) metric to concisely represent supercomputer health status as a time series. SIE summarizes the characteristics of a supercomputer comprehensively and sensitively, without assuming in advance the relative significance of individual system properties.

Significance and Impact

This new metric helps supercomputer operators assess system health status by making changes in system health quick and easy to identify. Different SIEs, each based on a different record vs. feature view of a system’s log data, are effective indicators of independent features that the system possesses at a given time.

Research Details

  • Used a multi-user Big Data analytics framework – Log processing by Spark and Cassandra-based ANalytics (LogSCAN) – in the private cloud of ORNL’s Compute and Data Environment for Science (CADES)
  • Analyzed 3+ years of system log data from ORNL’s Titan supercomputer (January 2015 – March 2018)
  • Applied Principal Component Analysis and Shannon Entropy Theory to calculate different SIEs based on different record vs. feature views of the log data
  • Studied how the two resulting SIEs (Nodal Map layout and Source Type layout) change over time to identify changes in system health status
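
The steps above do not spell out how Principal Component Analysis and Shannon entropy are combined. The following is only a minimal sketch under one plausible reading, not the authors' implementation. It assumes that the log data for a time window is arranged as a record vs. feature matrix of event counts, that PCA yields a variance spectrum for that matrix, and that the normalized spectrum is treated as a probability distribution whose Shannon entropy is taken. The function name `spectral_entropy` and the example matrices are hypothetical.

```python
import numpy as np


def spectral_entropy(matrix):
    """Shannon entropy (in bits) of the normalized PCA variance spectrum.

    `matrix` is a hypothetical record-by-feature array of event counts
    for one time window (rows = records, columns = features).
    """
    x = np.asarray(matrix, dtype=float)
    centered = x - x.mean(axis=0)  # PCA works on column-centered data
    # Squared singular values of the centered matrix are proportional
    # to the variances explained by the principal components.
    singular = np.linalg.svd(centered, compute_uv=False)
    variances = singular ** 2
    total = variances.sum()
    if total == 0.0:
        return 0.0  # no variation at all in this window
    p = variances / total
    p = p[p > 1e-12]  # drop numerically zero components before taking logs
    return float(-(p * np.log2(p)).sum())


# Hypothetical windows: variation along one direction only, and
# variation spread evenly over two orthogonal directions.
one_direction = [[1, 2], [2, 4], [3, 6]]
two_directions = [[1, 0], [-1, 0], [0, 1], [0, -1]]
low = spectral_entropy(one_direction)
high = spectral_entropy(two_directions)
```

Under this reading, a window whose variation is concentrated in a single component scores near zero, while a window whose variation is spread across components scores higher. Applying the same function to two differently arranged matrices of the same log data would give two separate entropy values per window.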
     

Overview

Log processing by Spark and Cassandra-based ANalytics (LogSCAN) is a newly developed analytical platform that provides flexible and scalable data gathering, transformation and computation.

One major challenge is to effectively summarize the status of a complex computer system, such as the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). Although there is plenty of operational and maintenance information collected and stored in real time, which may yield insights about short- and long-term system status, it is difficult to present this information in a comprehensive form.

In this work, we present system information entropy (SIE), a newly developed metric that leverages the power of traditional machine learning techniques and information theory. By compressing the multivariate, multidimensional event information recorded during the operation of the targeted system into a single SIE time series, we demonstrate that the historical system status can be represented concisely, comprehensively, and with high sensitivity.
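
To illustrate why an entropy time series can register changes that a raw event count misses, here is a deliberately simplified sketch. It is not the SIE computation itself: it skips the PCA step and takes the plain Shannon entropy of the event-type distribution in each fixed-length time window. The function name `entropy_series`, the window length, and the event types are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy_series(events, window):
    """Map window index -> (event_count, entropy_bits).

    `events` is an iterable of (timestamp, event_type) pairs and
    `window` is the window length in the same units as the timestamps.
    """
    buckets = defaultdict(Counter)
    for timestamp, event_type in events:
        buckets[int(timestamp // window)][event_type] += 1

    series = {}
    for index in sorted(buckets):
        counts = buckets[index]
        total = sum(counts.values())
        # Shannon entropy of the event-type distribution in this window
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        series[index] = (total, h)
    return series


# Hypothetical log: both windows hold four events, but the first has a
# single event type while the second has four different types.
events = [(0, "mce"), (10, "mce"), (20, "mce"), (30, "mce"),
          (60, "mce"), (70, "lustre"), (80, "gpu"), (90, "oom")]
series = entropy_series(events, window=60)
```

In this toy log the event count is identical in both windows, yet the entropy rises from zero to two bits when the mix of event types changes. A count-only plot would show a flat line across the two windows.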

Given an indicator as sharp as SIE, we argue that SIE-based analytics can reveal in-depth knowledge about system health status when combined with other sophisticated approaches, such as pattern recognition in the temporal domain or causality analysis that incorporates additional independent metrics of the system.
 

Yawei Hui, Byung Hoon (Hoony) Park, and Christian Engelmann. A Comprehensive Informative Metric for Analyzing HPC System Status using the LogSCAN Platform. In Proceedings of the 31st International Conference on High Performance Computing, Networking, Storage and Analysis (SC) Workshops 2018: 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2018, pages 29-38, Dallas, TX, USA, November 16, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-7281-0222-1. DOI 10.1109/FTXS.2018.00007.