Jeffrey Vetter

Highlights

XSBench, VEXS, and Shift unit test simulations of a fresh scenario on a P100 GPU. CSMD ORNL Computer Science and Mathematics Division

Enhancing Monte Carlo proxy applications on GPUs

Figure 1 compares Shift unit tests, XSBench, and VEXS for a fresh scenario on a P100 GPU. XSBench appears to predict that the hash-accelerated binary search and hash-…

Runtime comparison of translated OpenMP 4.X+ with hand-coded OpenMP.

CCAMP: OpenMP and OpenACC Interoperable Framework

To evaluate the effectiveness of CCAMP’s OpenACC-to-OpenMP 4.X+ baseline translation pass, we evaluated the hand-coded OpenMP 4.X+ applications in the SPEC Accel benchmark suite without applying any…

This figure compares the runtime performance of the default SPEC Accel OpenMP benchmarks compiled with clang (blue bars) against the performance of the same benchmarks after applying CCAMP optimizations (orange bars).

CCAMP: An Integrated Translation and Optimization Framework for OpenACC and OpenMP

High-level programming models are a necessity on future high-performance systems. However, with the diversity of accelerators and hardware vendors, performance portability of applications across the…

MCL performance

The Minos Computing Library: efficient parallel programming for extremely heterogeneous systems

We analyze the overhead introduced by MCL over OpenCL when using similar hardware resources. The goal of this test is to evaluate MCL scheduling overhead, the parallelism exploited by the MCL workers…

Jacobi benchmark with different FPGA

In-Depth Optimization with the OpenACC-to-FPGA Framework on an Arria 10 FPGA

This work examined the directive-based high-level FPGA programming approach implemented in the OpenARC compiler. The experimental results show that multi-threaded and single-threaded kernels can…

Analyzing the suitability of contemporary 3D-stacked PIM architectures for HPC scientific applications

Scaling off-chip bandwidth is challenging due to fundamental limitations such as fixed pin count and plateauing signaling rates. Recently, vendors have turned to 2.5D and 3D stacking to closely…

Implementing Efficient Data Compression and Encryption in a Persistent Key-Value Store for HPC

Recently, persistent data structures, like key-value stores (KVSs), which are stored in an HPC system's nonvolatile memory, provide an attractive solution for a number of emerging challenges like…

Partial UML class diagram for graph-based representation

FLAME: Graph-based Hardware Representations for Rapid and Precise Performance Modeling

The slowdown of Moore’s law has caused an escalation in architectural diversity over the last decade, and agile development of domain-specific heterogeneous chips is becoming a high priority. However…

The OpenACC data model: Preliminary study on its major challenges and implementations

This paper describes how the OpenACC data model is implemented in current OpenACC compilers, ranging from research compilers (OpenUH and OpenARC) to a commercial compiler (the PGI OpenACC compiler).…

Figure 1. Level-synchronous BFS algorithm designed for the EMU architecture

Designing Algorithms for the EMU Migrating-threads-based Architecture

EMU is a novel architecture that provides scalable access to a common partitioned global address space (PGAS) through a simple programming interface. The hardware is hierarchically organized as…

Language-Based Optimizations for Persistence on Nonvolatile Main Memory Systems

Substantial advances in nonvolatile memory (NVM) technologies have motivated widespread integration of NVM into mobile, enterprise, and HPC systems. Recently, considerable research has focused…

NVL-C: Static Analysis Techniques for Efficient, Correct Programming of Non-Volatile Main Memory Systems

Computer architecture experts expect that NVM hierarchies will play a more significant role in future systems including mobile, enterprise, and HPC architectures. With this expectation in mind, we…

Design Quality vs. Level of Representation

FITL: Extending LLVM for the Translation of Fault-Injection Directives

The frequency of hardware errors in HPC systems continues to grow as system designs evolve toward exascale. Tolerating these errors efficiently and effectively will require software-based resilience…

DRAGON: Breaking GPU Memory Capacity Limits with Direct NVM Access

Heterogeneous computing with accelerators is growing in importance in high performance computing (HPC), deep learning (DL), and other areas. Recently, application datasets have expanded beyond the…

Fig. 1. Proposed hybrid method to adapt the computation and synchronization to different wavefront problems and workspace matrices.

Highly Efficient Compensation-based Parallelism for Wavefront Loops on GPUs

Wavefront loops are widely used in many scientific applications, e.g., partial differential equation (PDE) solvers and sequence alignment tools. However, due to the data dependencies in wavefront…
