Jeffrey Vetter

Highlights

MCL performance (Computer Science and Mathematics, ORNL)

We analyze the overhead introduced by MCL over OpenCL when using similar hardware resources. The goal of this test is to evaluate MCL scheduling overhead, the parallelism exploited by the MCL workers…

Jacobi benchmark with different FPGAs (Computer Science and Mathematics, ORNL)

This work examined the directive-based high-level FPGA programming approach implemented in the OpenARC compiler. The experimental results show that multi-threaded and single-threaded kernels can…

PIM scaling and speedup

Scaling off-chip bandwidth is challenging due to fundamental limitations such as fixed pin count and plateauing signaling rates. Recently, vendors have turned to 2.5D and 3D stacking to closely…

Performance Evaluation

Recently, persistent data structures, like key-value stores (KVSs), which are stored in an HPC system's nonvolatile memory, provide an attractive solution for a number of emerging challenges like…

Partial UML class diagram for graph-based representation

The slowdown of Moore’s law has caused an escalation in architectural diversity over the last decade, and agile development of domain-specific heterogeneous chips is becoming a high priority. However…

OpenACC data model

This paper describes how the OpenACC data model is implemented in current OpenACC compilers, ranging from research compilers (OpenUH and OpenARC) to a commercial compiler (the PGI OpenACC compiler).…

Figure 1. Level-synchronous BFS algorithm designed for the EMU architecture

EMU is a novel architecture that provides scalable access to a common partitioned global address space (PGAS) through a simple programming interface. The hardware is hierarchically organized as…

NVL-C System

Substantial advances in nonvolatile memory (NVM) technologies have motivated widespread integration of NVM into mobile, enterprise, and HPC systems. Recently, considerable research has focused…

NVL-C System

Computer architecture experts expect that NVM hierarchies will play a more significant role in future systems including mobile, enterprise, and HPC architectures. With this expectation in mind, we…

Design Quality vs. Level of Representation

The frequency of hardware errors in HPC systems continues to grow as system designs evolve toward exascale. Tolerating these errors efficiently and effectively will require software-based resilience…

DRAGON

Heterogeneous computing with accelerators is growing in importance in high performance computing (HPC), deep learning (DL), and other areas. Recently, application datasets have expanded beyond the…

Fig. 1. Proposed hybrid method to adapt the computation and synchronization to different wavefront problems and workspace matrices.

Wavefront loops are widely used in many scientific applications, e.g., partial differential equation (PDE) solvers and sequence alignment tools. However, due to the data dependencies in wavefront…

Fig. 1. Single-GPU tests on NVIDIA P100. The results show that indirect addressing outperforms locally direct addressing, and the CSoA memory layout outperforms the SoA and bundling memory layouts.

GPU performance of the lattice Boltzmann method (LBM) depends heavily on memory access patterns. When LBM is advanced with GPUs on complex computational domains, geometric data is typically accessed…

Fig. 1. Comparison of directive-based FPGA approach with directive-based CPU and GPU approaches.

Reconfigurable architectures like Field Programmable Gate Arrays (FPGAs) have been used for accelerating computations from several domains because of their unique combination of flexibility,…

Fig. 1. Tuyere integrates application, mapping, and system knowledge into hardware simulations.

Memory technologies are under active development. Meanwhile, workloads on contemporary computing systems are increasing rapidly in size and diversity. Such dynamics in hardware and software further…