The DRAGON framework allows GPU kernels to access very large datasets stored in NVM without explicit data management.
Significance and Impact
DRAGON addresses shortcomings of the state-of-the-art CUDA-UM by transparently allowing scientific kernels to process very large datasets, eliminating the manual data-management effort otherwise required.
- We modify the NVIDIA Pascal GPU driver, inserting calls that use NVM as the backing store when serving GPU page faults.
- We employ several optimizations, such as fault coalescing, lazy writes, and parallel NVM access, to improve performance.
Heterogeneous computing with accelerators is growing in importance in high performance computing (HPC), deep learning (DL), and other areas. Recently, application datasets have expanded beyond the memory capacity of these accelerators, and often beyond the capacity of their hosts. Meanwhile, nonvolatile memory (NVM) storage has emerged as a pervasive component of nearly all computing systems, including HPC systems, because NVM provides massive memory capacity at affordable cost and power. Currently, for accelerator applications to use NVM, they must manually orchestrate data movement across multiple memories. This effort typically requires careful restructuring of the application, and it performs well only for applications with simple data-access behaviors.
To address this issue, we have developed DRAGON, a solution that enables all classes of GPGPU applications to transparently compute on terabyte datasets residing in NVM while ensuring the integrity of data buffers as necessary for NVM. DRAGON leverages the page-faulting mechanism on recent NVIDIA GPUs by extending the capabilities of CUDA Unified Memory (UM). Further, DRAGON improves overall performance by dynamically optimizing accesses to NVM. We empirically evaluate DRAGON on an NVIDIA P100 GPU and a 2.4 TB Micron 9100 NVMe card using traditional HPC kernels and popular DL workloads; our experimental results show that DRAGON transparently expands memory capacity and exploits Linux's page-cache mechanism to obtain speedups of up to 2.3x over CUDA-UM by automatically overlapping I/O, data transfer, and computation.