Researchers in the Future Technologies Group at Oak Ridge National Laboratory have developed a software framework for efficient execution of GPU kernels that embed inter-thread data dependencies. The framework is composed of a compiler pass and a runtime that is distributed across the CPUs and GPUs in the system. The compiler pass extracts dependence information from existing OpenMP-based applications and converts them into CUDA code. The host runtime creates a task graph at launch time and passes it to the GPU, and the device runtime ensures that thread blocks are executed according to the dependencies specified in the task graph.
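To make the compiler's input concrete, the following is a minimal sketch of an OpenMP 4.5 application with task dependencies, here a tiled wavefront. It is an illustration only and is not taken from the Juggler benchmarks: the names `run_wavefront` and `process_tile`, the padded grid layout, and the per-tile arithmetic are all assumptions.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-tile work: combine the results of the top and left neighbours.
static int process_tile(int top, int left) { return top + left + 1; }

// Runs an n x n wavefront in which tile (i, j) depends on tiles (i-1, j) and (i, j-1).
// The grid is padded with a zero row and a zero column so boundary tiles need no special case.
int run_wavefront(int n) {
    assert(n > 0);
    const int w = n + 1;                 // padded row width
    std::vector<int> grid(w * w, 0);
    int* g = grid.data();

    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 1; i <= n; ++i) {
            for (int j = 1; j <= n; ++j) {
                // OpenMP 4.5 depend clauses declare which tiles must finish first.
                #pragma omp task depend(in: g[(i - 1) * w + j], g[i * w + j - 1]) \
                                 depend(out: g[i * w + j])
                g[i * w + j] = process_tile(g[(i - 1) * w + j], g[i * w + j - 1]);
            }
        }
    }   // implicit barrier: all tasks have completed here
    return g[n * w + n];                 // value of the bottom-right tile
}
```

Calling `run_wavefront(3)` builds nine tasks whose `depend` clauses form the kind of dependence graph that a compiler pass could extract. If the file is compiled without OpenMP support, the pragmas are ignored and the loops run serially with the same result.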
Significance and Impact
This research has contributed to the community in two major ways:
- The Juggler runtime uniquely employs an all-in-GPU task-based execution mechanism. Once the kernel is launched, all task-based operations (i.e., task retrieval and insertion, and dependency checks) are performed within the device. Prior studies rely on schedulers running on the host and are therefore heavily affected by the communication overhead imposed by the PCIe bus.
- The Juggler compiler is the first to use OpenMP 4.5 task dependency declarations for fine-grained in-GPU tasking. Prior OpenMP implementations treated the GPU as a whole and hence underutilized the streaming multiprocessors (SMs) in the GPU.
- The compiler front-end of Juggler implements source-to-source transformations to automatically convert applications written with OpenMP 4.5 task directives to CUDA code that uses our task-based GPU runtime. This is achieved by instrumenting the input source code with calls to the Juggler host APIs.
- Once the Juggler-integrated application is compiled with nvcc and executed, the previously injected inspection code first creates a DAG from the input parameters that are available at application launch. The DAG is then fed into the Juggler device runtime along with the other application-related context that is necessary for execution.
- The device runtime is responsible for assigning tasks in the DAG to workers, executing them by calling the associated user kernels, and resolving the dependences after they are processed.
- Running an application with the Juggler framework requires minimal effort from the user (e.g., only if they want to further optimize the generated CUDA kernel), provided that the application is properly annotated with OpenMP 4.5 task constructs.
- We evaluated our runtime on an NVIDIA Tesla P100 GPU with seven different scientific kernels. Juggler improved performance by up to 31% compared to the classic global barrier-based approach.
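The device-runtime steps listed above (assign ready tasks, execute them, resolve dependences) can be sketched with per-task counters of unresolved predecessors. The code below is a serial, host-side C++ illustration of that bookkeeping only; Juggler performs the equivalent work inside the GPU, and the names `TaskGraph`, `add_dependence`, and `run_dag` are hypothetical rather than part of the Juggler API.

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Hypothetical task-graph layout; not Juggler's actual data structure.
struct TaskGraph {
    std::vector<std::vector<int>> successors;  // successors[t]: tasks that wait on t
    std::vector<int> pending;                  // pending[t]: unresolved predecessors of t

    explicit TaskGraph(int n) : successors(n), pending(n, 0) {}

    void add_dependence(int from, int to) {
        assert(from >= 0 && from < static_cast<int>(pending.size()));
        assert(to >= 0 && to < static_cast<int>(pending.size()));
        successors[from].push_back(to);
        ++pending[to];
    }
};

// Processes every task in an order that respects the dependences and
// returns that order. The graph is taken by value so the caller's counters
// are left untouched.
std::vector<int> run_dag(TaskGraph g) {
    std::queue<int> ready;
    for (int t = 0; t < static_cast<int>(g.pending.size()); ++t) {
        if (g.pending[t] == 0) ready.push(t);  // tasks with no predecessors
    }

    std::vector<int> order;
    while (!ready.empty()) {
        int t = ready.front();
        ready.pop();
        order.push_back(t);                    // stand-in for calling the user kernel

        // Dependence resolution: a successor becomes ready once its last
        // outstanding predecessor has been processed.
        for (int s : g.successors[t]) {
            if (--g.pending[s] == 0) ready.push(s);
        }
    }
    return order;
}
```

For a diamond-shaped graph (task 0 feeds tasks 1 and 2, which both feed task 3), `run_dag` processes task 0 first and task 3 last. In a concurrent version, the counter decrement would need to be atomic so that exactly one worker enqueues each newly ready task.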
In this study we have proposed Juggler, a new, dynamic task-based execution scheme for GPGPU applications with data dependences. Unlike previous studies, Juggler implements an in-GPU runtime for applications with OpenMP 4.5-based dependences. The runtime uniquely employs in-GPU dependence resolution and task placement. Our experimental evaluation of seven scientific kernels with data dependences on an NVIDIA Tesla P100 GPU showed that Juggler improves kernel execution performance by up to 31% when compared to a global barrier-based implementation. Our results demonstrate that the conventional GPGPU programming paradigms relying on grid-based execution with global synchronization can be replaced with DAG-based, dependence-aware task processing to increase the performance of scientific applications.
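One way to see why dependence-aware processing can beat global synchronization is a toy cost model. The sketch below compares the two schemes under strong assumptions: task costs are made-up numbers in arbitrary units, there are always enough workers, and scheduling itself is free. It is not a model of Juggler's measured results, and `Task`, `barrier_makespan`, and `dag_makespan` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Task {
    int level;               // barrier phase (wavefront) the task belongs to, >= 0
    int cost;                // made-up execution time in arbitrary units
    std::vector<int> preds;  // indices of tasks that must finish first
};

// Global-barrier model: each level starts only after the slowest task of the
// previous level has finished, so the total is the sum of per-level maxima.
int barrier_makespan(const std::vector<Task>& tasks) {
    int max_level = 0;
    for (const Task& t : tasks) max_level = std::max(max_level, t.level);
    std::vector<int> slowest(max_level + 1, 0);
    for (const Task& t : tasks) slowest[t.level] = std::max(slowest[t.level], t.cost);
    int total = 0;
    for (int c : slowest) total += c;
    return total;
}

// Dependence-aware model: a task starts as soon as its own predecessors finish.
// Assumes tasks are listed in topological order (predecessors have smaller indices).
int dag_makespan(const std::vector<Task>& tasks) {
    std::vector<int> finish(tasks.size(), 0);
    int total = 0;
    for (std::size_t i = 0; i < tasks.size(); ++i) {
        int start = 0;
        for (int p : tasks[i].preds) {
            assert(p >= 0 && static_cast<std::size_t>(p) < i);
            start = std::max(start, finish[p]);
        }
        finish[i] = start + tasks[i].cost;
        total = std::max(total, finish[i]);
    }
    return total;
}
```

Consider two independent two-task chains with costs (1, 3) and (3, 1). Under the barrier model each level waits for its slowest task, giving 3 + 3 = 6 units, whereas the dependence-aware model finishes in 4 units because each chain proceeds at its own pace.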