This project investigates a series of designs to improve pipeline flexibility and adaptivity while also increasing performance. We implement our designs using TensorFlow with Horovod and test them using several large DNNs. Our results show that the CPU time spent during training is reduced by 2-11X. Furthermore, our implementation can achieve speedups of up to 10X when CPU core limits are imposed. Our best pipeline also reduces the average power draw of the ensemble training process by 5-16%.
Significance and Impact
Training a Deep Neural Network (DNN) ensemble in parallel on a cluster of nodes is a common practice for producing multiple models that are then combined into a model with higher prediction accuracy. Existing ensemble training pipelines can perform a great deal of redundant operations, resulting in unnecessary CPU usage or even poor pipeline performance. To remove these redundancies, we need pipelines with more communication flexibility than existing DNN frameworks provide.
In this study, we analyze a series of queues used to buffer data between the stages of the machine learning pipeline, allowing us to isolate potential bottlenecks. We discover a bottleneck in the preprocessing stage that can hinder DNN training speed. To add flexibility to existing frameworks, we modify the Horovod library to support arbitrary MPI group allocation. Using this addition with TensorFlow, we examine three pipeline designs that we refer to as All-Shared, Single-Broadcast, and Multi-Broadcast. These pipelines are constructed from existing MPI collective operations, such as all-gather and broadcast.
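The idea of using inter-stage queues to locate a bottleneck can be illustrated with a minimal sketch. It is not the instrumentation used in the study: the function name `run_pipeline`, the two-stage layout, and the sleep-based stand-ins for preprocessing and training are all assumptions made for illustration. The sketch measures how long the training stage blocks on an empty queue; a large wait suggests that the upstream (preprocessing) stage is the slower one.

```python
import queue
import threading
import time


def run_pipeline(num_items=20, preprocess_delay=0.002, train_delay=0.0, capacity=4):
    """Run a two-stage pipeline joined by a bounded queue (hypothetical helper).

    Returns the number of items consumed and the total time the consumer
    spent blocked waiting for data from the producer.
    """
    buf = queue.Queue(maxsize=capacity)
    stats = {"consumer_wait": 0.0, "consumed": 0}

    def preprocess():
        for i in range(num_items):
            time.sleep(preprocess_delay)  # stand-in for decoding/augmentation work
            buf.put(i)
        buf.put(None)  # sentinel: no more data

    def train():
        while True:
            start = time.perf_counter()
            item = buf.get()  # blocks while the queue is empty
            stats["consumer_wait"] += time.perf_counter() - start
            if item is None:
                break
            time.sleep(train_delay)  # stand-in for a training step
            stats["consumed"] += 1

    threads = [threading.Thread(target=preprocess), threading.Thread(target=train)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stats


if __name__ == "__main__":
    slow_pre = run_pipeline(preprocess_delay=0.005, train_delay=0.0)
    slow_train = run_pipeline(preprocess_delay=0.0, train_delay=0.005)
    print("wait with slow preprocessing:", slow_pre["consumer_wait"])
    print("wait with slow training:     ", slow_train["consumer_wait"])
```

When preprocessing is the slower stage, the consumer's wait time grows roughly with the number of items; when training is slower, the queue stays full and the wait stays near zero.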
In particular, when we train and evaluate multiple models in parallel, we duplicate the machine learning pipeline. We establish three objectives for designing pipelines to increase system efficiency:
1) Eliminate pipeline redundancies through data sharing.
2) Enable sharing by increasing pipeline flexibility.
3) Use increased flexibility to accelerate the pipeline.
Towards these goals, we focus on balancing the computational demand of preprocessing and model training. The key is to make the pipeline use the computing resources in a cluster of nodes more intelligently, both to minimize redundant preprocessing and to speed it up.
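The first objective, eliminating redundancy through data sharing, can be made concrete with a small counting sketch. It does not reproduce the All-Shared, Single-Broadcast, or Multi-Broadcast designs, and it uses no MPI: the function names, the placeholder transform, and the in-process list copy standing in for a broadcast are all assumptions for illustration. It only compares how many preprocessing calls are made when every model runs its own pipeline with how many are made when one preprocessed batch is shared.

```python
def preprocess(sample, counter):
    """Placeholder transform that also counts how often it is invoked."""
    counter[0] += 1
    return sample * 2


def duplicated_pipelines(batch, num_models):
    """Each model preprocesses the same batch independently."""
    counter = [0]
    per_model = [[preprocess(s, counter) for s in batch] for _ in range(num_models)]
    return per_model, counter[0]


def shared_pipeline(batch, num_models):
    """The batch is preprocessed once and then handed to every model."""
    counter = [0]
    processed = [preprocess(s, counter) for s in batch]
    # Copying the list stands in for a broadcast to the other models.
    per_model = [list(processed) for _ in range(num_models)]
    return per_model, counter[0]


if __name__ == "__main__":
    batch = [1, 2, 3]
    dup_out, dup_calls = duplicated_pipelines(batch, num_models=4)
    shared_out, shared_calls = shared_pipeline(batch, num_models=4)
    print("duplicated preprocessing calls:", dup_calls)
    print("shared preprocessing calls:    ", shared_calls)
    print("same data delivered:", dup_out == shared_out)
```

In this toy setting every model receives identical data either way, but the duplicated layout performs the preprocessing once per model, while the shared layout performs it once in total and pays a communication cost instead.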
This work is published in SC'18: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis.