This paper presents a systematic exploration of flexible and efficient ensemble training for heterogeneous DNNs. It addresses two challenges. First, it formalizes the essence of the problem as an optimal resource allocation problem, analyzes its computational complexity, and presents an efficient greedy algorithm that maps DNNs to GPUs on the fly. Second, it develops a set of techniques to seamlessly integrate distributed data-parallel training of DNNs, preprocessing sharing, and runtime DNN-to-GPU assignments into a software framework, FLEET. FLEET features flexible and efficient communications and effective runtime resource allocations. Experiments on 100 heterogeneous DNNs on SummitDev demonstrate that FLEET can speed up ensemble training by 1.12-1.92X over the default training method, and by 1.23-1.97X over the state-of-the-art framework that was designed for homogeneous DNN ensemble training.
This study will appear at the Third Conference on Machine Learning and Systems (MLSys) (https://mlsys.org).
Significance and Impact
Parallel training of an ensemble of Deep Neural Networks (DNNs) on a cluster of nodes is an effective approach to shorten the process of neural network architecture search and hyper-parameter tuning for a given learning task. Prior efforts have shown that data sharing, where the common preprocessing operation is shared across the DNN training pipelines, saves computational resources and improves pipeline efficiency. The data sharing strategy, however, performs poorly for a heterogeneous set of DNNs, where each DNN has different computational needs and thus a different training rate and convergence speed.
This study presents FLEET, a flexible ensemble DNN training framework for efficiently training a heterogeneous set of DNNs. We build FLEET via several technical innovations. We theoretically prove that optimal resource allocation is NP-hard and propose a greedy algorithm to efficiently allocate resources for training each DNN with data sharing. We integrate data-parallel DNN training into ensemble training to mitigate the differences in training rates. We introduce checkpointing into this context to address the issue of different convergence speeds. Experiments show that FLEET significantly improves the training efficiency of DNN ensembles without compromising the quality of the result.
Recent years have witnessed rapid progress in the development of Deep Neural Networks (DNNs) and their successful application to the understanding of images, texts, and wavelet data from sciences to industry. An essential step in applying a deep learning algorithm to a new data set is the selection of an appropriate network architecture and hyper-parameters. In this step, one needs to train models with various architectures and configurations until the best model for a particular task is identified. An effective strategy for this is to concurrently train a set of DNNs on a cluster of nodes, which is referred to as ensemble training of DNNs. We refer to an ensemble of DNN models with the same architecture as a homogeneous DNN ensemble. Otherwise, the ensemble is called a heterogeneous DNN ensemble. This study specifically targets the training of heterogeneous DNN ensembles.
Efficiently training an ensemble of heterogeneous DNNs faces two challenges, which stem from two algorithmic characteristics that vary across DNN models. The first algorithmic characteristic is varying training rate. The training rate of a DNN is the compute throughput of the processing units (such as CPUs and GPUs) used for training the DNN. An ensemble of heterogeneous DNNs contains DNNs with different architectures and configurations. Each DNN in the ensemble can have different computational needs and thus a different training rate (i.e., processing speed) on the same computing resources. When synchronized data fetching is used for data sharing, to ensure that each DNN is trained on the entire dataset, the current set of cached batches cannot be evicted until every DNN has consumed it. If one DNN consumes preprocessed data more slowly than the others, the others have to wait for it. This waiting lowers the utilization of computing resources in the cluster and lengthens the overall training time of the ensemble.
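The cost of waiting on the slowest DNN can be illustrated with a small calculation. The following is a minimal sketch, not part of FLEET: the function name `sync_utilization` and the throughput numbers are hypothetical, and it assumes every DNN is throttled to exactly the rate of the slowest one under synchronized data fetching.

```python
def sync_utilization(rates):
    """Return the fraction of its standalone throughput each DNN achieves
    when all DNNs must advance at the pace of the slowest one.

    rates: standalone training rates (e.g., samples/second); hypothetical values.
    """
    slowest = min(rates)
    # Every DNN is effectively limited to the slowest rate.
    return [slowest / r for r in rates]


# Three hypothetical DNNs, each on identical resources.
rates = [100.0, 200.0, 400.0]
utilization = sync_utilization(rates)
mean_utilization = sum(utilization) / len(utilization)
print(utilization, mean_utilization)
```

In this toy setting the fastest DNN spends three quarters of its time waiting, which is the kind of imbalance that giving more resources to slower DNNs is meant to reduce.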
The second characteristic is varying convergence speed. Due to differences in network architecture or hyper-parameter settings, some DNNs may require a larger number of epochs (one epoch goes through all data samples once) to converge than others. There can be scenarios where a subset of DNNs in the ensemble have already converged while the shared preprocessing operations have to keep producing preprocessed data for the remaining DNNs. Resources allocated to the converged DNNs will be under-utilized until the training of all the DNNs is completed.
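A rough way to see how much capacity a static allocation can waste is to count idle GPU-epochs. This is an illustrative sketch only: `idle_gpu_epochs` and the epoch counts are hypothetical, and it assumes that all DNNs advance one epoch at a time in lockstep (as with shared preprocessing) and that a converged DNN keeps its GPUs until the whole ensemble finishes.

```python
def idle_gpu_epochs(epochs_to_converge, gpus_per_dnn):
    """Count GPU-epochs spent idle when GPUs stay pinned to converged DNNs.

    epochs_to_converge: epochs each DNN needs (hypothetical values).
    gpus_per_dnn: GPUs statically assigned to each DNN.
    """
    # The ensemble finishes when the slowest-converging DNN finishes.
    total_epochs = max(epochs_to_converge)
    return sum((total_epochs - e) * g
               for e, g in zip(epochs_to_converge, gpus_per_dnn))


idle = idle_gpu_epochs([10, 30, 50], [1, 1, 2])
print(idle)
```

Releasing the GPUs of converged DNNs and reassigning them to the DNNs still training is one way to reclaim this idle capacity.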
To address these issues, we propose FLEET, a flexible ensemble DNN training framework for efficiently training a heterogeneous set of DNNs. We build FLEET via several technical innovations. First, we formalize the essence of the problem as an optimal resource allocation problem. We analyze the computational complexity of the problem and present an efficient greedy algorithm that groups a subset of DNNs into a unit (a flotilla) and maps the DNNs in a flotilla to GPUs on the fly. The algorithm incurs marginal runtime overhead while balancing the pace at which the DNNs progress. Second, we develop a set of techniques to seamlessly integrate distributed data-parallel training of DNNs, preprocessing sharing, and runtime DNN-to-GPU assignments into FLEET, the first ensemble DNN training framework for heterogeneous DNNs. We introduce checkpointing into this context to address the issue of different convergence speeds. FLEET features flexible and efficient communications and effective runtime resource allocations.
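To make the idea of rate-balancing allocation concrete, here is a minimal greedy sketch. It is not the algorithm used in FLEET, whose details are not given here. It omits flotilla grouping, assumes at least one GPU per DNN, and assumes idealized linear scaling of training rate with the number of GPUs under data parallelism. The name `greedy_assign` and all numbers are hypothetical.

```python
import heapq


def greedy_assign(single_gpu_rates, num_gpus):
    """Give each DNN one GPU, then hand each remaining GPU to the DNN
    whose current effective rate is lowest.

    Assumption (idealized): effective rate = single-GPU rate * GPU count.
    """
    n = len(single_gpu_rates)
    if num_gpus < n:
        raise ValueError("this sketch needs at least one GPU per DNN")
    gpus = [1] * n
    # Min-heap keyed by current effective rate; ties break on DNN index.
    heap = [(rate, i) for i, rate in enumerate(single_gpu_rates)]
    heapq.heapify(heap)
    for _ in range(num_gpus - n):
        _, i = heapq.heappop(heap)  # slowest DNN right now
        gpus[i] += 1
        heapq.heappush(heap, (single_gpu_rates[i] * gpus[i], i))
    return gpus


rates = [100.0, 200.0, 400.0]  # hypothetical single-GPU rates
allocation = greedy_assign(rates, num_gpus=7)
effective = [r * g for r, g in zip(rates, allocation)]
print(allocation, effective)
```

With these toy numbers the slowest DNN receives the most GPUs and all three DNNs end up at the same effective rate. Real scaling is sublinear because of communication costs, so a practical allocator would need measured or profiled rates rather than this linear model.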