Convolutional Neural Networks (CNNs) are being actively explored for safety-critical applications such as autonomous vehicles and aerospace, where it is essential to ensure the reliability of inference results in the presence of possible memory faults. Traditional methods such as error correction codes (ECC) and Triple Modular Redundancy (TMR) are CNN-oblivious and incur substantial memory overhead and energy cost. This paper introduces in-place zero-space ECC, assisted by a new training scheme, weight distribution-oriented training. The new method provides the first known zero-space-cost memory protection for CNNs without compromising the reliability offered by traditional ECC.
Experiments on VGG16, ResNet18, and SqueezeNet validate the effectiveness of the proposed solution. Across all tested scenarios, the method provides protection consistently comparable to that offered by existing hardware ECC logic, while removing all space costs. It hence offers a promising replacement for existing protection schemes for CNNs.
This result was presented at the Thirty-third Conference on Neural Information Processing Systems (NeurIPS).
Significance and Impact
As Convolutional Neural Networks (CNNs) are increasingly explored for safety-critical applications such as autonomous vehicles and aerospace, the reliability of CNN inference is becoming an important concern. A key threat is memory faults (e.g., bit flips in memory), which may result from environmental perturbations, temperature variations, voltage scaling, manufacturing defects, wear-out, and radiation-induced soft errors. These faults change the stored data (e.g., CNN parameters), which may cause large deviations in the inference results.
Existing solutions have resorted to general memory fault protection mechanisms, such as Error Correction Codes (ECC) hardware, spatial redundancy, and radiation hardening. Being CNN-oblivious, these protections incur large costs. ECC, for instance, uses eight extra bits to protect 64 bits of memory; spatial redundancy requires at least two extra copies of the CNN parameters to correct one error (Triple Modular Redundancy, TMR); radiation hardening incurs substantial area overhead and hardware cost. The spatial, energy, and hardware costs are especially concerning for safety-critical CNN inference: because it often executes on resource-constrained (mobile) devices, these costs further limit model size and capacity, and increase the cost of the overall AI solution.
To address the fundamental tension between the need for reliability and the need for space, energy, and cost efficiency, this work proposes the first zero-space-cost memory protection for CNNs. The design capitalizes on the opportunities brought by the distinctive properties of CNNs. It further amplifies those opportunities by introducing a novel training scheme, Weight Distribution-Oriented Training (WOT), to regularize the weight distributions of CNNs such that they become more amenable to zero-space protection. It then introduces a novel protection method, in-place zero-space ECC, which removes all space cost of ECC protection while preserving its protection guarantees.
This study presents in-place zero-space Error Correction Codes (ECC), assisted by a new training scheme named WOT, to protect CNN memory. The protection scheme removes all space cost of ECC without compromising the reliability ECC offers, opening new opportunities for enhancing the accuracy, energy efficiency, reliability, and cost effectiveness of CNN-driven AI solutions.
The core idea of in-place zero-space ECC is to use non-informative bits in CNN parameters to store error check bits. For example, the commonly used SEC-DED (64, 57, 1) code uses seven check bits to protect 57 data bits for single error correction; together they form a 64-bit code word. If seven out of eight consecutive weights fall in the range [−64, 63], each of those seven small weights has one non-informative bit: its top bit merely sign-extends the 7-bit value, yielding seven non-informative bits in total. The essential idea of in-place ECC is to use these non-informative bits to store the error check bits for the eight weights. By embedding the check bits into the data, the scheme avoids all space cost.
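The bit accounting above can be made concrete with a small sketch. The code below is illustrative, not the paper's exact implementation: it uses a shortened Hamming single-error-correcting code with seven parity bits at the power-of-two positions of a 64-bit word, and it assumes the one large weight sits at a known slot (here, the last of the eight). The 57 data bits are the seven informative bits of each small weight plus all eight bits of the large weight (7×7 + 8 = 57).

```python
# Illustrative in-place zero-space ECC over eight int8 weights.
# Seven "small" weights in [-64, 63] contribute 7 informative bits each;
# one large weight (assumed last) contributes 8 bits; 7 parity bits make
# up the 64-bit codeword, one per non-informative top bit.

PARITY_POS = (1, 2, 4, 8, 16, 32, 64)    # power-of-two positions, 1-indexed

def encode(data_bits):
    """57 data bits -> 64-bit codeword (list of 0/1 for positions 1..64)."""
    assert len(data_bits) == 57
    code = [0] * 65                       # index 0 unused
    it = iter(data_bits)
    for pos in range(1, 65):
        if pos not in PARITY_POS:
            code[pos] = next(it)
    for p in PARITY_POS:                  # parity p covers positions with bit p set
        code[p] = 0
        for pos in range(1, 65):
            if pos != p and (pos & p):
                code[p] ^= code[pos]
    return code[1:]

def correct(code_bits):
    """Locate and fix at most one flipped bit, then return the 57 data bits."""
    code = [0] + list(code_bits)
    syndrome = 0
    for pos in range(1, 65):
        if code[pos]:
            syndrome ^= pos               # XOR of 1-positions = error location
    if syndrome:
        code[syndrome] ^= 1
    return [code[pos] for pos in range(1, 65) if pos not in PARITY_POS]

def weights_to_bits(ws):
    """Seven small weights give 7 bits each; the last (large) weight gives 8."""
    bits = []
    for i, w in enumerate(ws):
        n = 8 if i == len(ws) - 1 else 7
        bits += [((w & 0xFF) >> k) & 1 for k in range(n)]
    return bits

def bits_to_weights(bits):
    ws, i = [], 0
    for slot in range(8):
        n = 8 if slot == 7 else 7
        v = sum(b << k for k, b in enumerate(bits[i:i + n]))
        i += n
        ws.append(v - (1 << n) if v >= (1 << (n - 1)) else v)  # sign-extend
    return ws

weights = [5, -3, 17, -64, 63, 0, 12, -100]   # seven small, one large
codeword = encode(weights_to_bits(weights))
codeword[9] ^= 1                              # inject a single bit flip
assert bits_to_weights(correct(codeword)) == weights
```

In storage, the seven parity bits would overwrite the non-informative top bit of each small weight, so the same eight bytes hold both the weights and their protection.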
In order to embed the check bits into the data, we need the large weights to appear only at specific places. We formulate this as an optimization problem whose objective is to regularize the positions of large values. The constraint is that the weights in the l-th convolutional layer lie in the range [-64, 63], which ensures that only seven bits are needed to store each weight value.
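One plausible way to write this formulation (the notation is ours, not the paper's):

```latex
% L is the task loss, W_l the weights of convolutional layer l, and R a
% regularizer penalizing large-magnitude values outside designated positions.
\begin{aligned}
\min_{W} \quad & \mathcal{L}(W) \;+\; \lambda \sum_{l} R(W_l) \\
\text{s.t.} \quad & W_l \in [-64, 63]^{n_l} \quad \text{for each convolutional layer } l
\end{aligned}
```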
Our baseline is simple quantization-aware training (QAT), under which the above problem amounts to storing weights in reduced precision. We combine QAT with throttling, which makes the weights meet the constraint without losing the accuracy of the quantized model.
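A minimal sketch of this combination, under our reading of QAT plus throttling (the scale choice and projection step are illustrative, not the paper's exact procedure): fake-quantize the weights to an int8 grid, then project the quantized integers into the 7-bit range [-64, 63].

```python
# Sketch: quantization-aware rounding followed by a "throttle" projection
# that clamps quantized integer values into [-64, 63], so every stored
# weight needs only seven informative bits. Assumptions: symmetric
# per-tensor scale; straight-through rounding as in standard QAT.
import numpy as np

def quantize(w, scale):
    """Fake-quantize to the int8 grid: round to integers, clip, rescale."""
    return np.clip(np.round(w / scale), -128, 127) * scale

def throttle(q, scale):
    """Project quantized values into the 7-bit range [-64, 63]."""
    return np.clip(q / scale, -64, 63) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=1000)       # stand-in for a layer's weights
scale = np.abs(w).max() / 127              # simple symmetric scale
q = throttle(quantize(w, scale), scale)
ints = np.round(q / scale)
assert ints.min() >= -64 and ints.max() <= 63
```

In actual training this projection would run after each weight update, so the network adapts to the constrained range rather than being clipped once at the end.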