Fast Convolutional Nets With fbfft: A GPU Performance Evaluation

Authors: Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun

ICLR 2015

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We examine the performance profile of Convolutional Neural Network (CNN) training on the current generation of NVIDIA Graphics Processing Units (GPUs). We introduce two new Fast Fourier Transform convolution implementations: one based on NVIDIA's cuFFT library, and another based on a Facebook-authored FFT implementation, fbfft, that provides significant speedups over cuFFT (over 1.5x) for whole CNNs. Both of these convolution implementations are available in open source, and are faster than NVIDIA's cuDNN implementation for many common convolutional layers (up to 23.5x for a synthetic kernel configuration). We discuss different performance regimes of convolutions, comparing areas where straightforward time-domain convolutions outperform Fourier frequency-domain convolutions. We evaluate our relative performance to NVIDIA's cuDNN library (Chetlur et al. (2014)) on over 8,000 different configurations (Section 4). Figures 1-6 are performance summaries of cuFFT convolution versus cuDNN on an NVIDIA Tesla K40m, averaged across all three passes.
Researcher Affiliation | Industry | Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino & Yann LeCun, Facebook AI Research, 770 Broadway, New York, NY 10003, USA. {ntv,jhj,myrhev,soumith,spiantino,yann}@fb.com
Pseudocode | Yes | Table 1 describes the in-order operations for FFT computation of the forward pass, using the FFT2D and IFFT2D operators and Cgemm matrix multiplication. (A NumPy sketch of this pipeline appears after Table 2 below.)
Open Source Code | Yes | Both of these convolution implementations are available in open source, and are faster than NVIDIA's cuDNN implementation for many common convolutional layers (up to 23.5x for a synthetic kernel configuration). Our implementation is released as part of the fbcuda and fbcunn open-source libraries at http://github.com/facebook.
Open Datasets | Yes | In Table 3, we show performance for real CNNs, AlexNet (Krizhevsky et al. (2012)) and OverFeat fast (Sermanet et al. (2014)), comparing against cuDNN and cuda-convnet2 (ccn2) kernels in Torch.
Dataset Splits | No | The paper discusses evaluating performance on AlexNet and OverFeat fast but does not specify the dataset splits (training, validation, test) used for these experiments within its text.
Hardware Specification | Yes | Figures 1-6 are performance summaries of cuFFT convolution versus cuDNN on an NVIDIA Tesla K40m, averaged across all three passes. Table 3: AlexNet and OverFeat fast performance (K40, ms). Table 4: Representative layer performance (S = 128, K40m).
Software Dependencies | Yes | We compare our cuFFT convolution results against NVIDIA's cuDNN 1.0 library (Chetlur et al. (2014)). This number represents the throughput a time-domain kernel needs to achieve in order to match our implementation; it is computed as (S · f · f' · kh · kw · (h - kh + 1) · (w - kw + 1)) / time. This is a metric to compare relative efficiency across problem and padding sizes. In the cases L2, L3 and L4, a time-domain convolution would need to exceed the K40m peak of 4.29 Tflop/sec in order to match our throughput. (K40m on CUDA 6.5.) (A worked example of this throughput formula appears after Table 2 below.)
Experiment Setup | Yes | Thus, we restrict ourselves to a 5-D problem domain {S, f, f', n (= h = w), k (= kh = kw)}. Much of this space is not used in practice. Some areas are perhaps over-emphasized (large S, small k) due to current engineering concerns. We evaluate cuDNN vs cuFFT-based convolution for Table 2's 8,232 configurations.
Table 2: Configuration elements evaluated
DIMENSION | SIZES EVALUATED
Minibatch size (S) | 1, 16, 64, 128
Input filters (f) | 1, 4, 16, 64, 96, 128, 256
Output filters (f') | 1, 4, 16, 64, 96, 128, 256
Kernel h/w (k = kh = kw) | 3, 5, 7, 9, 11, 13
Output h/w (y = h - kh + 1 = w - kw + 1) | 1, 2, 4, 8, 16, 32, 64
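The Pseudocode row above refers to the paper's Table 1 pipeline for the forward pass: FFT2D of the inputs and filters, a complex matrix multiplication (Cgemm) per frequency bin, and an IFFT2D back to the spatial domain. The following is a minimal NumPy sketch of that pipeline under our own naming and layout assumptions; the function name, the use of real-to-complex transforms, and the filter-flip convention are ours, not the paper's.

```python
import numpy as np

def fft_conv2d_forward(inputs, filters):
    """Sketch of an FFT-based forward convolution pass (Table 1 style):
    FFT2D -> per-frequency-bin complex multiply/accumulate -> IFFT2D.

    inputs:  (S, f, h, w)    minibatch of input planes
    filters: (f', f, kh, kw) convolution kernels
    returns: (S, f', h - kh + 1, w - kw + 1) "valid" outputs
    """
    S, f, h, w = inputs.shape
    f_out, _, kh, kw = filters.shape

    # FFT2D of inputs and of filters zero-padded to the input size.
    # Filters are flipped so the frequency-domain product realizes the
    # cross-correlation that CNN layers compute.
    Fx = np.fft.rfft2(inputs, s=(h, w))                      # (S, f, h, w//2+1)
    Fk = np.fft.rfft2(filters[:, :, ::-1, ::-1], s=(h, w))   # (f', f, h, w//2+1)

    # Complex "matrix multiplication" per frequency bin (Cgemm in the paper):
    # contract the f input-plane dimension for every bin.
    Fy = np.einsum('sfyx,gfyx->sgyx', Fx, Fk)                # (S, f', h, w//2+1)

    # IFFT2D back to the spatial domain, then crop to the valid region.
    y = np.fft.irfft2(Fy, s=(h, w))
    return y[:, :, kh - 1:, kw - 1:]
```

For example, fft_conv2d_forward(np.random.randn(16, 4, 32, 32), np.random.randn(8, 4, 5, 5)) yields a (16, 8, 28, 28) tensor that matches a direct valid cross-correlation up to floating-point error.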
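For the throughput figure quoted in the Software Dependencies row, the helper below simply evaluates (S · f · f' · kh · kw · (h - kh + 1) · (w - kw + 1)) / time; the layer dimensions in the example are hypothetical and not taken from the paper's tables.

```python
def equivalent_time_domain_throughput(S, f, f_out, kh, kw, h, w, time_sec):
    """Operation rate a direct (time-domain) kernel would have to sustain to
    match an FFT-based convolution that completed in `time_sec` seconds."""
    ops = S * f * f_out * kh * kw * (h - kh + 1) * (w - kw + 1)
    return ops / time_sec

# Hypothetical layer: S=128, f=f'=64, 9x9 kernels on 32x32 inputs, 10 ms pass.
print(equivalent_time_domain_throughput(128, 64, 64, 9, 9, 32, 32, 0.010))
# ~2.45e12 ops/sec for this made-up configuration.
```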
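As a sanity check on the Experiment Setup row, crossing the five dimensions of Table 2 exhaustively (which is how we read the paper's setup) gives 4 x 7 x 7 x 6 x 7 = 8,232 configurations; the variable names below are ours.

```python
from itertools import product

minibatch_S = [1, 16, 64, 128]
input_filters_f = [1, 4, 16, 64, 96, 128, 256]
output_filters_fp = [1, 4, 16, 64, 96, 128, 256]
kernel_k = [3, 5, 7, 9, 11, 13]
output_y = [1, 2, 4, 8, 16, 32, 64]

# Cartesian product of Table 2's rows: 4 * 7 * 7 * 6 * 7 = 8,232 configurations.
configs = list(product(minibatch_S, input_filters_f, output_filters_fp,
                       kernel_k, output_y))
assert len(configs) == 8232

# Input height/width follows from the output size and kernel: h = w = y + k - 1.
S, f, fp, k, y = configs[-1]
h = w = y + k - 1   # e.g. y = 64, k = 13 -> 76x76 inputs
```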