Fast Convolutional Nets With fbfft: A GPU Performance Evaluation
Authors: Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun
ICLR 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We examine the performance profile of Convolutional Neural Network (CNN) training on the current generation of NVIDIA Graphics Processing Units (GPUs). We introduce two new Fast Fourier Transform convolution implementations: one based on NVIDIA's cuFFT library, and another based on a Facebook-authored FFT implementation, fbfft, that provides significant speedups over cuFFT (over 1.5×) for whole CNNs. Both of these convolution implementations are available in open source, and are faster than NVIDIA's cuDNN implementation for many common convolutional layers (up to 23.5× for a synthetic kernel configuration). We discuss different performance regimes of convolutions, comparing areas where straightforward time-domain convolutions outperform Fourier frequency-domain convolutions. We evaluate our relative performance to NVIDIA's cuDNN library (Chetlur et al. (2014)) on over 8,000 different configurations (Section 4). Figures 1-6 are performance summaries of cuFFT convolution versus cuDNN on an NVIDIA Tesla K40m, averaged across all three passes. |
| Researcher Affiliation | Industry | Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino & Yann LeCun, Facebook AI Research, 770 Broadway, New York, NY 10003, USA {ntv,jhj,myrhev,soumith,spiantino,yann}@fb.com |
| Pseudocode | Yes | Table 1 describes the in-order operations for FFT computation of the forward pass, using the FFT2D and IFFT2D operators and CGEMM matrix multiplication. |
| Open Source Code | Yes | Both of these convolution implementations are available in open source, and are faster than NVIDIA's cuDNN implementation for many common convolutional layers (up to 23.5× for a synthetic kernel configuration). Our implementation is released as part of the fbcuda and fbcunn open-source libraries at http://github.com/facebook. |
| Open Datasets | Yes | In Table 3, we show performance for real CNNs, AlexNet (Krizhevsky et al. (2012)) and OverFeat fast (Sermanet et al. (2014)), comparing against cuDNN and cuda-convnet2 (ccn2) kernels in Torch. |
| Dataset Splits | No | The paper discusses evaluating performance on AlexNet and OverFeat fast but does not specify the dataset splits (training, validation, test) used for these experiments within its text. |
| Hardware Specification | Yes | Figures 1-6 are performance summaries of cuFFT convolution versus cuDNN on an NVIDIA Tesla K40m, averaged across all three passes. Table 3: AlexNet and OverFeat fast performance (K40, ms). Table 4: Representative layer performance (S = 128, K40m). |
| Software Dependencies | Yes | We compare our cuFFT convolution results against NVIDIA's cuDNN 1.0 library (Chetlur et al. (2014)). This number represents the throughput a time-domain kernel needs to achieve in order to match our implementation; it is computed as (S · f · f′ · kh · kw · (h − kh + 1) · (w − kw + 1)) / time. This is a metric to compare relative efficiency across problem and padding sizes. In the cases L2, L3 and L4, a time-domain convolution would need to exceed the K40m peak of 4.29 Tflop/sec in order to match our throughput. (K40m on CUDA 6.5) |
| Experiment Setup | Yes | Thus, we restrict ourselves to a 5-D problem domain {S, f, f′, n (= h = w), k (= kh = kw)}. Much of this space is not used in practice. Some areas are perhaps over-emphasized (large S, small k) due to current engineering concerns. We evaluate cuDNN vs cuFFT-based convolution for Table 2's 8,232 configurations. Table 2 (configuration elements evaluated) — minibatch size (S): 1, 16, 64, 128; input filters (f): 1, 4, 16, 64, 96, 128, 256; output filters (f′): 1, 4, 16, 64, 96, 128, 256; kernel h/w (k = kh = kw): 3, 5, 7, 9, 11, 13; output h/w (y = h − kh + 1 = w − kw + 1): 1, 2, 4, 8, 16, 32, 64. |
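The forward-pass pipeline quoted in the Pseudocode row (FFT2D of inputs and filters, pointwise complex multiply-accumulate over input feature maps, then IFFT2D) can be sketched in NumPy. This is an illustrative re-implementation for clarity, not the paper's CUDA code: the function name, the use of real-to-complex transforms, and the zero-padding to the full input size are our assumptions.

```python
import numpy as np

def fft_conv2d_forward(x, w):
    """Valid-mode cross-correlation (a CNN 'convolution') via FFT.

    x: (S, f, h, w)   minibatch of inputs
    w: (fp, f, kh, kw) filters
    returns: (S, fp, h - kh + 1, w - kw + 1)
    """
    S, f, h, wd = x.shape
    fp, _, kh, kw = w.shape
    # Transform both operands at the full input size (zero-pads the kernel).
    X = np.fft.rfft2(x, s=(h, wd))     # (S, f, h, wd//2 + 1)
    Wf = np.fft.rfft2(w, s=(h, wd))    # (fp, f, h, wd//2 + 1)
    # Pointwise multiply-accumulate over input feature maps; the conjugate
    # turns circular convolution into cross-correlation.
    Y = np.einsum('sfhw,gfhw->sghw', X, np.conj(Wf))
    y = np.fft.irfft2(Y, s=(h, wd))
    # The 'valid' outputs live in the top-left corner of the circular result.
    return y[:, :, :h - kh + 1, :wd - kw + 1]
```

The einsum plays the role of the batched CGEMM in the paper's Table 1: for each frequency bin it contracts the input-feature dimension, which is where FFT convolution amortizes the transform cost across filters.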
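The equivalent-throughput metric quoted in the Software Dependencies row, (S · f · f′ · kh · kw · (h − kh + 1) · (w − kw + 1)) / time, is straightforward to evaluate. The layer shape and timing below are hypothetical, chosen only to illustrate the arithmetic against the K40m peak of 4.29 Tflop/sec cited in the paper.

```python
def equivalent_tflops(S, f, fp, h, w, kh, kw, time_sec):
    """Time-domain throughput (Tflop/s) a direct convolution would need
    to match a frequency-domain implementation that ran in time_sec,
    counting one op per multiply-accumulate as in the paper's metric."""
    ops = S * f * fp * kh * kw * (h - kh + 1) * (w - kw + 1)
    return ops / time_sec / 1e12

# Hypothetical layer: S=128, f=96, f'=256, 13x13 kernels on 32x32 inputs,
# assumed to run in 50 ms.
print(equivalent_tflops(128, 96, 256, 32, 32, 13, 13, 50e-3))
# ~4.25 Tflop/s, i.e. close to the quoted K40m peak of 4.29 Tflop/s
```

Large kernels (kh = kw = 13) inflate this number quickly, which matches the paper's observation that FFT convolution wins most decisively in exactly those regimes.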
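The 8,232 figure in the Experiment Setup row is exactly the Cartesian product of the Table 2 dimension lists (4 × 7 × 7 × 6 × 7). A quick sanity check, with the parameter lists transcribed from the quoted table:

```python
from itertools import product

minibatch = [1, 16, 64, 128]                     # S
in_filters = [1, 4, 16, 64, 96, 128, 256]        # f
out_filters = [1, 4, 16, 64, 96, 128, 256]       # f'
kernel = [3, 5, 7, 9, 11, 13]                    # k = kh = kw
output_hw = [1, 2, 4, 8, 16, 32, 64]             # y = h - kh + 1

configs = list(product(minibatch, in_filters, out_filters, kernel, output_hw))
print(len(configs))  # 4 * 7 * 7 * 6 * 7 = 8232
```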