Fast Convolutional Nets With fbfft: A GPU Performance Evaluation
Authors: Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun
ICLR 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We examine the performance profile of Convolutional Neural Network (CNN) training on the current generation of NVIDIA Graphics Processing Units (GPUs). We introduce two new Fast Fourier Transform convolution implementations: one based on NVIDIA's cuFFT library, and another based on a Facebook-authored FFT implementation, fbfft, that provides significant speedups over cuFFT (over 1.5×) for whole CNNs. Both of these convolution implementations are available in open source, and are faster than NVIDIA's cuDNN implementation for many common convolutional layers (up to 23.5× for a synthetic kernel configuration). We discuss different performance regimes of convolutions, comparing areas where straightforward time-domain convolutions outperform Fourier frequency-domain convolutions. We evaluate our relative performance to NVIDIA's cuDNN library (Chetlur et al. (2014)) on over 8,000 different configurations (Section 4). Figures 1-6 are performance summaries of cuFFT convolution versus cuDNN on an NVIDIA Tesla K40m, averaged across all three passes. |
| Researcher Affiliation | Industry | Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino & Yann LeCun, Facebook AI Research, 770 Broadway, New York, NY 10003, USA {ntv,jhj,myrhev,soumith,spiantino,yann}@fb.com |
| Pseudocode | Yes | Table 1 describes the in-order operations for FFT computation of the forward pass, using the FFT2D and IFFT2D operators and CGEMM matrix multiplication. |
| Open Source Code | Yes | Both of these convolution implementations are available in open source, and are faster than NVIDIA's cuDNN implementation for many common convolutional layers (up to 23.5× for a synthetic kernel configuration). Our implementation is released as part of the fbcuda and fbcunn open-source libraries at http://github.com/facebook. |
| Open Datasets | Yes | In Table 3, we show performance for real CNNs, AlexNet (Krizhevsky et al. (2012)) and OverFeat fast (Sermanet et al. (2014)), comparing against cuDNN and cuda-convnet2 (ccn2) kernels in Torch. |
| Dataset Splits | No | The paper discusses evaluating performance on AlexNet and OverFeat fast but does not specify the dataset splits (training, validation, test) used for these experiments within its text. |
| Hardware Specification | Yes | Figures 1-6 are performance summaries of cuFFT convolution versus cuDNN on an NVIDIA Tesla K40m, averaged across all three passes. Table 3: AlexNet and OverFeat fast performance (K40, ms). Table 4: Representative layer performance (S = 128, K40m). |
| Software Dependencies | Yes | We compare our cuFFT convolution results against NVIDIA's cuDNN 1.0 library (Chetlur et al. (2014)). This number represents the throughput a time-domain kernel needs to achieve in order to match our implementation; it is computed as (S · f · f′ · kh · kw · (h − kh + 1) · (w − kw + 1)) / time. This is a metric to compare relative efficiency across problem and padding sizes. In the cases L2, L3 and L4, a time-domain convolution would need to exceed the K40m peak of 4.29 Tflop/sec in order to match our throughput. (K40m on CUDA 6.5) |
| Experiment Setup | Yes | Thus, we restrict ourselves to a 5-D problem domain {S, f, f′, n (= h = w), k (= kh = kw)}. Much of this space is not used in practice. Some areas are perhaps over-emphasized (large S, small k) due to current engineering concerns. We evaluate cuDNN vs cuFFT-based convolution for Table 2's 8,232 configurations. Table 2 (configuration elements evaluated) — minibatch size (S): 1, 16, 64, 128; input filters (f): 1, 4, 16, 64, 96, 128, 256; output filters (f′): 1, 4, 16, 64, 96, 128, 256; kernel h/w (k = kh = kw): 3, 5, 7, 9, 11, 13; output h/w (y = h − kh + 1 = w − kw + 1): 1, 2, 4, 8, 16, 32, 64. |
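The forward-pass pipeline quoted in the Pseudocode row (FFT2D of inputs and filters, pointwise complex multiply-accumulate over input feature maps, then IFFT2D) can be sketched in NumPy. This is an illustrative re-implementation for clarity, not the paper's CUDA code: the function name, the use of real-to-complex transforms, and the zero-padding to the full input size are our assumptions.

```python
import numpy as np

def fft_conv2d_forward(x, w):
    """Valid-mode cross-correlation (a CNN 'convolution') via FFT.

    x: (S, f, h, w)   minibatch of inputs
    w: (fp, f, kh, kw) filters
    returns: (S, fp, h - kh + 1, w - kw + 1)
    """
    S, f, h, wd = x.shape
    fp, _, kh, kw = w.shape
    # Transform both operands at the full input size (zero-pads the kernel).
    X = np.fft.rfft2(x, s=(h, wd))     # (S, f, h, wd//2 + 1)
    Wf = np.fft.rfft2(w, s=(h, wd))    # (fp, f, h, wd//2 + 1)
    # Pointwise multiply-accumulate over input feature maps; the conjugate
    # turns circular convolution into cross-correlation.
    Y = np.einsum('sfhw,gfhw->sghw', X, np.conj(Wf))
    y = np.fft.irfft2(Y, s=(h, wd))
    # The 'valid' outputs live in the top-left corner of the circular result.
    return y[:, :, :h - kh + 1, :wd - kw + 1]
```

The einsum plays the role of the batched CGEMM in the paper's Table 1: for each frequency bin it contracts the input-feature dimension, which is where FFT convolution amortizes the transform cost across filters.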
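The equivalent-throughput metric quoted in the Software Dependencies row, (S · f · f′ · kh · kw · (h − kh + 1) · (w − kw + 1)) / time, is straightforward to evaluate. The layer shape and timing below are hypothetical, chosen only to illustrate the arithmetic against the K40m peak of 4.29 Tflop/sec cited in the paper.

```python
def equivalent_tflops(S, f, fp, h, w, kh, kw, time_sec):
    """Time-domain throughput (Tflop/s) a direct convolution would need
    to match a frequency-domain implementation that ran in time_sec,
    counting one op per multiply-accumulate as in the paper's metric."""
    ops = S * f * fp * kh * kw * (h - kh + 1) * (w - kw + 1)
    return ops / time_sec / 1e12

# Hypothetical layer: S=128, f=96, f'=256, 13x13 kernels on 32x32 inputs,
# assumed to run in 50 ms.
print(equivalent_tflops(128, 96, 256, 32, 32, 13, 13, 50e-3))
# ~4.25 Tflop/s, i.e. close to the quoted K40m peak of 4.29 Tflop/s
```

Large kernels (kh = kw = 13) inflate this number quickly, which matches the paper's observation that FFT convolution wins most decisively in exactly those regimes.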
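The 8,232 figure in the Experiment Setup row is exactly the Cartesian product of the Table 2 dimension lists (4 × 7 × 7 × 6 × 7). A quick sanity check, with the parameter lists transcribed from the quoted table:

```python
from itertools import product

minibatch = [1, 16, 64, 128]                     # S
in_filters = [1, 4, 16, 64, 96, 128, 256]        # f
out_filters = [1, 4, 16, 64, 96, 128, 256]       # f'
kernel = [3, 5, 7, 9, 11, 13]                    # k = kh = kw
output_hw = [1, 2, 4, 8, 16, 32, 64]             # y = h - kh + 1

configs = list(product(minibatch, in_filters, out_filters, kernel, output_hw))
print(len(configs))  # 4 * 7 * 7 * 6 * 7 = 8232
```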