ACDC: A Structured Efficient Linear Layer

Authors: Marcin Moczulski, Misha Denil, Jeremy Appleyard, Nando de Freitas

ICLR 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we show that it can indeed be successfully interleaved with ReLU modules in convolutional neural networks for image recognition. Our experiments also study critical factors in the training of these structured modules, including initialization and depth.
Researcher Affiliation | Collaboration | University of Oxford; NVIDIA; CIFAR
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It shows derivative equations in Section 4, but these are not presented as an algorithm. (A hedged sketch of the ACDC transform follows this table.)
Open Source Code | Yes | A Torch implementation of ACDC is available at https://github.com/mdenil/acdc-torch
Open Datasets | Yes | In particular we use the CaffeNet architecture for ImageNet (Deng et al., 2009).
Dataset Splits | Yes | In particular we use the CaffeNet architecture for ImageNet (Deng et al., 2009). While specific percentages are not given, ImageNet is a standard benchmark with well-defined train/validation splits, implying their usage.
Hardware Specification | Yes | The processor used to benchmark the ACDC layer was an NVIDIA Titan X.
Software Dependencies | No | The paper mentions "The NVIDIA library cuFFT" but does not provide a specific version number for it or for any other software dependency.
Experiment Setup | Yes | The model was trained using SGD with a learning rate of 0.1, multiplied by 0.1 every 100,000 iterations, momentum of 0.65, and weight decay of 0.0005. The output from the last convolutional layer was scaled by 0.1, and the learning rates for the matrices A and D were multiplied by 24 and 12, respectively. All diagonal matrices were initialized from a N(1, 0.061) distribution, and no weight decay was applied to A or D. Additive biases were added to the matrices D, but not to A, as this sufficed to provide the ACDC layer with bias terms just before the ReLU non-linearities; biases were initialized to 0. To prevent overfitting, dropout regularization with probability 0.1 was placed before each of the last 5 SELL layers. (A hedged configuration sketch follows this table.)
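
Although the paper gives no pseudocode, the layer itself is compact enough to sketch. The snippet below is a minimal NumPy/SciPy reading of an ACDC-style transform, not the authors' Torch implementation: two learned diagonal vectors interleaved with a DCT and an inverse DCT replace a dense N x N weight matrix with O(N) parameters and an O(N log N) transform. The exact ordering of the forward and inverse transforms and the orthonormal normalization are assumptions here and should be checked against the paper and the acdc-torch code.

    import numpy as np
    from scipy.fft import dct, idct

    def acdc_forward(x, a, d):
        """ACDC-style structured linear layer (ordering assumed):
        y = diag(a) @ C_inv @ diag(d) @ C @ x, with C the orthonormal DCT."""
        return a * idct(d * dct(x, norm="ortho"), norm="ortho")

    # Toy usage with near-identity diagonals (cf. the N(1, 0.061) initialization above).
    rng = np.random.default_rng(0)
    n = 8
    x = rng.standard_normal(n)
    a = rng.normal(loc=1.0, scale=0.25, size=n)
    d = rng.normal(loc=1.0, scale=0.25, size=n)
    print(acdc_forward(x, a, d))

Stacking several such layers with ReLU modules between them matches the interleaving described in the Research Type row; the acdc-torch repository linked above contains the authors' actual implementation.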
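
The Experiment Setup row also translates fairly directly into optimizer configuration. The sketch below is a hypothetical PyTorch rendering of those settings rather than the authors' original Torch training script; the toy parameter stand-ins and the reading of N(1, 0.061) as mean 1 with standard deviation 0.061 are assumptions.

    import torch
    from torch import nn

    # Toy stand-ins for the diagonal A and D parameters of one ACDC/SELL layer and
    # for the rest of the network; the real CaffeNet-based model is not reproduced here.
    n = 64
    a = nn.Parameter(torch.empty(n).normal_(mean=1.0, std=0.061))  # N(1, 0.061); std assumed
    d = nn.Parameter(torch.empty(n).normal_(mean=1.0, std=0.061))
    d_bias = nn.Parameter(torch.zeros(n))            # additive bias attached to D, initialized to 0
    other = nn.Parameter(torch.randn(n, n) * 0.01)   # placeholder for the remaining weights

    base_lr = 0.1
    optimizer = torch.optim.SGD(
        [
            {"params": [a], "lr": base_lr * 24, "weight_decay": 0.0},          # A: lr x24, no decay
            {"params": [d, d_bias], "lr": base_lr * 12, "weight_decay": 0.0},  # D: lr x12, no decay
            {"params": [other], "lr": base_lr, "weight_decay": 5e-4},          # everything else
        ],
        momentum=0.65,
    )

    # Learning rates multiplied by 0.1 every 100,000 iterations
    # (scheduler.step() called once per training iteration).
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100_000, gamma=0.1)

    # Dropout with p = 0.1, placed before each of the last 5 SELL layers in the reported setup.
    dropout = nn.Dropout(p=0.1)

The 0.1 scaling of the last convolutional layer's output is not shown; in this reading it would simply be a fixed multiplicative constant applied between the convolutional stack and the first ACDC layer.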