Reducing Overfitting in Deep Networks by Decorrelating Representations

Authors: Michael Cogswell, Faruk Ahmed, Ross Girshick, Larry Zitnick, Dhruv Batra

ICLR 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments across a range of datasets and network architectures show that this loss always reduces overfitting (as indicated by the difference between train and val performance) and improves generalization.
Researcher Affiliation | Collaboration | Michael Cogswell (Virginia Tech, Blacksburg, VA, cogswell@vt.edu); Faruk Ahmed (Université de Montréal, Montréal, Quebec, Canada, faruk.ahmed@umontreal.ca); Ross Girshick (Facebook AI Research (FAIR), Seattle, WA, rbg@fb.com); Larry Zitnick (Microsoft Research, Seattle, WA, larryz@microsoft.com); Dhruv Batra (Virginia Tech, Blacksburg, VA, dbatra@vt.edu)
Pseudocode | No | The paper describes the mathematical formulation of the DeCov loss but does not include any pseudocode or algorithm blocks. (An illustrative sketch of that formulation is given after this table.)
Open Source Code | No | The paper mentions using Caffe implementations and a Caffe Model Zoo link (https://gist.github.com/mavenlin/d802a5849de39225bcc6) for existing models, but it does not state that the authors' own implementation of the DeCov method is open source or provide a link to their code.
Open Datasets | Yes | Our experiments encompass a range of datasets (MNIST (LeCun et al., 1995), CIFAR10/100 (Krizhevsky & Hinton, 2009), ImageNet (Deng et al., 2009))
Dataset Splits | Yes | Hyper-parameters (loss weights for DeCov and weight decay) are chosen by grid search on the standard train/val split. (CIFAR10) ... We use the same architecture as the base architecture for CIFAR10 and hold out the last 10,000 of the 50,000 train examples for validation. (CIFAR100) ... The last 50,000 of the ILSVRC 2012 train images are held out for validation. (ImageNet)
Hardware Specification | No | The paper mentions 'Faster computers' and 'GPU support by NVIDIA' in general terms, and states 'Using cuDNNv3, AlexNet with 128x128 inputs takes 103ms averaged over 50 runs to compute a forward and backward pass.' (Footnote 1), but it does not provide the specific CPU or GPU models used for the experiments.
Software Dependencies | Yes | Our implementation comes from Caffe. ... Using cuDNNv3, AlexNet with 128x128 inputs takes 103ms averaged over 50 runs to compute a forward and backward pass. (Footnote 1)
Experiment Setup | Yes | Note that we set the Dropout rate to 0.5 as suggested by Srivastava et al. (2014). ... Hyper-parameters (loss weights for DeCov and weight decay) are chosen by grid search on the standard train/val split. ... The best DeCov weight (0.1) is consistent for a range of hidden activation sizes in this dataset. ... Our implementation comes from Caffe. In particular, it uses a fixed schedule that multiplies the learning rate by 1/10 every 100,000 iterations. ... We do not use early stopping and do not perform color augmentation. (A hedged sketch of this setup appears after the table.)
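
For reference, the DeCov loss the paper formulates penalizes the squared off-diagonal entries of the mini-batch covariance matrix of a hidden layer's activations. The NumPy sketch below is a minimal illustration of that formulation only; the function name, shapes, and use of NumPy are assumptions and do not reflect the authors' Caffe implementation.

```python
import numpy as np

def decov_loss(activations):
    """Illustrative DeCov-style penalty on a batch of hidden activations.

    activations: array of shape (batch_size, num_units).
    Returns 0.5 * (||C||_F^2 - ||diag(C)||_2^2), where C is the
    covariance of the activations computed over the mini-batch.
    """
    # Center each unit's activations over the mini-batch.
    centered = activations - activations.mean(axis=0, keepdims=True)
    # Batch covariance matrix C (num_units x num_units).
    cov = centered.T @ centered / activations.shape[0]
    # Penalize squared off-diagonal covariances only.
    frob_sq = np.sum(cov ** 2)
    diag_sq = np.sum(np.diag(cov) ** 2)
    return 0.5 * (frob_sq - diag_sq)

# Example: 128 samples, 64 hidden units.
h = np.random.randn(128, 64)
print(decov_loss(h))
```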
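
The experiment-setup details quoted above (grid search over the DeCov weight and weight decay on a held-out validation split, and a fixed schedule that multiplies the learning rate by 1/10 every 100,000 iterations) could be sketched as follows. The grid values, base learning rate, and the train_and_validate helper are hypothetical placeholders, not the authors' actual Caffe configuration.

```python
# Hypothetical sketch of the setup described above; the grids and the
# train_and_validate helper are placeholders, not the authors' configuration.

def step_lr(base_lr, iteration, step_size=100_000, gamma=0.1):
    # Fixed schedule: multiply the learning rate by 1/10 every 100,000 iterations.
    return base_lr * (gamma ** (iteration // step_size))

def grid_search(train_and_validate,
                decov_weights=(0.0, 0.001, 0.01, 0.1),  # assumed grid
                weight_decays=(0.0, 0.0005, 0.005)):    # assumed grid
    # Choose the (DeCov weight, weight decay) pair with the best validation accuracy.
    best = None
    for dw in decov_weights:
        for wd in weight_decays:
            val_acc = train_and_validate(decov_weight=dw, weight_decay=wd)
            if best is None or val_acc > best[0]:
                best = (val_acc, dw, wd)
    return best
```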