DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition
Authors: Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, Trevor Darrell
ICML 2014 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate whether features extracted from the activation of a deep convolutional network trained in a fully supervised fashion on a large, fixed set of object recognition tasks can be repurposed to novel generic tasks. Our main result is the empirical validation that a generic visual feature based on convolutional network weights trained on ImageNet outperforms a host of conventional visual representations on standard benchmark object recognition tasks, including Caltech-101 (Fei-Fei et al., 2004), the Office domain adaptation dataset (Saenko et al., 2010), the Caltech-UCSD Birds fine-grained recognition dataset (Welinder et al., 2010), and the SUN-397 scene recognition database (Xiao et al., 2010). |
| Researcher Affiliation | Academia | Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, Trevor Darrell {JDONAHUE,JIAYQ,VINYALS,JHOFFMAN,NZHANG,ETZENG,TREVOR}@EECS.BERKELEY.EDU, UC Berkeley & ICSI, Berkeley, CA, USA |
| Pseudocode | No | The paper does not contain any sections explicitly labeled "Pseudocode" or "Algorithm", nor does it present structured, code-like steps for any method. |
| Open Source Code | Yes | Our implementation, decaf, is publicly available at https://github.com/UCB-ICSI-Vision-Group/decaf-release. In addition, we have released the network parameters used in our experiments to allow for out-of-the-box feature extraction without the need to re-train the large network. (A hedged feature-extraction sketch follows the table.) |
| Open Datasets | Yes | Our main result is the empirical validation that a generic visual feature based on convolutional network weights trained on ImageNet outperforms a host of conventional visual representations on standard benchmark object recognition tasks, including Caltech-101 (Fei-Fei et al., 2004), the Office domain adaptation dataset (Saenko et al., 2010), the Caltech-UCSD Birds fine-grained recognition dataset (Welinder et al., 2010), and the SUN-397 scene recognition database (Xiao et al., 2010). |
| Dataset Splits | Yes | Our instance of the model attains an error rate of 42.9% on the ILSVRC-2012 validation set, 2.2% shy of the 40.7% achieved by Krizhevsky et al. (2012). The model entered into the competition actually achieved a top-1 validation error rate of 36.7% by averaging the predictions of 7 structurally identical models that were initialized and trained independently. We trained only a single instance of the model; hence we refer to the single model error rate of 40.7%. In each evaluation, the classifier, a logistic regression (LogReg) or support vector machine (SVM), is trained on a random set of 30 samples per class (including the background class), and tested on the rest of the data, with parameters cross-validated for each split on a 25 train/5 validation subsplit of the training data. (A sketch of this evaluation protocol follows the table.) |
| Hardware Specification | Yes | The Tesla K20 used in our experiments was donated by the NVIDIA Corporation. Our implementation is able to process about 40 images per second with an 8-core commodity machine when the CNN model is executed in a minibatch mode. |
| Software Dependencies | No | Specifically, we adopted open-source Python packages such as numpy/scipy for efficient numerical computation, with parts of the computation-heavy code implemented in C and linked to Python. The paper mentions software names like Python, numpy, scipy, and cuda-convnet, but does not specify version numbers for any of them. |
| Experiment Setup | Yes | We refer to Krizhevsky et al. (2012) for a detailed discussion of the architecture and training protocol, which we closely followed with the exception of two small differences in the input data. First, we ignore the image's original aspect ratio and warp it to 256×256, rather than resizing and cropping to preserve the proportions. Secondly, we did not perform the data augmentation trick of adding random multiples of the principal components of the RGB pixel values throughout the dataset. At training time, this technique [dropout] works by randomly setting half of the activations (here, our features) in a given layer to 0. At test time, all activations are multiplied by 0.5. All images are preprocessed using the procedure described for the ILSVRC images in Section 3, taking features on the center 224×224 crop of the 256×256 resized image. (A preprocessing and dropout sketch follows the table.) |
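
For the Open Source Code item: the quoted text promises out-of-the-box feature extraction from the released parameters. The sketch below is hypothetical; the import path, parameter filenames, and layer key are assumptions from recollection of decaf-release conventions, not the package's documented API, so check the repository README before relying on any of these names.

```python
# Hypothetical out-of-the-box feature extraction with the released parameters.
# All names here (import path, filenames, layer key) are assumptions; consult
# https://github.com/UCB-ICSI-Vision-Group/decaf-release for the real API.
import numpy as np
from decaf.scripts.imagenet import DecafNet  # assumed import path

# Assumed filenames of the released network parameters and metadata.
net = DecafNet("imagenet.decafnet.epoch90", "imagenet.decafnet.meta")

image = np.zeros((256, 256, 3), dtype=np.uint8)  # placeholder for a real RGB image
scores = net.classify(image)                      # forward pass through the network
decaf6 = net.feature("fc6_cudanet_out")           # assumed key for the DeCAF6 activations
```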
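For the Dataset Splits item: a minimal sketch of the quoted protocol, assuming precomputed DeCAF features `X` (samples × dimensions) and labels `y`. Thirty samples per class are used for training, the rest for testing, and the classifier's regularization parameter is selected on a 25 train / 5 validation subsplit of the training data. scikit-learn and the linear SVM grid are illustrative assumptions; the paper does not tie its evaluation to a particular library.

```python
# Sketch of the 30-per-class train/test protocol with a 25/5 model-selection subsplit.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV, PredefinedSplit

def evaluate_split(X, y, n_train=30, n_subtrain=25, seed=0):
    rng = np.random.RandomState(seed)
    train_idx, test_idx = [], []
    for c in np.unique(y):
        idx = rng.permutation(np.where(y == c)[0])
        train_idx.extend(idx[:n_train])   # 30 training samples per class
        test_idx.extend(idx[n_train:])    # the remaining samples are used for testing
    train_idx, test_idx = np.array(train_idx), np.array(test_idx)

    # 25 train / 5 validation subsplit of the training data for hyperparameter selection.
    fold = np.full(len(train_idx), -1)    # -1 = always in the training portion
    for c in np.unique(y[train_idx]):
        cls = np.where(y[train_idx] == c)[0]
        fold[cls[n_subtrain:]] = 0        # last 5 per class form the validation fold
    search = GridSearchCV(LinearSVC(), {"C": [1e-3, 1e-2, 1e-1, 1, 10]},
                          cv=PredefinedSplit(fold))
    search.fit(X[train_idx], y[train_idx])
    return search.score(X[test_idx], y[test_idx])
```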
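For the Experiment Setup item: a minimal sketch of the input pipeline and dropout behaviour described in the quote, namely warping to 256×256 while ignoring the aspect ratio, taking the center 224×224 crop, zeroing half of the activations at training time, and scaling by 0.5 at test time. Pillow is an assumption here; the paper's own preprocessing lives in the decaf code.

```python
# Sketch of the input preprocessing and the dropout train/test behaviour described above.
import numpy as np
from PIL import Image

def preprocess(path):
    # Warp to 256x256, ignoring the original aspect ratio.
    img = Image.open(path).convert("RGB").resize((256, 256), Image.BILINEAR)
    arr = np.asarray(img, dtype=np.float32)
    # Take the center 224x224 crop of the resized image.
    off = (256 - 224) // 2
    return arr[off:off + 224, off:off + 224, :]

def dropout(activations, train=True, rng=np.random):
    # Training: randomly zero (roughly) half of the activations.
    # Test: multiply all activations by 0.5 instead.
    if train:
        return activations * (rng.rand(*activations.shape) > 0.5)
    return activations * 0.5
```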