OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks

Authors: Pierre Sermanet; Rob Fergus; Yann LeCun; Xiang Zhang; David Eigen; Michael Mathieu

ICLR 2014

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments are conducted on the ImageNet ILSVRC 2012 and 2013 datasets and establish state-of-the-art results on the ILSVRC 2013 localization and detection tasks.
Researcher Affiliation | Academia | Courant Institute of Mathematical Sciences, New York University, 719 Broadway, 12th Floor, New York, NY 10003. sermanet,deigen,xiang,mathieu,fergus,yann@cs.nyu.edu
Pseudocode | Yes | We combine the individual predictions (see Fig. 7) via a greedy merge strategy applied to the regressor bounding boxes, using the following algorithm.
(a) Assign to C_s the set of classes in the top k for each scale s ∈ 1…6, found by taking the maximum detection class outputs across spatial locations for that scale.
(b) Assign to B_s the set of bounding boxes predicted by the regressor network for each class in C_s, across all spatial locations at scale s.
(c) Assign B ← ∪_s B_s.
(d) Repeat merging until done:
(e) (b1*, b2*) = argmin_{b1 ≠ b2 ∈ B} match_score(b1, b2)
(f) If match_score(b1*, b2*) > t, stop.
(g) Otherwise, set B ← (B \ {b1*, b2*}) ∪ box_merge(b1*, b2*)
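The greedy merge in steps (d)-(g) can be sketched in Python. Note that `match_score` and `box_merge` below are simplified assumptions: the paper describes the score as combining the distance between box centers with their intersection area, and the merge as averaging the boxes' coordinates, but the exact weighting here is not the authors' implementation.

```python
# Hypothetical sketch of the greedy bounding-box merge (steps d-g).
# Boxes are (x1, y1, x2, y2) tuples.
from itertools import combinations


def box_merge(b1, b2):
    # Merge two boxes by averaging their coordinates.
    return tuple((a + b) / 2.0 for a, b in zip(b1, b2))


def intersection_area(b1, b2):
    w = min(b1[2], b2[2]) - max(b1[0], b2[0])
    h = min(b1[3], b2[3]) - max(b1[1], b2[1])
    return max(w, 0.0) * max(h, 0.0)


def match_score(b1, b2):
    # Lower score = better match: close centers and large overlap.
    # The relative weighting of the two terms is an assumption.
    c1 = ((b1[0] + b1[2]) / 2.0, (b1[1] + b1[3]) / 2.0)
    c2 = ((b2[0] + b2[2]) / 2.0, (b2[1] + b2[3]) / 2.0)
    center_dist = ((c1[0] - c2[0]) ** 2 + (c1[1] - c2[1]) ** 2) ** 0.5
    return center_dist - intersection_area(b1, b2)


def greedy_merge(boxes, t):
    """Repeatedly merge the best-matching pair until no pair scores below t."""
    boxes = list(boxes)
    while len(boxes) > 1:
        b1, b2 = min(combinations(boxes, 2), key=lambda p: match_score(*p))
        if match_score(b1, b2) > t:  # step (f): stop when no good match remains
            break
        boxes.remove(b1)             # step (g): replace the pair by their merge
        boxes.remove(b2)
        boxes.append(box_merge(b1, b2))
    return boxes
```

Running `greedy_merge` on two heavily overlapping boxes and one distant box collapses the overlapping pair into a single averaged box and leaves the outlier untouched.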
Open Source Code | Yes | Along with this paper, we release a feature extractor named OverFeat [1] in order to provide powerful features for computer vision research. Two models are provided: a fast one and an accurate one. Each architecture is described in Tables 1 and 3. We also compare their sizes in Table 4 in terms of parameters and connections. [1] http://cilvr.nyu.edu/doku.php?id=software:overfeat:start
Open Datasets | Yes | We train the network on the ImageNet 2012 training set (1.2 million images and C = 1000 classes) [5].
Dataset Splits | Yes | We apply our network to the ImageNet 2012 validation set using the localization criterion specified for the competition.
Hardware Specification | Yes | Our network with 6 scales takes around 2 seconds on a K20x GPU to process one image.
Software Dependencies | No | No software dependencies with version numbers are mentioned. The paper refers to common ML techniques and components such as 'ReLU', 'max pooling', 'Dropout', 'softmax', and 'stochastic gradient descent', but not to software frameworks or libraries with versions.
Experiment Setup | Yes | Each image is downsampled so that the smallest dimension is 256 pixels. We then extract 5 random crops (and their horizontal flips) of size 221x221 pixels and present these to the network in mini-batches of size 128. The weights in the network are initialized randomly with (µ, σ) = (0, 1 × 10⁻²). They are then updated by stochastic gradient descent, accompanied by a momentum term of 0.6 and an ℓ2 weight decay of 1 × 10⁻⁵. The learning rate is initially 5 × 10⁻² and is successively decreased by a factor of 0.5 after (30, 50, 60, 70, 80) epochs. Dropout [11] with a rate of 0.5 is employed on the fully connected layers (6th and 7th) in the classifier.
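The update rule quoted in this row (SGD with momentum and ℓ2 weight decay, plus the stepped learning-rate schedule) can be sketched as plain Python. The hyperparameter values come from the quote; the gradient and parameters are scalar placeholders, not the authors' training code.

```python
# Minimal sketch of the training update described in the paper's setup,
# assuming the standard "velocity" formulation of SGD with momentum.
LR0 = 5e-2                            # initial learning rate
MOMENTUM = 0.6
WEIGHT_DECAY = 1e-5                   # l2 weight-decay coefficient
DECAY_EPOCHS = (30, 50, 60, 70, 80)   # halve the learning rate after each


def learning_rate(epoch):
    # The rate is multiplied by 0.5 after each epoch listed in DECAY_EPOCHS.
    halvings = sum(epoch >= e for e in DECAY_EPOCHS)
    return LR0 * 0.5 ** halvings


def sgd_step(w, grad, velocity, epoch):
    # Weight decay enters as an extra l2-penalty term in the gradient.
    lr = learning_rate(epoch)
    g = grad + WEIGHT_DECAY * w
    velocity = MOMENTUM * velocity - lr * g
    return w + velocity, velocity
```

At epoch 0 the step size is 5 × 10⁻²; by epoch 80 the schedule has applied five halvings, leaving roughly 1.6 × 10⁻³.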