Hard-Attention for Scalable Image Classification
Authors: Athanasios Papadopoulos, Pawel Korus, Nasir Memon
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare our model against hard-attention baselines on ImageNet, achieving higher accuracy with less resources (FLOPs, processing time and memory). We further test our model on the fMoW dataset, where we process satellite images of size up to 896×896 px, getting up to 2.5× faster processing compared to baselines operating on the same resolution, while achieving higher accuracy as well. |
| Researcher Affiliation | Academia | ¹Tandon School of Engineering, New York University; ²AGH University of Science and Technology |
| Pseudocode | No | The paper describes the architecture and learning rule in text and equations, but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states that TNet and BagNet-77 are implemented in TF 2, but it does not provide a specific link or explicit statement about the release of their own source code. |
| Open Datasets | Yes | ImageNet [13] consists of natural images from 1,000 classes. We use the ILSVRC 2012 version, which consists of 1,281,167 training and 50,000 validation images. |
| Dataset Splits | Yes | "We use the ILSVRC 2012 version, which consists of 1,281,167 training and 50,000 validation images." and "They are split in 363,572 training, 53,041 validation and 53,473 testing images." |
| Hardware Specification | Yes | We use a single NVIDIA Quadro RTX 8000 GPU, with 64 GB of RAM, and 20 CPUs to mitigate data pipeline impact. |
| Software Dependencies | Yes | For Saccader and DRAM we use public implementations [21] in TensorFlow (TF) 1. TNet and BagNet-77 are implemented in TF 2. |
| Experiment Setup | Yes | We train TNet with 2 processing levels on images of 224×224 px using class labels only. We train for 200 epochs using the Adam optimizer [35] with initial learning rate 10⁻⁴, that we drop once by a factor of 0.1. We use dropout (keep probability 0.5) in the last layer of feature extraction. We use per-feature regularization with λ_c = λ_r = 0.3. We attend to a fixed number of 3 locations. (A hedged TF 2 sketch of this configuration follows the table.) |
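
The experiment setup above maps onto standard TF 2 / Keras APIs. The following is a minimal sketch of that configuration, not the authors' implementation: the TNet backbone, the hard-attention module, and the per-feature regularization (λ_c = λ_r = 0.3) are not public in this excerpt, so `build_feature_extractor`, the batch size of 64, and the epoch at which the learning rate is dropped are illustrative assumptions.

```python
import tensorflow as tf

NUM_CLASSES = 1000                    # ImageNet (ILSVRC 2012)
NUM_LOCATIONS = 3                     # fixed number of attended locations (not wired into this stub)
EPOCHS = 200
BATCH_SIZE = 64                       # assumption; not stated in the excerpt
STEPS_PER_EPOCH = 1281167 // BATCH_SIZE

def build_feature_extractor():
    """Placeholder backbone; the paper's TNet feature extractor differs."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dropout(0.5),             # dropout with keep probability 0.5 in the last layer
    ])

inputs = tf.keras.Input(shape=(224, 224, 3))      # 224x224 px inputs
features = build_feature_extractor()(inputs)
logits = tf.keras.layers.Dense(NUM_CLASSES)(features)
model = tf.keras.Model(inputs, logits)

# Initial learning rate 1e-4, dropped once by a factor of 0.1;
# the drop point (epoch 150) is an assumption -- the excerpt only says "once".
lr_schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[150 * STEPS_PER_EPOCH], values=[1e-4, 1e-5])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])

# model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS)  # train_ds/val_ds: ImageNet input pipelines
```

The piecewise-constant schedule reflects the single learning-rate drop reported in the setup; everything specific to TNet's hard-attention learning rule is omitted because the excerpt does not describe it in enough detail to reproduce.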