Learn to Pay Attention

Authors: Saumya Jetley, Nicholas A. Lord, Namhoon Lee, Philip H. S. Torr

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental observations provide clear evidence to this effect: the learned attention maps neatly highlight the regions of interest while suppressing background clutter. Consequently, the proposed function is able to bootstrap standard CNN architectures for the task of image classification, demonstrating superior generalisation over 6 unseen benchmark datasets. When binarised, our attention maps outperform other CNN-based attention maps, traditional saliency maps, and top object proposals for weakly supervised segmentation as demonstrated on the Object Discovery dataset. We also demonstrate improved robustness against the fast gradient sign method of adversarial attack.
Researcher Affiliation | Academia | Saumya Jetley, Nicholas A. Lord, Namhoon Lee & Philip H. S. Torr, Department of Engineering Science, University of Oxford, {sjetley,nicklord,namhoon,phst}@robots.ox.ac.uk
Pseudocode | No | The paper describes the method using prose and mathematical equations (Equations 1, 2 and 3), but no clearly labeled pseudocode or algorithm blocks are provided (an illustrative sketch of the attention computation is given after this table).
Open Source Code | Yes | We implement and evaluate the above-discussed progressive attention approach as well as the proposed attention mechanism with the VGG architecture using the codebase provided here: https://github.com/szagoruyko/cifar.torch. The code for CIFAR dataset normalisation is included in the repository. For the ResNet architecture, we make the attention-related modifications to the network specification provided here: https://github.com/szagoruyko/wide-residual-networks/tree/fp16/models.
Open Datasets | Yes | We evaluate the proposed attention models on CIFAR-10 (Krizhevsky & Hinton, 2009), CIFAR-100 (Krizhevsky & Hinton, 2009), SVHN (Netzer et al., 2011) and CUB-200-2011 (Wah et al., 2011) for the task of classification. We use the attention-incorporating VGG model trained on CUB-200-2011 for investigating robustness to adversarial attacks. For cross-domain classification, we test on 6 standard benchmarks including STL (Coates et al., 2010), Caltech-256 (Griffin et al., 2007) and Action-40 (Yao et al., 2011). We use the Object Discovery dataset (Rubinstein et al., 2013) for evaluating weakly supervised segmentation performance. A detailed summary of these datasets can be found in Table 4.
Dataset Splits | No | A detailed summary of these datasets can be found in Table 4 ('Summary of datasets used for experiments across different tasks'; C: classification, C-c: cross-domain classification, S: segmentation), which lists total/train/test/extra sizes, e.g. CIFAR-10 (Krizhevsky & Hinton, 2009) is given as 60,000 / 50,000 / 10,000 / -. While training and test sizes are provided, no explicit validation split (counts or percentages for a held-out validation set) is specified, which limits reproducibility of model selection.
Hardware Specification | Yes | All models are implemented in Torch and trained with an NVIDIA Titan-X GPU. Training takes around one to two days depending on the model and datasets.
Software Dependencies | No | All models are implemented in Torch and trained with an NVIDIA Titan-X GPU. The paper names Torch as the implementation framework but does not specify its version number or any other software dependencies with version numbers.
Experiment Setup | Yes | We use a stochastic gradient descent (SGD) optimiser with a batch size of 128, learning rate decay of 10^-7, weight decay of 5 x 10^-4, and momentum of 0.9. The initial learning rate for CIFAR experiments is 1 and for SVHN is 0.1. The learning rate is scaled by 0.5 every 25 epochs and we train over 300 epochs for convergence. For ResNet, the networks are trained using an SGD optimizer with a batch size of 64, initial learning rate of 0.1, weight decay of 5 x 10^-4, and a momentum of 0.9. The learning rate is multiplied by 0.2 after 60, 120 and 160 epochs. The network is trained for 200 epochs until convergence. (A rough code sketch of these settings follows the table.)
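
For illustration only, here is a minimal sketch of the kind of attention computation the paper's Equations 1-3 describe: compatibility scores between intermediate local features l_i and a global feature g, a softmax over spatial positions, and an attention-weighted sum that replaces global pooling before the classifier. This is not the authors' code; the module name, the 1x1-convolution realisation of the weight vector u, and the assumption that g has already been projected to the same channel dimension as l_i are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParametrisedAttention(nn.Module):
    """Illustrative attention head: c_i = <u, l_i + g>, a = softmax(c), g_a = sum_i a_i * l_i."""
    def __init__(self, channels):
        super().__init__()
        # u realised as a 1x1 convolution producing one compatibility score per spatial position
        self.u = nn.Conv2d(channels, 1, kernel_size=1, bias=False)

    def forward(self, local_feats, global_feat):
        # local_feats: (B, C, H, W) intermediate feature map l_i; global_feat: (B, C) global feature g
        g = global_feat[:, :, None, None]              # broadcast g over all spatial positions
        scores = self.u(local_feats + g)               # compatibility scores, shape (B, 1, H, W)
        attn = F.softmax(scores.flatten(2), dim=-1)    # normalise over the H*W positions
        attn = attn.view_as(scores)
        g_a = (attn * local_feats).flatten(2).sum(-1)  # attention-weighted descriptor, shape (B, C)
        return g_a, attn
```

In the paper, such a descriptor is produced for each monitored layer and the descriptors are combined (e.g. by concatenation) before the final classifier; the sketch above covers a single layer.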
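
The training settings quoted in the Experiment Setup row translate roughly into the following PyTorch configuration. This is an approximation for reference only: the original code is Lua Torch, the model stub and the commented training call are placeholders, and Torch's learning rate decay of 10^-7 has no direct PyTorch equivalent.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # placeholder for the attention-augmented VGG / ResNet

# VGG on CIFAR/SVHN: batch size 128, initial lr 1.0 (0.1 for SVHN),
# weight decay 5e-4, momentum 0.9, lr halved every 25 epochs, 300 epochs total.
vgg_optimiser = torch.optim.SGD(model.parameters(), lr=1.0, momentum=0.9, weight_decay=5e-4)
vgg_schedule = torch.optim.lr_scheduler.StepLR(vgg_optimiser, step_size=25, gamma=0.5)

# ResNet: batch size 64, initial lr 0.1, weight decay 5e-4, momentum 0.9,
# lr multiplied by 0.2 after epochs 60, 120 and 160, 200 epochs in total.
resnet_optimiser = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
resnet_schedule = torch.optim.lr_scheduler.MultiStepLR(resnet_optimiser, milestones=[60, 120, 160], gamma=0.2)

for epoch in range(300):
    # train_one_epoch(model, train_loader, vgg_optimiser)  # placeholder training step
    vgg_schedule.step()  # decay the learning rate once per epoch
```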