A^2-Nets: Double Attention Networks

Authors: Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, Jiashi Feng

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive ablation studies and experiments on both image and video recognition tasks for evaluating its performance.
Researcher Affiliation | Collaboration | Yunpeng Chen (National University of Singapore, chenyunpeng@u.nus.edu); Yannis Kalantidis (Facebook Research, yannisk@fb.com); Jianshu Li (National University of Singapore, jianshu@u.nus.edu); Shuicheng Yan (Qihoo 360 AI Institute and National University of Singapore, eleyans@nus.edu.sg); Jiashi Feng (National University of Singapore, elefjia@nus.edu.sg)
Pseudocode | No | The paper provides a computational graph in Figure 2, but it includes no pseudocode or clearly labeled algorithm blocks. (A hedged PyTorch sketch of the block follows the table.)
Open Source Code | No | Code and trained models will be released on GitHub soon.
Open Datasets | Yes | Kinetics [12] video recognition dataset, ImageNet-1k [13] image classification dataset, UCF-101 [20].
Dataset Splits | Yes | For image classification, standard single-model, single 224×224 center-crop validation accuracy is reported, following [9, 10]. UCF-101 contains about 13,320 videos from 101 action categories and has three train/test splits. (See the validation-transform sketch after the table.)
Hardware Specification | Yes | All experiments are conducted using a distributed K80 GPU cluster.
Software Dependencies | No | We use MXNet [3] for the image classification experiments and PyTorch [18] for the video classification tasks. The paper names the software packages but does not specify their version numbers.
Experiment Setup | Yes | The base learning rate is set to 0.2, reduced by a factor of 0.1 at the 20k-th and 30k-th iterations, and training terminates at the 37k-th iteration. We use 32 GPUs per experiment with a total batch size of 512, training from scratch. In another configuration, the base learning rate is set to 0.1 and decreased by a factor of 0.1 when training accuracy saturates. The video network takes 8 frames (sampling stride: 8) as input and is trained for 32k iterations with a total batch size of 512 using 64 GPUs; its initial learning rate is set to 0.04 and decreased in a stepwise manner when training accuracy saturates. (A scheduler sketch for the first schedule follows the table.)
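
Since the paper ships only a computational graph (Figure 2) rather than pseudocode, here is a minimal PyTorch sketch of the double-attention block it depicts: feature gathering via second-order attention pooling, then feature distribution back to every location. The class name DoubleAttention, the channel parameters c_m and c_n, and the residual output convolution are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoubleAttention(nn.Module):
    """Sketch of the A^2 block: gather global descriptors, then distribute."""

    def __init__(self, in_channels: int, c_m: int, c_n: int):
        super().__init__()
        # Three 1x1 convolutions produce the features (A), the gathering
        # attention maps (B), and the distribution weights (V) of Figure 2.
        self.conv_a = nn.Conv2d(in_channels, c_m, kernel_size=1)
        self.conv_b = nn.Conv2d(in_channels, c_n, kernel_size=1)
        self.conv_v = nn.Conv2d(in_channels, c_n, kernel_size=1)
        # Project back to the input width so the block can be residual.
        self.conv_out = nn.Conv2d(c_m, in_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, _, h, w = x.shape
        a = self.conv_a(x).view(n, -1, h * w)                     # (n, c_m, h*w)
        # Softmax over positions: each of the c_n attention maps sums to 1.
        b = F.softmax(self.conv_b(x).view(n, -1, h * w), dim=-1)  # (n, c_n, h*w)
        # Softmax over the c_n descriptors available at each position.
        v = F.softmax(self.conv_v(x).view(n, -1, h * w), dim=1)   # (n, c_n, h*w)
        # Step 1 -- gathering: second-order attention pooling.
        g = torch.bmm(a, b.transpose(1, 2))                       # (n, c_m, c_n)
        # Step 2 -- distribution: each position selects its mix of descriptors.
        z = torch.bmm(g, v).view(n, -1, h, w)                     # (n, c_m, h, w)
        return x + self.conv_out(z)                               # residual (assumption)
```

For example, DoubleAttention(512, c_m=128, c_n=128) would drop into a backbone stage that outputs 512 channels.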
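
For the single 224×224 center-crop validation protocol noted in the Dataset Splits row, the conventional torchvision pipeline looks like the sketch below. The 256-pixel short-side resize and the ImageNet normalization statistics are the common convention referenced via [9, 10], not values stated in the paper.

```python
from torchvision import transforms

# Standard single-crop ImageNet validation preprocessing (a sketch of the
# usual convention; the exact resize size is an assumption).
val_transform = transforms.Compose([
    transforms.Resize(256),       # resize the short side to 256 pixels
    transforms.CenterCrop(224),   # a single 224x224 center crop
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),  # ImageNet statistics
])
```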
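
The first schedule in the Experiment Setup row (base learning rate 0.2, 0.1× decay at the 20k-th and 30k-th iterations, termination at 37k) maps directly onto an iteration-level step scheduler. A minimal sketch under assumptions: the placeholder model and the momentum and weight-decay values are common defaults, not taken from the excerpt.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR

model = torch.nn.Conv2d(3, 64, 3)  # placeholder module (assumption)
# Base LR 0.2 as reported; momentum and weight decay are assumed defaults.
optimizer = SGD(model.parameters(), lr=0.2, momentum=0.9, weight_decay=1e-4)
# Multiply the learning rate by 0.1 at the 20k-th and 30k-th iterations.
scheduler = MultiStepLR(optimizer, milestones=[20_000, 30_000], gamma=0.1)

for iteration in range(37_000):   # terminate at the 37k-th iteration
    # ... forward pass, loss.backward() ...
    optimizer.step()
    scheduler.step()              # stepped per iteration, not per epoch
```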