Where are they looking?

Authors: Adria Recasens, Aditya Khosla, Carl Vondrick, Antonio Torralba

NeurIPS 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The quantitative evaluation shows that our approach produces reliable results, even when viewing only the back of the head. While our method outperforms several baseline approaches, we are still far from reaching human performance on this task.
Researcher Affiliation | Academia | Adrià Recasens, Aditya Khosla, Carl Vondrick, Antonio Torralba, Massachusetts Institute of Technology, {recasens, khosla, vondrick, torralba}@csail.mit.edu
Pseudocode | No | The paper does not include any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Our model, code and dataset are available for download at http://gazefollow.csail.mit.edu.
Open Datasets | Yes | Our model, code and dataset are available for download at http://gazefollow.csail.mit.edu. We used several major datasets that contain people as a source of images: 1,548 images from SUN [19], 33,790 images from MS COCO [13], 9,135 images from Actions 40 [20], 7,791 images from PASCAL [4], 508 images from the ImageNet detection challenge [17] and 198,097 images from the Places dataset [22].
Dataset Splits | Yes | We use about 4,782 people of our dataset for testing and the rest for training. We ensured that every person in an image is part of the same split, and to avoid bias, we picked images for testing such that the fixation locations were uniformly distributed across the image. (A sketch of this split procedure follows the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | We implemented the network using Caffe [10].
Experiment Setup | Yes | The last convolutional layer of the saliency pathway has a 1×1×256 convolution kernel (i.e., K = 256). The remaining fully connected layers in the gaze pathway are of sizes 100, 400, 200, and 169 respectively. The saliency map and gaze mask are 13×13 in size (i.e., D = 13), and we use 5 shifted grids of size 5×5 each (i.e., N = 5). For learning, we augment our training data with flips and random crops with the fixation locations adjusted accordingly. (A sketch of these layer sizes follows the table.)
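
The Dataset Splits row describes the split protocol only in prose. Below is a minimal sketch of how such a split could be reproduced: keep every person from an image in the same split, and draw test images so that fixation locations cover the image roughly uniformly. The annotation format (dicts with 'image_id' and a normalized 'fixation' point), the function name, the 5×5 stratification grid, and the round-robin sampling are illustrative assumptions, not the authors' released split.

```python
import random
from collections import defaultdict

def split_annotations(annotations, n_test_people=4782, grid=5, seed=0):
    """Illustrative train/test split under the constraints quoted above.

    `annotations` is assumed to be a list of dicts with keys 'image_id' and
    'fixation' (a normalized (x, y) in [0, 1]); this format is an assumption,
    not the GazeFollow release format.
    """
    rng = random.Random(seed)

    # Group annotated people by image so an image never straddles the split.
    by_image = defaultdict(list)
    for ann in annotations:
        by_image[ann['image_id']].append(ann)

    # Bucket images by the grid cell of (one of) their fixation points, then
    # sample the buckets round-robin so test fixations spread roughly uniformly.
    buckets = defaultdict(list)
    for image_id, people in by_image.items():
        x, y = people[0]['fixation']
        cell = (min(int(x * grid), grid - 1), min(int(y * grid), grid - 1))
        buckets[cell].append(image_id)
    for ids in buckets.values():
        rng.shuffle(ids)

    test_ids, n_people = set(), 0
    while n_people < n_test_people and any(buckets.values()):
        for cell in list(buckets):
            if buckets[cell] and n_people < n_test_people:
                image_id = buckets[cell].pop()
                test_ids.add(image_id)
                n_people += len(by_image[image_id])

    test = [a for a in annotations if a['image_id'] in test_ids]
    train = [a for a in annotations if a['image_id'] not in test_ids]
    return train, test
```

The grid granularity is arbitrary; what matters for the quoted protocol is the image-level grouping and the even coverage of fixation locations in the test set.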
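
The Experiment Setup row quotes the layer sizes of the two pathways. The sketch below wires those numbers together in PyTorch (the paper used Caffe); the input feature dimensions, activation choices, and module names are assumptions made only to make the sizes concrete, not a reimplementation of the released model.

```python
import torch
import torch.nn as nn

class GazeFollowHeadSketch(nn.Module):
    """Minimal sketch of the quoted layer sizes (not the released Caffe model).

    Only 256, the FC sizes 100/400/200/169, the 13x13 maps and the 5 shifted
    5x5 grids come from the quoted setup; everything else is an assumption.
    """

    def __init__(self, saliency_channels=256, gaze_in_features=784,
                 d=13, n_grids=5, grid_size=5):
        super().__init__()
        # Saliency pathway: 1x1x256 convolution producing a 13x13 saliency map.
        self.saliency_conv = nn.Conv2d(saliency_channels, 1, kernel_size=1)
        # Gaze pathway: FC layers of sizes 100, 400, 200 and 169 (= 13*13 mask).
        self.gaze_fc = nn.Sequential(
            nn.Linear(gaze_in_features, 100), nn.ReLU(inplace=True),
            nn.Linear(100, 400), nn.ReLU(inplace=True),
            nn.Linear(400, 200), nn.ReLU(inplace=True),
            nn.Linear(200, d * d), nn.Sigmoid(),
        )
        # One classifier per shifted grid: 5 grids of 5x5 = 25 cells each.
        self.shifted_grids = nn.ModuleList(
            nn.Linear(d * d, grid_size * grid_size) for _ in range(n_grids)
        )
        self.d = d

    def forward(self, saliency_features, gaze_features):
        saliency_map = self.saliency_conv(saliency_features)        # (B, 1, 13, 13)
        gaze_mask = self.gaze_fc(gaze_features).view(-1, 1, self.d, self.d)
        combined = (saliency_map * gaze_mask).flatten(1)             # element-wise product
        return [grid(combined) for grid in self.shifted_grids]       # 5 x (B, 25) logits

# Shape check with dummy inputs (batch of 2):
# outs = GazeFollowHeadSketch()(torch.randn(2, 256, 13, 13), torch.randn(2, 784))
# assert [o.shape for o in outs] == [torch.Size([2, 25])] * 5
```

The convolutional feature extractors feeding both pathways, the training loss, and the test-time aggregation of the shifted-grid outputs into a single heatmap are all omitted from this sketch.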