Toward Semantic Gaze Target Detection

Authors: Samy Tafasca, Anshul Gupta, Victor Bros, Jean-Marc Odobez

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | we extend the gaze following task, and introduce a novel architecture that simultaneously predicts the localization and semantic label of the gaze target. We devise a pseudo-annotation pipeline for the GazeFollow dataset, propose a new benchmark, develop an experimental protocol and design a suitable baseline for comparison. Our method sets a new state-of-the-art on the main GazeFollow benchmark for localization and achieves competitive results in the recognition task on both datasets compared to the baseline, with 40% fewer parameters. (See the architecture sketch after this table.)
Researcher Affiliation | Academia | Samy Tafasca, Idiap Research Institute, École Polytechnique Fédérale de Lausanne, stafasca@idiap.ch
Pseudocode | No | The paper provides detailed descriptions of the architecture components, data flow, and loss functions, but it does not include any explicitly labeled pseudocode blocks or algorithm listings.
Open Source Code | No | We hope that our code, datasets, model checkpoints and research insights will pave the way for future research on semantic gaze following.
Open Datasets | Yes | We use both the GazeHOI and GazeFollow [42] datasets. [...] The final dataset used in our experiments features a vocabulary of 463 object labels and 55995 gaze instances which we split into 47214/3781/5000 for the train, val and test sets.
Dataset Splits | Yes | The final dataset used in our experiments features a vocabulary of 463 object labels and 55995 gaze instances which we split into 47214/3781/5000 for the train, val and test sets.
Hardware Specification | Yes | All experiments are done on either a single RTX 3090 (24 GB of memory) or H100 (80 GB of memory) depending on memory requirements, and last for 2 to 10 hours each.
Software Dependencies | No | We use CLIP [40] as our text encoder, and we kept it frozen. [...] The backbone in the gaze encoder is a ResNet-18 pretrained on Gaze360 [28], while the image encoder is a ViT-base model [12] initialized from a multimodal MAE [3]. (See the CLIP label-embedding sketch after this table.)
Experiment Setup | Yes | Our architecture processes the input scene image at a resolution of 256×256, and the head crop at 224×224, to produce an output heatmap of 64×64 and a label embedding of size 512. The dimension d inside the decoder is set to 96. The ground-truth heatmap uses a Gaussian of σ = 3 placed around the gaze point. We use CLIP as the text encoder and keep it frozen. We set the number of blocks in the gaze decoder to 2, and the number of people during training to Np = 1. [...] For the main experiments on GazeFollow, we use the AdamW optimizer with a learning rate of 2e-4 and weight decay of 0.003. We train for 20 epochs, with a warmup of 4 epochs, and a cosine annealing schedule. The batch size is set to 300, and the loss coefficients λ_hm, λ_lab, and λ_ang are set to 1000, 1, and 3 respectively. (See the ground-truth heatmap and training-setup sketches after this table.)
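The architecture rows above report a 64×64 output heatmap, a 512-d label embedding, and a decoder dimension d = 96. Below is a minimal PyTorch sketch of such a dual-output head, with one branch for localization and one for the semantic label embedding; the module and tensor names (`SemanticGazeHead`, `gaze_token`, `label_token`) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a dual-output head: gaze heatmap + semantic label embedding.
# Shapes follow the numbers quoted above (64x64 heatmap, 512-d label embedding,
# decoder dimension d = 96); module names are illustrative, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticGazeHead(nn.Module):
    def __init__(self, d: int = 96, heatmap_size: int = 64, embed_dim: int = 512):
        super().__init__()
        self.heatmap_size = heatmap_size
        # Predicts one logit per heatmap cell from a decoded gaze token.
        self.heatmap_head = nn.Linear(d, heatmap_size * heatmap_size)
        # Projects a decoded label token into the 512-d CLIP text embedding space.
        self.label_head = nn.Linear(d, embed_dim)

    def forward(self, gaze_token: torch.Tensor, label_token: torch.Tensor):
        # gaze_token, label_token: (batch, d) decoder outputs for one person.
        b = gaze_token.shape[0]
        heatmap = self.heatmap_head(gaze_token).view(b, self.heatmap_size, self.heatmap_size)
        label_emb = F.normalize(self.label_head(label_token), dim=-1)
        return heatmap, label_emb


# Example: two decoded tokens of dimension d = 96 for a batch of 4 people.
head = SemanticGazeHead()
hm, emb = head(torch.randn(4, 96), torch.randn(4, 96))
print(hm.shape, emb.shape)  # torch.Size([4, 64, 64]) torch.Size([4, 512])
```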
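Since the text encoder is a frozen CLIP model and the label embedding is 512-dimensional, the vocabulary of object-label embeddings can be precomputed once. A sketch using the openai `clip` package and the ViT-B/32 variant (both assumptions; the paper only states that CLIP is used as the text encoder and kept frozen, and the prompt template is illustrative):

```python
# Sketch: precompute 512-d label embeddings with a frozen CLIP text encoder.
# Assumes the openai "clip" package and the ViT-B/32 variant.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model.eval()  # frozen: no gradient updates to the text encoder

labels = ["cup", "laptop", "book"]  # illustrative subset of the 463-label vocabulary
tokens = clip.tokenize([f"a photo of a {label}" for label in labels]).to(device)

with torch.no_grad():
    text_emb = model.encode_text(tokens)              # (num_labels, 512)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# A predicted label embedding can then be matched against this vocabulary by
# cosine similarity, e.g. argmax over (text_emb @ predicted_embedding).
```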
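The ground-truth target described in the setup row (a Gaussian of σ = 3 placed around the gaze point on a 64×64 grid) can be generated as follows; normalizing the peak to 1 is an assumption, and the function name is hypothetical.

```python
# Sketch: 64x64 ground-truth heatmap with a Gaussian (sigma = 3) centred on the
# gaze point, as described above; peak normalization to 1 is an assumption.
import numpy as np

def gaze_heatmap(gaze_xy, size=64, sigma=3.0):
    """gaze_xy: gaze point in normalized [0, 1] image coordinates."""
    cx, cy = gaze_xy[0] * (size - 1), gaze_xy[1] * (size - 1)
    ys, xs = np.mgrid[0:size, 0:size]
    hm = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return hm / hm.max()

hm = gaze_heatmap((0.25, 0.6))
print(hm.shape, hm.max())  # (64, 64) 1.0
```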
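Finally, the reported optimization settings (AdamW, learning rate 2e-4, weight decay 0.003, 4 warmup epochs, cosine annealing over 20 epochs, loss weights 1000/1/3) map onto standard PyTorch components. A hedged training-loop sketch, where the model and the three loss terms are placeholders and per-epoch scheduler stepping is an assumption:

```python
# Sketch of the reported optimization setup: AdamW (lr 2e-4, weight decay 0.003),
# 4 warmup epochs followed by cosine annealing over 20 epochs, and a weighted sum
# of heatmap, label, and angular losses (1000 / 1 / 3). Model and losses are stubs.
import torch

model = torch.nn.Linear(8, 8)  # placeholder for the full gaze architecture
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.003)

epochs, warmup_epochs = 20, 4
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.1, total_iters=warmup_epochs)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs - warmup_epochs)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs])

lambda_hm, lambda_lab, lambda_ang = 1000.0, 1.0, 3.0

for epoch in range(epochs):
    # ... iterate over batches of size 300 and compute the three loss terms ...
    loss_hm = loss_lab = loss_ang = torch.tensor(0.0, requires_grad=True)  # stubs
    loss = lambda_hm * loss_hm + lambda_lab * loss_lab + lambda_ang * loss_ang
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # stepped once per epoch (an assumption)
```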