Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Toward Human Deictic Gesture Target Estimation

Authors: Xu Cao, Pranav Virupaksha, Sangmin Lee, Bolin Lai, Wenqi Jia, Jintai Chen, James M. Rehg

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that our approach of incorporating gaze cues significantly improves performance in identifying deictic gesture targets, validating the intuition that where they look helps convey what they point at.
Researcher Affiliation | Academia | 1 University of Illinois Urbana-Champaign; 2 Georgia Institute of Technology; 3 Korea University; 4 The Hong Kong University of Science and Technology (Guangzhou). EMAIL
Pseudocode | No | The paper describes the model architecture and training stages but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | All data, code will be made publicly available upon acceptance. Code of TransGesture is available at GitHub.com/IrohXu/TransGesture.
Open Datasets | Yes | In this paper, we address the previously mentioned gaps by introducing GestureTarget, a novel benchmark dataset for deictic gesture target estimation, and a Transformer-based gaze-aware gesture model, TransGesture, that learns to predict the targets of human gestures by leveraging multimodal cues. GestureTarget is, to our knowledge, the first large-scale dataset focused on images of people engaged in deictic gestures (such as pointing, reaching, or showing an object) with annotations of the intended target of each gesture. All data, code will be made publicly available upon acceptance. We first pre-train the gaze decoder on the GazeFollow dataset [23] to precisely extract gaze-related features.
Dataset Splits | Yes | All models are trained and evaluated on the proposed GestureTarget dataset, using an 80:20 split for training and testing. Following practices from PASCAL VOC Action Recognition [71], where ground-truth bounding boxes for subjects are provided during both training and evaluation, we similarly assume availability of the subject person's bounding box at both train and test time. For gaze decoder pre-training, we use the official GazeFollow training dataset [23].
Hardware Specification | Yes | All experiments are conducted with 1 NVIDIA H100 GPU.
Software Dependencies | No | The paper mentions using the Adam optimizer and cosine learning rate scheduler, and refers to models like YOLOv11 and RetinaFace-ResNet50, but does not provide specific version numbers for software libraries or frameworks like Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | All models are trained for 25 epochs using the Adam optimizer and a cosine learning rate scheduler with an initial rate of 1e-3 and batch size 32, followed by an additional 5 epochs with a reduced learning rate of 1e-5. All models in our study generate gesture target masks with a resolution of 128 × 128. We freeze the visual encoder during training, using its default image and patch sizes. For the gesture-gaze fusion module, we employ three Transformer layers featuring joint cross-attention, each configured with 16 attention heads and a 1024-dimensional MLP. To enhance model robustness and generalization, we apply diverse augmentation techniques during training, including head/body bounding box jittering, color jittering, random resizing and cropping, random horizontal flipping, random rotations, and random masking of scene patches. We train our model using a joint multitask objective that combines pixelwise binary cross-entropy loss for gesture target segmentation and focal loss for gesture existence prediction. The ground truth for the segmentation task is a binary mask $Y_{\text{mask}} \in \mathbb{R}^{H \times W}$, while the gesture existence is supervised with a binary label $Y_{\text{exist}}$. We use focal loss to address class imbalance in the existence prediction task. The overall training loss is defined as: $\mathcal{L}_{\text{total}} = (1 - \beta)\,\mathcal{L}_{\text{BCE}}(Y_{\text{mask}}, \hat{Y}_{\text{mask}}) + \beta\,\mathcal{L}_{\text{focal}}(Y_{\text{exist}}, \hat{Y}_{\text{exist}})$.
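
The joint objective quoted in the Experiment Setup row can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the authors' implementation: the function names (`bce`, `focal`, `total_loss`) are hypothetical, the values `beta=0.5` and `gamma=2.0` are assumptions (the quoted text gives neither), the focal term omits any class-balancing weight, and the mask is passed as a flattened list of pixels rather than an H × W tensor.

```python
import math


def bce(mask_true, mask_prob, eps=1e-7):
    """Mean pixelwise binary cross-entropy over a flattened mask."""
    total = 0.0
    for y, p in zip(mask_true, mask_prob):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(mask_true)


def focal(exist_true, exist_prob, gamma=2.0, eps=1e-7):
    """Binary focal loss for one existence prediction.

    gamma=2.0 is an assumed value; no alpha weighting is applied.
    """
    p = min(max(exist_prob, eps), 1.0 - eps)
    p_t = p if exist_true == 1 else 1.0 - p  # probability of the true class
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def total_loss(mask_true, mask_prob, exist_true, exist_prob,
               beta=0.5, gamma=2.0):
    """L_total = (1 - beta) * L_BCE + beta * L_focal (beta is assumed)."""
    seg = bce(mask_true, mask_prob)
    exist = focal(exist_true, exist_prob, gamma)
    return (1.0 - beta) * seg + beta * exist


# Toy usage: a two-pixel mask and a positive existence label.
loss = total_loss([1, 0], [0.5, 0.5], 1, 0.5)
```

Setting `beta=0` reduces the objective to the segmentation loss alone, and `beta=1` to the existence loss alone, which makes the weighting easy to sanity-check.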