Understanding Embodied Reference with Touch-Line Transformer

Authors: Yang Li, Xiaoxue Chen, Hao Zhao, Jiangtao Gong, Guyue Zhou, Federico Rossano, Yixin Zhu

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on the YouRefIt dataset demonstrate that our method yields a +25.0% accuracy improvement under the 0.75 IoU criterion, hence closing 63.6% of the performance gap between models and humans. (See the IoU accuracy sketch after the table.)
Researcher Affiliation | Collaboration | (1) Institute for AI Industry Research, Tsinghua University; (2) Department of Cognitive Science, UCSD; (3) Institute for Artificial Intelligence, Peking University. This work is supported in part by the National Key R&D Program of China (2022ZD0114900), the Beijing Nova Program, and Beijing Didi Chuxing Technology Co., Ltd.
Pseudocode | No | The paper describes the model architecture and losses in text, but does not provide structured pseudocode or algorithm blocks.
Open Source Code | Yes | https://yang-li-2000.github.io/Touch-Line-Transformer
Open Datasets | Yes | We use the YouRefIt dataset (Chen et al., 2021) with 2,950 training instances and 1,245 test instances.
Dataset Splits | Yes | We use the YouRefIt dataset (Chen et al., 2021) with 2,950 training instances and 1,245 test instances.
Hardware Specification | Yes | We use NVIDIA A100 GPUs.
Software Dependencies | No | We generate a visual embedding vector by extracting visual features from input images with a pre-trained ResNet (He et al., 2016) backbone... Meanwhile, we generate a textual embedding vector from input texts using a pre-trained BERT (Liu et al., 2019). Specifically, we first use Mask R-CNN X-101 (Massa & Girshick, 2018) to produce human masks. After that, we expand the human masks created by Mask R-CNN to both the left and right sides to entirely cover the edges of the humans. We ensure that the expanded mask never encroaches on regions occupied by the ground-truth bounding box for the referent. After that, we feed the expanded masks into MAT (Li et al., 2022). (No specific version numbers for software dependencies are provided.) (See the mask-expansion sketch after the table.)
Experiment Setup | Yes | During training, we use the Adam variant, AMSGrad (Reddi et al., 2018), and train our models for 200 epochs. We set the learning rate to 5e-5 except for the text encoder, whose learning rate is 1e-4. The sum of the batch sizes across all graphics cards is 56. The total number of queries is 20: 15 for objects and 5 for gestural keypoints. (See the optimizer sketch after the table.)
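
The Research Type row cites accuracy under a 0.75 IoU criterion. As a point of reference, here is a minimal sketch of how such a thresholded detection accuracy is commonly computed from predicted and ground-truth boxes in (x1, y1, x2, y2) format; the function names are illustrative and not taken from the authors' code.

```python
import numpy as np

def box_iou(pred, gt):
    """Intersection-over-Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0

def accuracy_at_iou(preds, gts, threshold=0.75):
    """Fraction of instances whose predicted box overlaps the ground
    truth with IoU at or above the threshold (e.g. 0.75)."""
    hits = [box_iou(p, g) >= threshold for p, g in zip(preds, gts)]
    return float(np.mean(hits))
```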
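
The Software Dependencies row describes expanding the Mask R-CNN human masks to the left and right before feeding them to the MAT inpainting model, while keeping the expansion out of the referent's ground-truth box. Below is a minimal sketch of that expansion step, assuming a binary NumPy mask and integer box coordinates; the dilation width and helper name are hypothetical, since the paper does not report them.

```python
import numpy as np

def expand_human_mask(mask, gt_box, width=10):
    """Dilate a binary human mask horizontally by `width` pixels on each
    side, then clear any pixels that fall inside the referent's
    ground-truth box (x1, y1, x2, y2) so the referent is never masked out."""
    expanded = mask.astype(bool).copy()
    for shift in range(1, width + 1):
        expanded[:, shift:] |= mask[:, :-shift].astype(bool)   # grow to the right
        expanded[:, :-shift] |= mask[:, shift:].astype(bool)   # grow to the left
    x1, y1, x2, y2 = gt_box
    expanded[y1:y2, x1:x2] = False  # keep the referent region untouched
    return expanded.astype(np.uint8)
```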
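
The Experiment Setup row reports AMSGrad, 200 epochs, a 5e-5 learning rate with 1e-4 for the text encoder, and a total batch size of 56. A PyTorch-style sketch of that optimizer configuration is below; the `text_encoder` attribute and the parameter-group split are assumptions about how the model is organized, not the authors' exact code.

```python
import torch

# Reported training schedule: 200 epochs, total batch size 56 across GPUs.
def build_optimizer(model):
    # Separate the (hypothetical) text encoder's parameters so it can use
    # its own learning rate, as described in the experiment setup.
    text_params = list(model.text_encoder.parameters())
    text_ids = {id(p) for p in text_params}
    other_params = [p for p in model.parameters() if id(p) not in text_ids]
    return torch.optim.Adam(
        [
            {"params": other_params, "lr": 5e-5},  # base learning rate
            {"params": text_params, "lr": 1e-4},   # text encoder learning rate
        ],
        amsgrad=True,  # the AMSGrad variant of Adam (Reddi et al., 2018)
    )
```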