Understanding Embodied Reference with Touch-Line Transformer
Authors: Yang Li, Xiaoxue Chen, Hao Zhao, Jiangtao Gong, Guyue Zhou, Federico Rossano, Yixin Zhu
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the YouRefIt dataset demonstrate that our method yields a +25.0% accuracy improvement under the 0.75 IoU criterion, closing 63.6% of the performance gap between models and humans. (A minimal IoU sketch follows the table.) |
| Researcher Affiliation | Collaboration | Institute for AI Industry Research, Tsinghua University; Department of Cognitive Science, UCSD; Institute for Artificial Intelligence, Peking University. This work is supported in part by the National Key R&D Program of China (2022ZD0114900), the Beijing Nova Program, and Beijing Didi Chuxing Technology Co., Ltd. |
| Pseudocode | No | The paper describes the model architecture and losses in text, but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://yang-li-2000.github.io/Touch-Line-Transformer |
| Open Datasets | Yes | We use the YouRefIt dataset (Chen et al., 2021) with 2,950 training instances and 1,245 test instances. |
| Dataset Splits | Yes | We use the YouRefIt dataset (Chen et al., 2021) with 2,950 training instances and 1,245 test instances. |
| Hardware Specification | Yes | We use NVIDIA A100 GPUs. |
| Software Dependencies | No | We generate a visual embedding vector by extracting visual features from input images with a pre-trained ResNet (He et al., 2016) backbone... Meanwhile, we generate a textual embedding vector from input texts using a pre-trained BERT (Liu et al., 2019). Specifically, we first use Mask R-CNN X-101 (Massa & Girshick, 2018) to produce human masks. After that, we expand the human masks created by Mask R-CNN to both the left and right sides to cover the edges of humans entirely. We ensure that the expanded mask never encroaches on regions occupied by the ground-truth bounding box for the referent. After that, we feed the expanded masks into MAT (Li et al., 2022). (No specific version numbers for software dependencies are provided; a mask-expansion sketch follows the table.) |
| Experiment Setup | Yes | During training, we use the Adam variant AMSGrad (Reddi et al., 2018) and train our models for 200 epochs. We set the learning rate to 5e-5 except for the text encoder, whose learning rate is 1e-4. The sum of the batch sizes on all graphics cards is 56. The total number of queries is 20: 15 for objects and 5 for gestural key points. (An optimizer sketch follows the table.) |
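
The accuracy figures quoted in the Research Type row are computed under fixed IoU thresholds. The following is a minimal sketch of that criterion, assuming axis-aligned boxes in (x1, y1, x2, y2) format; the function names and box convention are illustrative, not taken from the paper's released code.

```python
def box_iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def accuracy_at_iou(predictions, ground_truths, threshold=0.75):
    """Fraction of predicted boxes whose IoU with the ground truth meets the
    threshold (e.g. 0.75, as in the quoted +25.0% improvement)."""
    hits = sum(box_iou(p, g) >= threshold
               for p, g in zip(predictions, ground_truths))
    return hits / len(predictions)
```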
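The Software Dependencies row describes expanding Mask R-CNN human masks sideways before inpainting with MAT, while keeping the expansion off the referent's ground-truth box. Below is a hedged sketch of that expansion step under stated assumptions: the function name, the `margin` parameter, integer pixel coordinates, and the per-row dilation strategy are illustrative choices, since the paper provides no code for this step.

```python
import numpy as np

def expand_human_mask(mask, gt_box, margin=10):
    """Widen a boolean human mask (H, W) to the left and right by `margin`
    pixels per row, without letting the expansion encroach on the referent's
    ground-truth box gt_box = (x1, y1, x2, y2) in integer pixel coordinates."""
    expanded = mask.copy()
    h, w = mask.shape
    for y in range(h):
        xs = np.flatnonzero(mask[y])
        if xs.size == 0:
            continue  # no human pixels in this row
        x1 = max(xs.min() - margin, 0)
        x2 = min(xs.max() + margin + 1, w)
        expanded[y, x1:x2] = True
    # Inside the referent's box, keep only the original mask values so the
    # expansion never covers the ground-truth region.
    gx1, gy1, gx2, gy2 = gt_box
    expanded[gy1:gy2, gx1:gx2] = mask[gy1:gy2, gx1:gx2]
    return expanded
```

The expanded mask would then be passed to the MAT inpainting model; that call is omitted here because the paper does not specify its interface.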
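The Experiment Setup row names AMSGrad with a separate learning rate for the text encoder. A minimal PyTorch sketch of such a configuration is shown below; the `model.text_encoder` attribute and the helper name are assumptions for illustration, not the authors' code.

```python
import torch

def build_optimizer(model):
    # Hypothetical attribute: assumes the text encoder is exposed as model.text_encoder.
    text_params = list(model.text_encoder.parameters())
    text_ids = {id(p) for p in text_params}
    other_params = [p for p in model.parameters() if id(p) not in text_ids]
    return torch.optim.Adam(
        [
            {"params": other_params, "lr": 5e-5},  # all modules except the text encoder
            {"params": text_params, "lr": 1e-4},   # text encoder
        ],
        amsgrad=True,  # the AMSGrad variant of Adam (Reddi et al., 2018)
    )
```

The remaining quoted hyperparameters (200 epochs, a total batch size of 56 summed over all GPUs, and 20 queries split as 15 object and 5 gestural key point queries) belong to the training loop and model configuration rather than the optimizer itself.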