FIXMYPOSE: Pose Correctional Captioning and Retrieval
Authors: Hyounghun Kim, Abhay Zala, Graham Burri, Mohit Bansal
AAAI 2021, pp. 13161-13170 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present strong cross-attention baseline models (uni/multimodal, RL, multilingual) and also show that our baselines are competitive with other models when evaluated on other image-difference datasets. We also propose new task-specific metrics (object-match, body-part-match, direction-match) and conduct human evaluation for more reliable evaluation, and we demonstrate a large human-model performance gap suggesting room for promising future work. Finally, to verify the sim-to-real transfer of our FIXMYPOSE dataset, we collect a set of real images and show promising performance on these images. |
| Researcher Affiliation | Academia | Hyounghun Kim*, Abhay Zala*, Graham Burri, Mohit Bansal; Department of Computer Science, University of North Carolina at Chapel Hill; {hyounghk, aszala, ghburri, mbansal}@cs.unc.edu |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Data and code are available: https://fixmypose-unc.github.io. |
| Open Datasets | Yes | we introduce a new captioning dataset named FIXMYPOSE to address this need... Data and code are available: https://fixmypose-unc.github.io. |
| Dataset Splits | Yes | For the pose-correctional-captioning task, we split the dataset into train/val-seen/val-unseen/test-unseen following Anderson et al. (2018b). We assign separate rooms to val-unseen and test-unseen splits for evaluating the model's ability to generalize to unseen environments. The number of task instances for each split is 5,973/562/563/593 (train/val-seen/val-unseen/test-unseen) and the number of descriptions is 5,973/1,686/1,689/1,779. (See the split-size sketch after this table.) |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running experiments. |
| Software Dependencies | No | The paper mentions software components such as ResNet and the Adam optimizer but does not provide specific version numbers for programming languages, libraries, or other software dependencies. |
| Experiment Setup | Yes | We use 512 / 256 as the hidden / word embedding size. We use Adam (Kingma and Ba 2015) as the optimizer. See Appendix in arxiv full version for details. |
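
The split counts quoted in the Dataset Splits row can be restated as a small sanity-check script. This is a minimal sketch, not code from the paper; the `SPLITS` dictionary and the helper function below simply re-encode the per-split task-instance and description counts reported above.

```python
# Minimal sketch of the FIXMYPOSE split sizes reported in the table above.
# Only the counts come from the paper; the structure and helper are illustrative.

SPLITS = {
    # split: (num_task_instances, num_descriptions)
    "train":       (5973, 5973),
    "val-seen":    (562, 1686),
    "val-unseen":  (563, 1689),
    "test-unseen": (593, 1779),
}

def summarize_splits(splits=SPLITS):
    """Print per-split counts plus totals for a quick sanity check."""
    total_inst = sum(n for n, _ in splits.values())
    total_desc = sum(d for _, d in splits.values())
    for name, (n_inst, n_desc) in splits.items():
        print(f"{name:12s} instances={n_inst:5d} descriptions={n_desc:5d}")
    print(f"{'total':12s} instances={total_inst:5d} descriptions={total_desc:5d}")

if __name__ == "__main__":
    summarize_splits()
```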
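Likewise, the reported experiment setup (hidden size 512, word-embedding size 256, Adam optimizer) can be sketched as a toy PyTorch configuration. Only those three values come from the quoted text; the decoder architecture, vocabulary size, and learning rate below are placeholder assumptions, since the paper defers the remaining details to the appendix of the arXiv version.

```python
# Toy configuration reflecting the reported hyperparameters.
# The decoder is a stand-in, not the paper's cross-attention model.

import torch
import torch.nn as nn

HIDDEN_SIZE = 512      # reported in the paper
WORD_EMB_SIZE = 256    # reported in the paper
VOCAB_SIZE = 10000     # assumed placeholder

class CaptionDecoderStub(nn.Module):
    """Minimal LSTM decoder standing in for the paper's captioning model."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, WORD_EMB_SIZE)
        self.lstm = nn.LSTM(WORD_EMB_SIZE, HIDDEN_SIZE, batch_first=True)
        self.out = nn.Linear(HIDDEN_SIZE, VOCAB_SIZE)

    def forward(self, tokens):
        hidden_states, _ = self.lstm(self.embed(tokens))
        return self.out(hidden_states)

model = CaptionDecoderStub()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # learning rate assumed
```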