FIXMYPOSE: Pose Correctional Captioning and Retrieval

Authors: Hyounghun Kim, Abhay Zala, Graham Burri, Mohit Bansal

AAAI 2021, pp. 13161-13170 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present strong cross-attention baseline models (uni/multimodal, RL, multilingual) and also show that our baselines are competitive with other models when evaluated on other image-difference datasets. We also propose new task-specific metrics (object-match, body-part-match, direction-match) and conduct human evaluation for more reliable evaluation, and we demonstrate a large human-model performance gap, suggesting room for promising future work. Finally, to verify the sim-to-real transfer of our FIXMYPOSE dataset, we collect a set of real images and show promising performance on these images. (An illustrative sketch of the match-style metrics appears after the table.)
Researcher Affiliation | Academia | Hyounghun Kim*, Abhay Zala*, Graham Burri, Mohit Bansal; Department of Computer Science, University of North Carolina at Chapel Hill; {hyounghk, aszala, ghburri, mbansal}@cs.unc.edu
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Data and code are available: https://fixmypose-unc.github.io.
Open Datasets | Yes | We introduce a new captioning dataset named FIXMYPOSE to address this need... Data and code are available: https://fixmypose-unc.github.io.
Dataset Splits | Yes | For the pose-correctional-captioning task, we split the dataset into train/val-seen/val-unseen/test-unseen following Anderson et al. (2018b). We assign separate rooms to the val-unseen and test-unseen splits to evaluate the model's ability to generalize to unseen environments. The number of task instances per split is 5,973/562/563/593 (train/val-seen/val-unseen/test-unseen), and the number of descriptions is 5,973/1,686/1,689/1,779. (A sanity check on these counts appears after the table.)
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running experiments.
Software Dependencies | No | The paper mentions software components such as ResNet and the Adam optimizer but does not provide specific version numbers for programming languages, libraries, or other software dependencies.
Experiment Setup | Yes | We use 512 / 256 as the hidden / word embedding size. We use Adam (Kingma and Ba 2015) as the optimizer. See the Appendix in the arXiv full version for details. (A minimal configuration sketch appears after the table.)
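
The object-match, body-part-match, and direction-match metrics quoted under Research Type are named but not defined in this report. The sketch below is a hedged illustration of one plausible keyword-overlap formulation; the vocabularies (OBJECTS, BODY_PARTS, DIRECTIONS) and the F1-style scoring are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch: the paper's object-match / body-part-match / direction-match
# metrics are not specified in this report; the vocabularies and matching
# rule below are illustrative assumptions, not the authors' implementation.

# Hypothetical keyword vocabularies; the real metrics presumably use
# dataset-specific lists of objects, body parts, and direction words.
OBJECTS = {"chair", "table", "window", "rug"}
BODY_PARTS = {"arm", "leg", "knee", "elbow", "hand", "foot"}
DIRECTIONS = {"left", "right", "up", "down", "forward", "backward"}

def match_score(generated: str, reference: str, vocab: set) -> float:
    """F1-style overlap between vocabulary terms mentioned in the two captions."""
    gen = {w for w in generated.lower().split() if w in vocab}
    ref = {w for w in reference.lower().split() if w in vocab}
    if not gen and not ref:
        return 1.0  # neither caption mentions a term from this vocabulary
    if not gen or not ref:
        return 0.0
    overlap = len(gen & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Example usage:
gen = "Move your right arm toward the chair"
ref = "Raise your left arm so it points at the chair"
print(match_score(gen, ref, OBJECTS))     # 1.0 (both mention "chair")
print(match_score(gen, ref, BODY_PARTS))  # 1.0 (both mention "arm")
print(match_score(gen, ref, DIRECTIONS))  # 0.0 ("right" vs. "left")
```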
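
The split counts quoted under Dataset Splits are internally consistent: the description count equals the instance count for train and exactly three times the instance count for each evaluation split (presumably multiple reference descriptions per evaluation instance; that interpretation is an inference from the numbers, not a claim quoted from the paper). A quick check:

```python
# Sanity check on the split sizes quoted above.
instances    = {"train": 5973, "val-seen": 562,  "val-unseen": 563,  "test-unseen": 593}
descriptions = {"train": 5973, "val-seen": 1686, "val-unseen": 1689, "test-unseen": 1779}

for split, n in instances.items():
    refs_per_instance = descriptions[split] / n
    print(f"{split}: {n} instances, {descriptions[split]} descriptions "
          f"({refs_per_instance:.0f} per instance)")
# train: 1 description per instance; all three evaluation splits: 3 per instance.
```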
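
The Experiment Setup row reports only a 512 hidden size, a 256 word-embedding size, and the Adam optimizer. The PyTorch sketch below wires those numbers into a toy caption decoder; the vocabulary size, learning rate, and LSTM decoder choice are placeholder assumptions, since the paper defers the full setup to its appendix.

```python
# Minimal sketch of the reported hyperparameters: 512 hidden size, 256 word
# embedding size, Adam optimizer. VOCAB_SIZE, the learning rate, and the LSTM
# decoder are illustrative assumptions; see the paper's appendix for the
# actual architecture and training details.
import torch
import torch.nn as nn

VOCAB_SIZE = 10_000   # assumption: not reported in the quoted text
HIDDEN = 512          # reported hidden size
EMBED = 256           # reported word embedding size

class CaptionDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED)
        self.rnn = nn.LSTM(EMBED, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB_SIZE)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.out(h)

model = CaptionDecoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr is an assumption
```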