FIXMYPOSE: Pose Correctional Captioning and Retrieval

Authors: Hyounghun Kim, Abhay Zala, Graham Burri, Mohit Bansal

AAAI 2021, pp. 13161-13170 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present strong cross-attention baseline models (uni/multimodal, RL, multilingual) and also show that our baselines are competitive with other models when evaluated on other image-difference datasets. We also propose new task-specific metrics (object-match, body-part-match, direction-match) and conduct human evaluation for more reliable evaluation, and we demonstrate a large human-model performance gap, suggesting room for promising future work. Finally, to verify the sim-to-real transfer of our FIXMYPOSE dataset, we collect a set of real images and show promising performance on these images. (An illustrative sketch of the match-style metrics appears after the table.)
Researcher Affiliation | Academia | Hyounghun Kim*, Abhay Zala*, Graham Burri, Mohit Bansal; Department of Computer Science, University of North Carolina at Chapel Hill; {hyounghk, aszala, ghburri, mbansal}@cs.unc.edu
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Data and code are available: https://fixmypose-unc.github.io.
Open Datasets | Yes | We introduce a new captioning dataset named FIXMYPOSE to address this need... Data and code are available: https://fixmypose-unc.github.io.
Dataset Splits | Yes | For the pose-correctional-captioning task, we split the dataset into train/val-seen/val-unseen/test-unseen following Anderson et al. (2018b). We assign separate rooms to the val-unseen and test-unseen splits to evaluate the model's ability to generalize to unseen environments. The number of task instances per split is 5,973/562/563/593 (train/val-seen/val-unseen/test-unseen), and the number of descriptions is 5,973/1,686/1,689/1,779. (A sanity check on these counts appears after the table.)
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running experiments.
Software Dependencies | No | The paper mentions software components such as ResNet and the Adam optimizer but does not provide specific version numbers for programming languages, libraries, or other software dependencies.
Experiment Setup | Yes | We use 512 / 256 as the hidden / word embedding size. We use Adam (Kingma and Ba 2015) as the optimizer. See the Appendix in the arXiv full version for details. (A minimal configuration sketch appears after the table.)
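
The object-match, body-part-match, and direction-match metrics quoted under Research Type are named but not defined in this report. The sketch below is a hedged illustration of one plausible keyword-overlap formulation; the vocabularies (OBJECTS, BODY_PARTS, DIRECTIONS) and the F1-style scoring are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch: the paper's object-match / body-part-match / direction-match
# metrics are not specified in this report; the vocabularies and matching
# rule below are illustrative assumptions, not the authors' implementation.

# Hypothetical keyword vocabularies; the real metrics presumably use
# dataset-specific lists of objects, body parts, and direction words.
OBJECTS = {"chair", "table", "window", "rug"}
BODY_PARTS = {"arm", "leg", "knee", "elbow", "hand", "foot"}
DIRECTIONS = {"left", "right", "up", "down", "forward", "backward"}

def match_score(generated: str, reference: str, vocab: set) -> float:
    """F1-style overlap between vocabulary terms mentioned in the two captions."""
    gen = {w for w in generated.lower().split() if w in vocab}
    ref = {w for w in reference.lower().split() if w in vocab}
    if not gen and not ref:
        return 1.0  # neither caption mentions a term from this vocabulary
    if not gen or not ref:
        return 0.0
    overlap = len(gen & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Example usage:
gen = "Move your right arm toward the chair"
ref = "Raise your left arm so it points at the chair"
print(match_score(gen, ref, OBJECTS))     # 1.0 (both mention "chair")
print(match_score(gen, ref, BODY_PARTS))  # 1.0 (both mention "arm")
print(match_score(gen, ref, DIRECTIONS))  # 0.0 ("right" vs. "left")
```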
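
The split counts quoted under Dataset Splits are internally consistent: the description count equals the instance count for train and exactly three times the instance count for each evaluation split (presumably multiple reference descriptions per evaluation instance; that interpretation is an inference from the numbers, not a claim quoted from the paper). A quick check:

```python
# Sanity check on the split sizes quoted above.
instances    = {"train": 5973, "val-seen": 562,  "val-unseen": 563,  "test-unseen": 593}
descriptions = {"train": 5973, "val-seen": 1686, "val-unseen": 1689, "test-unseen": 1779}

for split, n in instances.items():
    refs_per_instance = descriptions[split] / n
    print(f"{split}: {n} instances, {descriptions[split]} descriptions "
          f"({refs_per_instance:.0f} per instance)")
# train: 1 description per instance; all three evaluation splits: 3 per instance.
```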
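
The Experiment Setup row reports only a 512 hidden size, a 256 word-embedding size, and the Adam optimizer. The PyTorch sketch below wires those numbers into a toy caption decoder; the vocabulary size, learning rate, and LSTM decoder choice are placeholder assumptions, since the paper defers the full setup to its appendix.

```python
# Minimal sketch of the reported hyperparameters: 512 hidden size, 256 word
# embedding size, Adam optimizer. VOCAB_SIZE, the learning rate, and the LSTM
# decoder are illustrative assumptions; see the paper's appendix for the
# actual architecture and training details.
import torch
import torch.nn as nn

VOCAB_SIZE = 10_000   # assumption: not reported in the quoted text
HIDDEN = 512          # reported hidden size
EMBED = 256           # reported word embedding size

class CaptionDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED)
        self.rnn = nn.LSTM(EMBED, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB_SIZE)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.out(h)

model = CaptionDecoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr is an assumption
```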