Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Inverse Reinforcement Learning with Natural Language Goals

Authors: Li Zhou, Kevin Small | Pages: 11116-11124

AAAI 2021 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our algorithm outperforms multiple baselines by a large margin on a vision-based NL instruction following dataset (Room-2-Room), demonstrating a promising advance in enabling the use of NL instructions in specifying agent goals. We evaluate our model on the Room-2-Room (R2R) dataset (Anderson et al. 2018), a visually-grounded NL navigation task in realistic 3D indoor environments. We evaluate the model performance based on the trajectory success rate. The performance of our algorithm and baselines are shown in Figure 1, Table 1, and Table 2.
Researcher Affiliation	Industry	Li Zhou, Kevin Small Amazon Alexa EMAIL
Pseudocode	Yes	Algorithm 1 Inverse Reinforcement Learning with Natural Language Goals (Lang Goal IRL)
Open Source Code	No	The paper does not provide any statements about releasing code or links to a code repository.
Open Datasets	Yes	We evaluate our model on the Room-2-Room (R2R) dataset (Anderson et al. 2018), a visually-grounded NL navigation task in realistic 3D indoor environments. The dataset contains 7,189 routes sampled from 90 real world indoor environments.
Dataset Splits	Yes	The dataset is split into train (61 environments and 14,025 instructions), seen validation (61 environments same as train set, and 1,020 instructions), unseen validation (11 new environments and 2,349 instructions), and test (18 new environments and 4,173 instructions).
Hardware Specification	No	The paper does not provide specific details about the hardware used for running the experiments (e.g., CPU/GPU models, memory).
Software Dependencies	No	The paper mentions using Soft Actor-Critic (SAC) and various network components like MLP, LSTM, and attention mechanisms, but does not provide specific version numbers for any software dependencies or libraries.
Experiment Setup	No	The main text states: "Appendix B contains details about model architecture and optimization." and "For implementation details of our algorithms and the baselines, please refer to Appendix B." Since Appendix B is not part of the provided text, the specific experimental setup details are not present in the main content.