Active Reasoning in an Open-World Environment

Authors: Manjie Xu, Guangyuan Jiang, Wei Liang, Chi Zhang, Yixin Zhu

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate state-of-the-art Reinforcement Learning (RL) and multimodal question-answering models on Conan. Our observations highlight an intriguing dichotomy: while these cutting-edge models exhibit prowess in addressing low-level, short-term tasks, they struggle with multi-round environmental interactions and high-level abductive reasoning."
Researcher Affiliation | Academia | Manjie Xu (manjietsu@bit.edu.cn) [1], Guangyuan Jiang (jgy@stu.pku.edu.cn) [2], Wei Liang (liangwei@bit.edu.cn) [1, 3], Chi Zhang (zhangchi@bigai.ai) [4], Yixin Zhu (yixin.zhu@pku.edu.cn) [2]. Affiliations: [1] School of Computer Science & Technology, Beijing Institute of Technology; [2] Institute for AI, Peking University; [3] Yangtze Delta Region Academy of Beijing Institute of Technology, Jiaxing, China; [4] National Key Laboratory of General Artificial Intelligence, BIGAI.
Pseudocode | No | The paper describes its methods textually and through diagrams (e.g., Figure 3 illustrating the detective pipeline), but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | https://sites.google.com/view/conan-active-reasoning
Open Datasets | Yes | "Conan produced a corpus comprising 100,000 questions. These were derived from 10,000 unique scenes, generated via the Crafter's scene generator, with each scene stemming from a task executed by a vandal. This resulted in an average generation of 10 questions per scene."
Dataset Splits | Yes | "Table A4: Dataset split and choice distribution. Intent: 71,162 (Train), 9,152 (Test), 8,822 (Val)" (a sketch of the implied split fractions follows this table).
Hardware Specification | Yes | "All models are trained utilizing 8 NVIDIA GeForce RTX 3090 GPUs."
Software Dependencies | No | The paper mentions the Stable Baselines3 library and models such as BERT-Large and DeBERTa, but it does not specify exact version numbers for these software dependencies (e.g., "Stable Baselines3 vX.Y" or "PyTorch 1.9"); a sketch for recording such versions follows this table.
Experiment Setup | Yes | "The explorer is trained using DQN, TRPO, and Recurrent PPO for 10^8 steps, with a buffer size of 10^7 and a batch size of 512. In the case of DQN, training is conducted with ϵ = 0.96. Each episode is capped at a maximum of 500 steps for the explorer." (a hedged training sketch follows this table).
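
The Dataset Splits row reports only raw counts for the Intent category. A minimal sketch, not taken from the paper, of the split fractions those counts imply (roughly 80/10/10):

```python
# Split fractions implied by the Intent counts reported in Table A4.
splits = {"train": 71_162, "test": 9_152, "val": 8_822}
total = sum(splits.values())  # 89,136 Intent questions in total
for name, count in splits.items():
    print(f"{name}: {count} ({count / total:.1%})")
# train: 71162 (79.8%), test: 9152 (10.3%), val: 8822 (9.9%)
```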
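
Since the Software Dependencies row notes that no version numbers are given, one way to document the environment when reproducing the experiments is to log the installed versions directly. The package names below are the usual PyPI distributions and are an assumption, not something the paper specifies:

```python
# Record installed versions of the libraries the paper names (package names assumed).
from importlib.metadata import PackageNotFoundError, version

for pkg in ("stable-baselines3", "torch", "transformers", "gymnasium"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```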
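
For the Experiment Setup row, a minimal sketch of how the reported explorer hyperparameters map onto Stable-Baselines3, which the paper states it uses. The environment ID "Conan-v0" is hypothetical, and reading "ϵ = 0.96" as the initial exploration rate is an assumption; the paper does not say which parameter of the ϵ schedule it refers to:

```python
import gymnasium as gym
from gymnasium.wrappers import TimeLimit
from stable_baselines3 import DQN

env = gym.make("Conan-v0")                   # hypothetical ID; the Conan env registration is not specified
env = TimeLimit(env, max_episode_steps=500)  # episodes capped at 500 steps, as reported

model = DQN(
    "CnnPolicy",
    env,
    buffer_size=10_000_000,                  # replay buffer of 10^7 transitions
    batch_size=512,
    exploration_initial_eps=0.96,            # assumed reading of "ϵ = 0.96"
    verbose=1,
)
model.learn(total_timesteps=100_000_000)     # 10^8 environment steps
# TRPO and Recurrent PPO counterparts live in sb3-contrib (sb3_contrib.TRPO,
# sb3_contrib.RecurrentPPO) and would be configured with the same limits.
```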