Active Reasoning in an Open-World Environment

Authors: Manjie Xu, Guangyuan Jiang, Wei Liang, Chi Zhang, Yixin Zhu

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate state-of-the-art Reinforcement Learning (RL) and multimodal question-answering models on Conan. Our observations highlight an intriguing dichotomy: while these cutting-edge models exhibit prowess in addressing low-level, short-term tasks, they struggle with multi-round environmental interactions and high-level abductive reasoning."
Researcher Affiliation | Academia | Manjie Xu (manjietsu@bit.edu.cn) [1], Guangyuan Jiang (jgy@stu.pku.edu.cn) [2], Wei Liang (liangwei@bit.edu.cn) [1, 3], Chi Zhang (zhangchi@bigai.ai) [4], Yixin Zhu (yixin.zhu@pku.edu.cn) [2]. Affiliations: [1] School of Computer Science & Technology, Beijing Institute of Technology; [2] Institute for AI, Peking University; [3] Yangtze Delta Region Academy of Beijing Institute of Technology, Jiaxing, China; [4] National Key Laboratory of General Artificial Intelligence, BIGAI.
Pseudocode | No | The paper describes its methods textually and through diagrams (e.g., Figure 3 illustrating the detective pipeline), but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | https://sites.google.com/view/conan-active-reasoning
Open Datasets | Yes | "Conan produced a corpus comprising 100,000 questions. These were derived from 10,000 unique scenes, generated via the Crafter's scene generator, with each scene stemming from a task executed by a vandal. This resulted in an average generation of 10 questions per scene."
Dataset Splits | Yes | "Table A4: Dataset split and choice distribution. Intent: 71,162 (Train), 9,152 (Test), 8,822 (Val)" (a sketch of the implied split fractions follows this table).
Hardware Specification | Yes | "All models are trained utilizing 8 NVIDIA GeForce RTX 3090 GPUs."
Software Dependencies | No | The paper mentions the Stable Baselines3 library and models such as BERT-Large and DeBERTa, but it does not specify exact version numbers for these software dependencies (e.g., "Stable Baselines3 vX.Y" or "PyTorch 1.9"); a sketch for recording such versions follows this table.
Experiment Setup | Yes | "The explorer is trained using DQN, TRPO, and Recurrent PPO for 10^8 steps, with a buffer size of 10^7 and a batch size of 512. In the case of DQN, training is conducted with ϵ = 0.96. Each episode is capped at a maximum of 500 steps for the explorer." (a hedged training sketch follows this table).
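
The Dataset Splits row reports only raw counts for the Intent category. A minimal sketch, not taken from the paper, of the split fractions those counts imply (roughly 80/10/10):

```python
# Split fractions implied by the Intent counts reported in Table A4.
splits = {"train": 71_162, "test": 9_152, "val": 8_822}
total = sum(splits.values())  # 89,136 Intent questions in total
for name, count in splits.items():
    print(f"{name}: {count} ({count / total:.1%})")
# train: 71162 (79.8%), test: 9152 (10.3%), val: 8822 (9.9%)
```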
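
Since the Software Dependencies row notes that no version numbers are given, one way to document the environment when reproducing the experiments is to log the installed versions directly. The package names below are the usual PyPI distributions and are an assumption, not something the paper specifies:

```python
# Record installed versions of the libraries the paper names (package names assumed).
from importlib.metadata import PackageNotFoundError, version

for pkg in ("stable-baselines3", "torch", "transformers", "gymnasium"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```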
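
For the Experiment Setup row, a minimal sketch of how the reported explorer hyperparameters map onto Stable-Baselines3, which the paper states it uses. The environment ID "Conan-v0" is hypothetical, and reading "ϵ = 0.96" as the initial exploration rate is an assumption; the paper does not say which parameter of the ϵ schedule it refers to:

```python
import gymnasium as gym
from gymnasium.wrappers import TimeLimit
from stable_baselines3 import DQN

env = gym.make("Conan-v0")                   # hypothetical ID; the Conan env registration is not specified
env = TimeLimit(env, max_episode_steps=500)  # episodes capped at 500 steps, as reported

model = DQN(
    "CnnPolicy",
    env,
    buffer_size=10_000_000,                  # replay buffer of 10^7 transitions
    batch_size=512,
    exploration_initial_eps=0.96,            # assumed reading of "ϵ = 0.96"
    verbose=1,
)
model.learn(total_timesteps=100_000_000)     # 10^8 environment steps
# TRPO and Recurrent PPO counterparts live in sb3-contrib (sb3_contrib.TRPO,
# sb3_contrib.RecurrentPPO) and would be configured with the same limits.
```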