Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Mitigating Value Hallucination in Dyna-Style Planning via Multistep Predecessor Models

Authors: Farzane Aminmansour, Taher Jafferjee, Ehsan Imani, Erin J. Talvitie, Michael Bowling, Martha White

JAIR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experimental results provide evidence for the HVH, and suggest that using predecessor models with multi-step updates is a promising direction toward developing Dyna algorithms that are more robust to model error. We introduce an environment to test the hypothesis and show that previous variants of Dyna fail when the model is imperfect whereas our algorithm does not. We further test the algorithms on three classic benchmark environments and find even in these environments the same behavior persists.
Researcher Affiliation | Academia | Farzane Aminmansour, Taher Jafferjee, Ehsan Imani (Dept. of Computing Science & the Alberta Machine Intelligence Institute, University of Alberta, Canada); Erin J. Talvitie (Dept. of Computer Science, Harvey Mudd College, USA); Michael Bowling, Martha White (Dept. of Computing Science & Amii, University of Alberta, Canada)
Pseudocode | Yes | Algorithm 1 Original Dyna-Q ... Algorithm 2 Prioritised-Dyna with Multi-step Updates ... Algorithm 3 Planning Update ... Algorithm 4 Pop Tuple and Screen ... Algorithm 5 Is On Policy
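The listed algorithms appear in the paper as pseudocode only. As a rough illustration of the tabular Dyna-Q loop that Algorithm 1 refers to, one step of direct learning followed by planning updates from a learned model might be sketched as follows. This is a generic, hypothetical sketch, not the authors' prioritised multi-step variant; all names are illustrative:

```python
import random
from collections import defaultdict

def dyna_q_step(Q, model, s, a, r, s_next, actions,
                alpha=0.1, gamma=0.95, n_planning=1):
    """One direct Q-learning update plus n_planning simulated updates."""
    # Direct RL update from the real transition (s, a, r, s_next)
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions)
                          - Q[(s, a)])
    # Record the transition in a deterministic tabular model
    model[(s, a)] = (r, s_next)
    # Planning: replay simulated transitions from previously visited pairs
    for _ in range(n_planning):
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps_next, b)] for b in actions)
                                - Q[(ps, pa)])
    return Q
```

A caller would typically hold `Q = defaultdict(float)` and `model = {}`, invoking `dyna_q_step` once per environment step (matching the paper's N = 1 planning updates per step).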
Open Source Code | No | The paper mentions "Pygame Learning Environment. https://github.com/ntasfi/PyGame-Learning-Environment" but this is a third-party tool used for experiments, not the authors' own source code for their methodology. There is no explicit statement or link provided by the authors for their own code release.
Open Datasets | Yes | Our experiments were conducted on three benchmarks: Cartpole (Brockman, Cheung, Pettersson, Schneider, Schulman, Tang, & Zaremba, 2016), Puddleworld (Degris, White, & Sutton, 2012), and Catcher (Tasfi, 2016).
Dataset Splits | No | To learn the offline model, following the method of (Oh, Guo, Lee, Lewis, & Singh, 2015), we collected 100,000 training samples by executing a pre-trained agent on the environment with ϵ = 0.5. This describes data collection for the model, but not explicit train/test/validation splits for the main experiments or model evaluation.
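The quoted collection procedure (ϵ-greedy rollouts from a pre-trained agent) might look roughly like the sketch below. Here `env` and `greedy_action` are hypothetical gym-style interfaces introduced only for illustration, not the authors' code:

```python
import random

def collect_samples(env, greedy_action, n_samples=100_000, epsilon=0.5):
    """Collect (s, a, r, s') transitions with an epsilon-greedy policy
    built around a pre-trained greedy action selector."""
    samples = []
    s = env.reset()
    while len(samples) < n_samples:
        # With probability epsilon act randomly, otherwise act greedily
        if random.random() < epsilon:
            a = env.sample_action()
        else:
            a = greedy_action(s)
        s_next, r, done = env.step(a)
        samples.append((s, a, r, s_next))
        s = env.reset() if done else s_next
    return samples
```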
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, or memory specifications) used for running its experiments.
Software Dependencies | No | The paper mentions using Q-learning and DQN algorithms, as well as the Pygame Learning Environment, but it does not specify any version numbers for these or other software libraries/dependencies.
Experiment Setup | Yes | The value of α has been selected by sweeping over a set of {0.1, 0.25, 0.5, 0.75, 0.05, 0.125} and β by sweeping over {0.0, 0.15, 0.33, 0.50, 0.66, 0.75, 0.90, 1.0}. All agents use N = 1 planning updates per step, where each planning update iterates over all actions. We trained a network with 200 hidden units to convergence using the DQN algorithm and froze its weights. We initialised weights of the linear learner using samples from N(0, 1).
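The quoted setup (grid sweeps over α and β, linear weights drawn from N(0, 1)) could be mirrored in a short sketch like the one below. The `evaluate` callback is a hypothetical scoring function standing in for a full training run; only the grids themselves come from the paper:

```python
import itertools
import random

# Hyperparameter grids as quoted in the paper's setup
ALPHAS = [0.1, 0.25, 0.5, 0.75, 0.05, 0.125]
BETAS = [0.0, 0.15, 0.33, 0.50, 0.66, 0.75, 0.90, 1.0]

def init_linear_weights(n_features, rng=random):
    # Linear learner weights initialised from N(0, 1), per the paper
    return [rng.gauss(0.0, 1.0) for _ in range(n_features)]

def sweep(evaluate):
    """Return the (alpha, beta) pair maximising the hypothetical
    evaluate(alpha, beta) score over the full grid."""
    return max(itertools.product(ALPHAS, BETAS),
               key=lambda ab: evaluate(*ab))
```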