Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Rejecting Hallucinated State Targets during Planning

Authors: Harry Zhao, Tristan Sylvain, Romain Laroche, Doina Precup, Yoshua Bengio

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our controlled experiments show significant reductions in delusional behaviors and performance improvements for various kinds of existing agents.
Researcher Affiliation Collaboration Mila (Quebec AI Institute), McGill University, Wayve, RBC Borealis, Google DeepMind, Université de Montréal.
Pseudocode No The paper describes methods using mathematical equations and textual explanations, but it does not include any clearly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code Yes The results presented in the experiments are fully reproducible with the source code published at https://github.com/mila-iqia/delusions.
Open Datasets Yes To provide intuition about these concepts, we use the MiniGrid platform to create a set of fully-observable environments, minimizing extraneous factors to focus on the targets (Chevalier-Boisvert et al., 2023). We call this environment Sword Shield Monster (SSM for short)... The second environment employed is RandDistShift, abbreviated as RDS. RDS was originally proposed in Zhao et al. (2021) as a variant of the counterparts in the MiniGrid BabyAI platform (Chevalier-Boisvert et al., 2023), and then later used as the experimental backbone in Zhao et al. (2024).
Dataset Splits Yes For each seed run on SSM, we sample and preserve 50 training tasks of size 12×12 and difficulty δ = 0.4. For each episode, one of the 50 tasks is sampled for training. Agents are trained for 1.5×10⁶ interactions in total. ... The evaluation tasks (targeting systematic generalization) are sampled from a gradient of OOD difficulties 0.25, 0.35, 0.45 and 0.55. ... by sampling 20 task instances from each of the 4 OOD difficulties, and combining the performance across all 80 episodes, which have a mean difficulty matching the training tasks.
Hardware Specification No The paper mentions 'computational resources' from Mila and McGill University, and notes 'Limited computational resources prevented our extended experiments' in the context of DreamerV2, but it does not provide specific hardware details like GPU/CPU models or memory specifications.
Software Dependencies No The paper mentions that the evaluator is 'implemented in PyTorch', but it does not specify the version number of PyTorch or any other software dependencies.
Experiment Setup Yes we use a simple and unified implementation of our evaluator (a 3-layer ReLU-activated MLP with output bin T = 16 and (E+P+G)) for 8 sets of experiments... For each seed run on SSM, we sample and preserve 50 training tasks of size 12×12 and difficulty δ = 0.4. For each episode, one of the 50 tasks is sampled for training. Agents are trained for 1.5×10⁶ interactions in total. ... We increased the base population size of each generation to 512 and lengthened the number of iterations to 10. ... The threshold for 1-feasibility based rejections is set to 0.05...
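The evaluator quoted above (a 3-layer ReLU-activated MLP producing a distribution over T = 16 output bins) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the hidden width (256), input feature dimensionality (64), and weight initialization are assumptions; only the depth, activation, and T = 16 output bins come from the quoted setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # Elementwise ReLU activation
    return np.maximum(x, 0.0)

def softmax(x):
    # Numerically stable softmax over the last axis
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class BinnedEvaluator:
    """Sketch: 3 ReLU-activated hidden layers -> distribution over T bins.

    feature_dim and hidden are illustrative assumptions; T = 16 matches
    the quoted setup.
    """

    def __init__(self, feature_dim=64, hidden=256, T=16):
        dims = [feature_dim, hidden, hidden, hidden, T]
        self.weights = [rng.normal(0.0, 0.05, size=(i, o))
                        for i, o in zip(dims, dims[1:])]
        self.biases = [np.zeros(o) for o in dims[1:]]

    def forward(self, x):
        # Three ReLU layers, then a softmax head over the T output bins
        for W, b in zip(self.weights[:-1], self.biases[:-1]):
            x = relu(x @ W + b)
        return softmax(x @ self.weights[-1] + self.biases[-1])

# Placeholder input standing in for the (E+P+G) feature concatenation
# described in the paper; its size (64) is an assumption.
evaluator = BinnedEvaluator(feature_dim=64)
probs = evaluator.forward(rng.normal(size=(1, 64)))
print(probs.shape)  # (1, 16)
```

The quoted rejection rule (threshold 0.05 on 1-feasibility) would then be applied to a feasibility estimate derived from this binned output distribution; how that estimate is computed is specified in the paper, not here.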