Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Beyond the Known: Decision Making with Counterfactual Reasoning Decision Transformer

Authors: Minh Hoang Nguyen, Linh Le Pham Van, Thommen George Karimpanal, Sunil Gupta, Hung Le

IJCAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
------------------------ | ------ | ------------
Research Type | Experimental | Experiments across Atari and D4RL benchmarks, including scenarios with limited data and altered dynamics, demonstrate that CRDT outperforms conventional DT approaches.
Researcher Affiliation | Academia | 1 Applied AI Initiative, Deakin University, Australia; 2 School of IT, Deakin University, Australia
Pseudocode | No | The paper describes its methodology in natural language and mathematical formulations (Section 4, 'Methodology') but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Source code: https://github.com/mhngu23/Beyond-the-Known-Decision-Making-with-Counterfactual-Reasoning-Decision-Transformer
Open Datasets | Yes | We conducted experiments on both continuous action space environments (Locomotion, Ant, and Maze2d [Fu et al., 2020]) and discrete action space environments (Atari [Bellemare et al., 2013])... The D4RL dataset serves as a standard benchmark for offline RL...
Dataset Splits | Yes | The results are over 5 seeds. For each seed, evaluation is conducted over 100 episodes. The X-axis represents the percentage of the dataset used in the experiment. ...Table 2: Performance comparison on Atari games (1% DQN-replay dataset).
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU models, or memory specifications.
Software Dependencies | No | The paper does not explicitly mention any specific software dependencies or their version numbers (e.g., Python, PyTorch, TensorFlow versions) that would be needed to replicate the experiments.
Experiment Setup | Yes | Results are averaged over 5 seeds, with evaluation conducted over 100 episodes per seed. ...We report the human-normalized scores over 3 seeds. For each seed, evaluation is conducted over 10 episodes. ...At each training step, we sample equal batches of trajectories from both the environment dataset Denv and the counterfactual experience buffer Dcrdt.