Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Beyond the Known: Decision Making with Counterfactual Reasoning Decision Transformer

Authors: Minh Hoang Nguyen, Linh Le Pham Van, Thommen George Karimpanal, Sunil Gupta, Hung Le

IJCAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
------------------------ | ------ | ------------
Research Type | Experimental | Experiments across Atari and D4RL benchmarks, including scenarios with limited data and altered dynamics, demonstrate that CRDT outperforms conventional DT approaches.
Researcher Affiliation | Academia | 1 Applied AI Initiative, Deakin University, Australia; 2 School of IT, Deakin University, Australia
Pseudocode | No | The paper describes its methodology in natural language and mathematical formulations (Section 4, 'Methodology') but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Source code: https://github.com/mhngu23/Beyond-the-Known-Decision-Making-with-Counterfactual-Reasoning-Decision-Transformer
Open Datasets | Yes | We conducted experiments on both continuous action space environments (Locomotion, Ant, and Maze2d [Fu et al., 2020]) and discrete action space environments (Atari [Bellemare et al., 2013])... The D4RL dataset serves as a standard benchmark for offline RL...
Dataset Splits | Yes | The results are over 5 seeds. For each seed, evaluation is conducted over 100 episodes. The X-axis represents the percentage of the dataset used in the experiment. ...Table 2: Performance comparison on Atari games (1% DQN-replay dataset).
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU models, or memory specifications.
Software Dependencies | No | The paper does not explicitly mention any specific software dependencies or their version numbers (e.g., Python, PyTorch, TensorFlow versions) that would be needed to replicate the experiments.
Experiment Setup | Yes | Results are averaged over 5 seeds, with evaluation conducted over 100 episodes per seed. ...We report the human-normalized scores over 3 seeds. For each seed, evaluation is conducted over 10 episodes. ...At each training step, we sample equal batches of trajectories from both the environment dataset Denv and the counterfactual experience buffer Dcrdt.