Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Beyond the Known: Decision Making with Counterfactual Reasoning Decision Transformer
Authors: Minh Hoang Nguyen, Linh Le Pham Van, Thommen George Karimpanal, Sunil Gupta, Hung Le
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across Atari and D4RL benchmarks, including scenarios with limited data and altered dynamics, demonstrate that CRDT outperforms conventional DT approaches. |
| Researcher Affiliation | Academia | ¹Applied AI Initiative, Deakin University, Australia; ²School of IT, Deakin University, Australia |
| Pseudocode | No | The paper describes its methodology in natural language and mathematical formulations (Section 4 'Methodology') but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Source code: https://github.com/mhngu23/Beyond-the-Known-Decision-Making-with-Counterfactual-Reasoning-Decision-Transformer |
| Open Datasets | Yes | We conducted experiments on both continuous action space environments (Locomotion, Ant, and Maze2d [Fu et al., 2020]) and discrete action space environments (Atari [Bellemare et al., 2013])... The D4RL dataset serves as a standard benchmark for offline RL... |
| Dataset Splits | Yes | The results are over 5 seeds. For each seed, evaluation is conducted over 100 episodes. The X-axis represents the percentage of the dataset used in the experiment. ... Table 2: Performance comparison on Atari games (1% DQN-replay dataset). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU models, or memory specifications. |
| Software Dependencies | No | The paper does not explicitly mention any specific software dependencies or their version numbers (e.g., Python, PyTorch, TensorFlow versions) that would be needed to replicate the experiment. |
| Experiment Setup | Yes | Results are averaged over 5 seeds, with evaluation conducted over 100 episodes per seed. ... We report the human-normalized scores over 3 seeds. For each seed, evaluation is conducted over 10 episodes. ... At each training step, we sample equal batches of trajectories from both the environment dataset D_env and the counterfactual experience buffer D_crdt. |
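
The mixed-batch sampling quoted in the Experiment Setup row can be illustrated with a minimal sketch. It is not taken from the authors' repository. It assumes that "equal batches" means the same number of trajectories is drawn from each source, that sampling is uniform with replacement, and that both sources are non-empty. The function name `sample_mixed_batch` and its parameters are hypothetical.

```python
import random


def sample_mixed_batch(env_dataset, cf_buffer, batch_size, rng=None):
    """Draw `batch_size` trajectories from each source and concatenate them.

    Assumptions (not confirmed by the paper excerpt): uniform sampling with
    replacement, and an equal count from each source per training step.
    """
    rng = rng or random.Random()
    # Trajectories from the original environment dataset (D_env).
    env_batch = rng.choices(env_dataset, k=batch_size)
    # Trajectories from the counterfactual experience buffer (D_crdt).
    cf_batch = rng.choices(cf_buffer, k=batch_size)
    return env_batch + cf_batch


# Usage with placeholder trajectory labels instead of real trajectories.
rng = random.Random(0)
batch = sample_mixed_batch(["e1", "e2", "e3"], ["c1", "c2"], 4, rng)
```

Under these assumptions, each training step sees `2 * batch_size` trajectories, half of them counterfactual, regardless of how large either source is.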