Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Confounding Robust Deep Reinforcement Learning: A Causal Approach

Authors: Mingxuan Li, Junzhe Zhang, Elias Bareinboim

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We apply our method to twelve confounded Atari games, and find that it consistently dominates the standard DQN in all games where the observed input to the behavioral and target policies mismatch and unobserved confounders exist. In this section, we aim to demonstrate the robustness and performance improvement of our proposed Causal-DQN under confounded settings. For a comprehensive evaluation of Causal-DQN, we choose twelve popular Atari games from the Gymnasium benchmark [100] and design the corresponding confounded versions.
Researcher Affiliation	Academia	Mingxuan Li¹, Junzhe Zhang², Elias Bareinboim¹. ¹Columbia University, ²Syracuse University. ¹EMAIL, EMAIL
Pseudocode	Yes	Algorithm 1: Causal Deep Q-Learning (Causal-DQN); Algorithm 2: Causal Deep Q-Learning (Causal-DQN)
Open Source Code	No	NeurIPS Paper Checklist, Question 5: 'Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?' Answer: [Yes]. Justification: 'Our experiments are based on a open benchmark (Atari from Gymnasium environments). And the detailed parameters and setups for generating confounded Atari games are also reported in Sec. 4. The demonstrator generating the confounded data is also an open sourced model, see https://github.com/eloialonso/diamond.' This justification refers to the open-source nature of the components used (benchmark, demonstrator model) rather than providing a clear statement or link for the authors' own implementation code of Causal-DQN.
Open Datasets	Yes	For a comprehensive evaluation of Causal-DQN, we choose twelve popular Atari games from the Gymnasium benchmark [100] and design the corresponding confounded versions.
Dataset Splits	Yes	For each game, we train the agent for 1 million environment steps. We use 20 parallel environments to collect samples. At each parallel environment step, a minibatch is sampled to train the agents, equivalent to an update frequency of 20. We use a batch size of 512, a replay buffer of 100K in size, and a learning rate of 5e-4 to accelerate convergence. Other hyperparameters are the same as in [61]. All results presented in this section are evaluation performances where we test each trained agent in the Atari game with masked observations. Curves in Fig. 8 are generated by evaluating the agent periodically in a separate evaluation environment, not from training returns.
Hardware Specification	Yes	To train our model, we use an H100 GPU. On average, for each game and each seed, it takes around 2 hrs and a RAM space of less than 2 GB for using the diamond demonstrator [2]. While it takes up to 8 hrs for using the sebulba demonstrator [30] from Clean RL [33].
Software Dependencies	No	The paper mentions 'Clean RL package [33]' but does not provide a specific version number for this package or any other key software libraries (e.g., Python, PyTorch, CUDA versions).
Experiment Setup	Yes	For each game, we train the agent for 1 million environment steps. We use 20 parallel environments to collect samples. At each parallel environment step, a minibatch is sampled to train the agents, equivalent to an update frequency of 20. We use a batch size of 512, a replay buffer of 100K in size, and a learning rate of 5e-4 to accelerate convergence. Other hyperparameters are the same as in [61]. For all CNN based DQN networks in this work, we adopt the nature DQN architecture introduced by Mnih et al. [61]. The network comprises three convolutional layers followed by two fully connected layers, outputting Q-values for each discrete action. While for LSTM based ones, we only replace the second to last linear layer in nature DQN with an LSTM cell. For both the linear layers and LSTM cells, we use a hidden dimension of 512. To mitigate overestimation bias in Q-learning, we further incorporate the Double DQN modification [101]... We also use epsilon greedy for exploration as the standard DQN algorithm. Every T_target steps, update θ⁻ ← θ.
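
The Experiment Setup row can be restated as a small configuration sketch, given here only as an illustration of how the quoted values relate to each other. The class name, field names, and helper functions are hypothetical and do not come from the paper. The update count assumes the 1 million environment steps are summed across the 20 parallel environments, which the quoted text does not state explicitly. The target function is the generic Double DQN target with a caller-supplied discount factor, not the Causal-DQN update, which this table does not spell out.

```python
import random
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ExperimentSetup:
    """Values quoted in the Experiment Setup row; the field names are hypothetical."""
    total_env_steps: int = 1_000_000
    num_parallel_envs: int = 20
    batch_size: int = 512
    replay_buffer_size: int = 100_000
    learning_rate: float = 5e-4
    hidden_dim: int = 512

    @property
    def gradient_updates(self) -> int:
        # Assumption: one minibatch update per parallel step, and the
        # 1M steps are counted across all parallel environments.
        return self.total_env_steps // self.num_parallel_envs


def greedy_action(q_values: Sequence[float]) -> int:
    """Index of the largest Q-value (first index on ties)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy(q_values: Sequence[float], epsilon: float,
                   rng: random.Random) -> int:
    """With probability epsilon pick a uniform random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)


def double_dqn_target(reward: float, done: bool,
                      q_online_next: Sequence[float],
                      q_target_next: Sequence[float],
                      gamma: float) -> float:
    """Generic Double DQN target: the online network selects the next
    action and the target network evaluates it."""
    if done:
        return reward
    best_action = greedy_action(q_online_next)
    return reward + gamma * q_target_next[best_action]


setup = ExperimentSetup()
print(setup.gradient_updates)
```

Under the stated assumption, 1 million environment steps with 20 parallel environments correspond to 50,000 minibatch updates of 512 samples each.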