Counterfactual Credit Assignment in Model-Free Reinforcement Learning
Authors: Thomas Mesnard, Theophane Weber, Fabio Viola, Shantanu Thakoor, Alaa Saade, Anna Harutyunyan, Will Dabney, Thomas S Stepleton, Nicolas Heess, Arthur Guez, Eric Moulines, Marcus Hutter, Lars Buesing, Remi Munos
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4. Numerical experiments Given its guarantees on lower variance and unbiasedness, we run our experiments on the single-action version of CCA-PG and leave the all-action version for future work. We first investigate a bandit-with-feedback task, then a task that requires short- and long-term credit assignment (i.e. Key-to-Door), and finally an interleaved multi-task setup where each episode is composed of randomly sampled and interleaved tasks. All results for Key-to-Door and interleaved multi-task are reported as median performances over 10 seeds with quartiles represented by a shaded area (a sketch of this aggregation follows the table). |
| Researcher Affiliation | Collaboration | 1 DeepMind; 2 INRIA XPOP, CMAP, École Polytechnique, Palaiseau, France. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statement or link indicating that the source code for the methodology is openly available. |
| Open Datasets | No | The paper describes custom environments used for experiments (e.g., 'Bandit with Feedback', 'Key-to-Door environments', 'Task Interleaving') but does not provide concrete access information (link, DOI, or specific citation with authors/year for dataset download) for them to be considered publicly available datasets. |
| Dataset Splits | No | The paper describes simulated environments and agent interactions within them, but does not provide specific details on training, validation, or test dataset splits (e.g., percentages or sample counts for static datasets) as would be typical for supervised learning tasks. Experiments are reported based on runs and seeds rather than explicit data splits. |
| Hardware Specification | No | The paper does not provide any specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | Our implementation uses a discount factor of γ = 0.999. The agent receives observations from the environment through a 4-channel pixel representation (84x84). Both the agent network and the hindsight network use a 256-unit LSTM. We use the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 10⁻⁴. The critic (value network) has a learning rate of 10⁻³. We use a batch size of 32. For CCA-PG, the independence loss has a weight λ_IM = 100 for all experiments. The hindsight baseline loss has a weight λ_hs = 0.1 for all experiments. The hindsight predictor loss has a weight λ_sup = 10 for all experiments. These values are collected in the config sketch below the table. |
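
For quick reference, the following is a minimal sketch that collects the hyperparameters quoted in the Experiment Setup row into a single Python configuration. The key names are illustrative choices made here; only the values come from the paper, which does not release code confirming any particular structure.

```python
# Hyperparameters quoted in the paper's experiment setup, gathered into one
# place. Key names are illustrative assumptions; values are from the paper.
CCA_PG_CONFIG = {
    "discount": 0.999,                   # discount factor gamma
    "observation_shape": (84, 84, 4),    # 4-channel 84x84 pixel observations
    "lstm_units": 256,                   # agent and hindsight networks
    "optimizer": "adam",                 # Adam (Kingma & Ba, 2014)
    "policy_learning_rate": 1e-4,
    "critic_learning_rate": 1e-3,
    "batch_size": 32,
    "lambda_independence": 100.0,        # independence loss weight (λ_IM)
    "lambda_hindsight_baseline": 0.1,    # hindsight baseline loss weight (λ_hs)
    "lambda_hindsight_predictor": 10.0,  # hindsight predictor loss weight (λ_sup)
}
```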
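
The reporting protocol for Key-to-Door and Task Interleaving (median over 10 seeds, quartiles drawn as a shaded band) could be reproduced with an aggregation like the sketch below. The function name and the `[num_seeds, num_steps]` array layout are assumptions made here for illustration, not details taken from the paper.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_median_with_quartiles(steps, returns, label):
    """Plot the per-step median over seeds with the inter-quartile range shaded.

    Args:
      steps: 1-D array of environment steps, shape [num_steps].
      returns: 2-D array of episode returns, shape [num_seeds, num_steps].
      label: curve label for the legend.
    """
    median = np.median(returns, axis=0)
    q25, q75 = np.percentile(returns, [25, 75], axis=0)
    plt.plot(steps, median, label=label)
    plt.fill_between(steps, q25, q75, alpha=0.3)  # shaded quartile band
```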