Counterfactual Credit Assignment in Model-Free Reinforcement Learning
Authors: Thomas Mesnard, Theophane Weber, Fabio Viola, Shantanu Thakoor, Alaa Saade, Anna Harutyunyan, Will Dabney, Thomas S Stepleton, Nicolas Heess, Arthur Guez, Eric Moulines, Marcus Hutter, Lars Buesing, Remi Munos
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4. Numerical experiments Given its guarantees on lower variance and unbiasedness, we run our experiments on the single-action version of CCA-PG and leave the all-action version for future work. We first investigate a bandit-with-feedback task, then a task that requires short- and long-term credit assignment (i.e. Key-to-Door), and finally an interleaved multi-task setup where each episode is composed of randomly sampled and interleaved tasks. All results for Key-to-Door and interleaved multi-task are reported as median performances over 10 seeds with quartiles represented by a shaded area (a sketch of this aggregation follows the table). |
| Researcher Affiliation | Collaboration | 1 DeepMind; 2 INRIA XPOP, CMAP, École Polytechnique, Palaiseau, France. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statement or link indicating that the source code for the methodology is openly available. |
| Open Datasets | No | The paper describes custom environments used for experiments (e.g., 'Bandit with Feedback', 'Key-to-Door environments', 'Task Interleaving') but does not provide concrete access information (link, DOI, or specific citation with authors/year for dataset download) for them to be considered publicly available datasets. |
| Dataset Splits | No | The paper describes simulated environments and agent interactions within them, but does not provide specific details on training, validation, or test dataset splits (e.g., percentages or sample counts for static datasets) as would be typical for supervised learning tasks. Experiments are reported based on runs and seeds rather than explicit data splits. |
| Hardware Specification | No | The paper does not provide any specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | Our implementation uses a discount factor of γ = 0.999. The agent receives observations from the environment through a 4-channel pixel representation (84x84). Both the agent network and the hindsight network use a 256-unit LSTM. We use the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 10⁻⁴. The critic (value network) has a learning rate of 10⁻³. We use a batch size of 32. For CCA-PG, the independence loss has a weight λ_IM = 100 for all experiments. The hindsight baseline loss has a weight λ_hs = 0.1 for all experiments. The hindsight predictor loss has a weight λ_sup = 10 for all experiments. These values are collected in the config sketch below the table. |
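
For quick reference, the following is a minimal sketch that collects the hyperparameters quoted in the Experiment Setup row into a single Python configuration. The key names are illustrative choices made here; only the values come from the paper, which does not release code confirming any particular structure.

```python
# Hyperparameters quoted in the paper's experiment setup, gathered into one
# place. Key names are illustrative assumptions; values are from the paper.
CCA_PG_CONFIG = {
    "discount": 0.999,                   # discount factor gamma
    "observation_shape": (84, 84, 4),    # 4-channel 84x84 pixel observations
    "lstm_units": 256,                   # agent and hindsight networks
    "optimizer": "adam",                 # Adam (Kingma & Ba, 2014)
    "policy_learning_rate": 1e-4,
    "critic_learning_rate": 1e-3,
    "batch_size": 32,
    "lambda_independence": 100.0,        # independence loss weight (λ_IM)
    "lambda_hindsight_baseline": 0.1,    # hindsight baseline loss weight (λ_hs)
    "lambda_hindsight_predictor": 10.0,  # hindsight predictor loss weight (λ_sup)
}
```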
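
The reporting protocol for Key-to-Door and Task Interleaving (median over 10 seeds, quartiles drawn as a shaded band) could be reproduced with an aggregation like the sketch below. The function name and the `[num_seeds, num_steps]` array layout are assumptions made here for illustration, not details taken from the paper.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_median_with_quartiles(steps, returns, label):
    """Plot the per-step median over seeds with the inter-quartile range shaded.

    Args:
      steps: 1-D array of environment steps, shape [num_steps].
      returns: 2-D array of episode returns, shape [num_seeds, num_steps].
      label: curve label for the legend.
    """
    median = np.median(returns, axis=0)
    q25, q75 = np.percentile(returns, [25, 75], axis=0)
    plt.plot(steps, median, label=label)
    plt.fill_between(steps, q25, q75, alpha=0.3)  # shaded quartile band
```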