A Comparative Analysis of Expected and Distributional Reinforcement Learning
Authors: Clare Lyle, Marc G. Bellemare, Pablo Samuel Castro
AAAI 2019, pp. 4504-4511 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper we begin the investigation into this fundamental question by analyzing the differences in the tabular, linear approximation, and non-linear approximation settings. We prove that in many realizations of the tabular and linear approximation settings, distributional RL behaves exactly the same as expected RL. In cases where the two methods behave differently, distributional RL can in fact hurt performance when it does not induce identical behaviour. We then continue with an empirical analysis comparing distributional and expected RL methods in control settings with non-linear approximators to tease apart where the improvements from distributional RL methods are coming from. |
| Researcher Affiliation | Collaboration | ¹University of Oxford (work done while at Google Brain); ²Google Brain |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statements or links regarding open-source code for the described methodology. |
| Open Datasets | No | The paper mentions 'Atari 2600 games', 'Cart Pole', 'Acrobot', '12x12 gridworld environment', and '3-state chain MDP', which are common environments. However, it does not provide specific access information (links, DOIs, repositories, or formal citations) for these datasets or environments. |
| Dataset Splits | No | The paper does not specify exact training, validation, or test dataset splits, percentages, or absolute sample counts. |
| Hardware Specification | No | The paper does not specify any particular hardware used for running the experiments (e.g., GPU/CPU models, memory, cloud platforms). |
| Software Dependencies | No | The paper mentions software components like 'DQN', 'C51', 'S51', 'Adam' (optimizer), and general programming aspects, but does not provide specific version numbers for any libraries, frameworks, or languages used. |
| Experiment Setup | Yes | We used the same hyperparameters for all algorithms, except for step sizes, where we chose the step size that gave the best performance for each algorithm. We otherwise use the usual agent infrastructure from DQN, including a replay memory of capacity 50,000 and a target network which is updated after every 10 training steps. We update the agent by sampling batches of 128 transitions from the replay memory. In the Cart Pole task we found that DQN often diverged with the gradient descent optimizer, so we used Adam for all the algorithms, and chose the learning rate parameter that gave the best performance for each. (A configuration sketch based on this description follows the table.) |
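
Since the paper releases no code, the following is a minimal sketch of the setup quoted in the Experiment Setup row. It only encodes the numbers the authors report (replay capacity 50,000, target-network updates every 10 training steps, batch size 128, Adam with a per-algorithm step-size sweep); the choice of OpenAI Gym and PyTorch, the network architecture, the learning-rate grid, and the helper names (`make_env`, `QNetwork`, `build_agent`) are assumptions made for illustration, not details taken from the paper.

```python
# Hypothetical reconstruction of the reported experiment configuration.
# Framework choices (Gym, PyTorch) and the learning-rate grid are assumptions,
# not values stated by the authors.

from collections import deque

import gym
import torch
import torch.nn as nn

# Hyperparameters quoted in the "Experiment Setup" row.
REPLAY_CAPACITY = 50_000        # replay memory of capacity 50,000
TARGET_UPDATE_PERIOD = 10       # target network updated after every 10 training steps
BATCH_SIZE = 128                # batches of 128 transitions per update

# Step sizes were tuned per algorithm; this grid is illustrative only.
CANDIDATE_LEARNING_RATES = [1e-4, 3e-4, 1e-3]


def make_env(name: str = "CartPole-v1") -> gym.Env:
    """Cart Pole and Acrobot are standard Gym classic-control tasks; the paper's
    gridworld and 3-state chain MDP would require custom implementations."""
    return gym.make(name)


class QNetwork(nn.Module):
    """Small fully connected value network; the paper's exact architecture is not restated here."""

    def __init__(self, obs_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, x):
        return self.net(x)


def build_agent(env: gym.Env, learning_rate: float):
    """Assemble the pieces named in the setup: online/target networks, Adam, replay buffer."""
    obs_dim = env.observation_space.shape[0]
    num_actions = env.action_space.n
    online = QNetwork(obs_dim, num_actions)
    target = QNetwork(obs_dim, num_actions)
    target.load_state_dict(online.state_dict())          # target starts as a copy of the online net
    optimizer = torch.optim.Adam(online.parameters(), lr=learning_rate)  # Adam for all algorithms
    replay = deque(maxlen=REPLAY_CAPACITY)                # FIFO replay memory
    return online, target, optimizer, replay
```

Under this reading, each agent variant (DQN, C51, S51) would be trained once per candidate learning rate and the best-performing run kept, mirroring the per-algorithm step-size selection the authors describe.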