A Comparative Analysis of Expected and Distributional Reinforcement Learning
Authors: Clare Lyle, Marc G. Bellemare, Pablo Samuel Castro
AAAI 2019, pp. 4504-4511 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper we begin the investigation into this fundamental question by analyzing the differences in the tabular, linear approximation, and non-linear approximation settings. We prove that in many realizations of the tabular and linear approximation settings, distributional RL behaves exactly the same as expected RL. In cases where the two methods behave differently, distributional RL can in fact hurt performance when it does not induce identical behaviour. We then continue with an empirical analysis comparing distributional and expected RL methods in control settings with non-linear approximators to tease apart where the improvements from distributional RL methods are coming from. |
| Researcher Affiliation | Collaboration | ¹University of Oxford (work done while at Google Brain); ²Google Brain |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statements or links regarding open-source code for the described methodology. |
| Open Datasets | No | The paper mentions 'Atari 2600 games', 'Cart Pole', 'Acrobot', '12x12 gridworld environment', and '3-state chain MDP', which are common environments. However, it does not provide specific access information (links, DOIs, repositories, or formal citations) for these datasets or environments. |
| Dataset Splits | No | The paper does not specify exact training, validation, or test dataset splits, percentages, or absolute sample counts. |
| Hardware Specification | No | The paper does not specify any particular hardware used for running the experiments (e.g., GPU/CPU models, memory, cloud platforms). |
| Software Dependencies | No | The paper mentions software components like 'DQN', 'C51', 'S51', 'Adam' (optimizer), and general programming aspects, but does not provide specific version numbers for any libraries, frameworks, or languages used. |
| Experiment Setup | Yes | We used the same hyperparameters for all algorithms, except for step sizes, where we chose the step size that gave the best performance for each algorithm. We otherwise use the usual agent infrastructure from DQN, including a replay memory of capacity 50,000 and a target network which is updated after every 10 training steps. We update the agent by sampling batches of 128 transitions from the replay memory. In the Cart Pole task we found that DQN often diverged with the gradient descent optimizer, so we used Adam for all the algorithms, and chose the learning rate parameter that gave the best performance for each. (A configuration sketch based on this description follows the table.) |
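
Since the paper releases no code, the following is a minimal sketch of the setup quoted in the Experiment Setup row. It only encodes the numbers the authors report (replay capacity 50,000, target-network updates every 10 training steps, batch size 128, Adam with a per-algorithm step-size sweep); the choice of OpenAI Gym and PyTorch, the network architecture, the learning-rate grid, and the helper names (`make_env`, `QNetwork`, `build_agent`) are assumptions made for illustration, not details taken from the paper.

```python
# Hypothetical reconstruction of the reported experiment configuration.
# Framework choices (Gym, PyTorch) and the learning-rate grid are assumptions,
# not values stated by the authors.

from collections import deque

import gym
import torch
import torch.nn as nn

# Hyperparameters quoted in the "Experiment Setup" row.
REPLAY_CAPACITY = 50_000        # replay memory of capacity 50,000
TARGET_UPDATE_PERIOD = 10       # target network updated after every 10 training steps
BATCH_SIZE = 128                # batches of 128 transitions per update

# Step sizes were tuned per algorithm; this grid is illustrative only.
CANDIDATE_LEARNING_RATES = [1e-4, 3e-4, 1e-3]


def make_env(name: str = "CartPole-v1") -> gym.Env:
    """Cart Pole and Acrobot are standard Gym classic-control tasks; the paper's
    gridworld and 3-state chain MDP would require custom implementations."""
    return gym.make(name)


class QNetwork(nn.Module):
    """Small fully connected value network; the paper's exact architecture is not restated here."""

    def __init__(self, obs_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, x):
        return self.net(x)


def build_agent(env: gym.Env, learning_rate: float):
    """Assemble the pieces named in the setup: online/target networks, Adam, replay buffer."""
    obs_dim = env.observation_space.shape[0]
    num_actions = env.action_space.n
    online = QNetwork(obs_dim, num_actions)
    target = QNetwork(obs_dim, num_actions)
    target.load_state_dict(online.state_dict())          # target starts as a copy of the online net
    optimizer = torch.optim.Adam(online.parameters(), lr=learning_rate)  # Adam for all algorithms
    replay = deque(maxlen=REPLAY_CAPACITY)                # FIFO replay memory
    return online, target, optimizer, replay
```

Under this reading, each agent variant (DQN, C51, S51) would be trained once per candidate learning rate and the best-performing run kept, mirroring the per-algorithm step-size selection the authors describe.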