RUDDER: Return Decomposition for Delayed Rewards
Authors: Jose A. Arjona-Medina, Michael Gillhofer, Michael Widrich, Thomas Unterthiner, Johannes Brandstetter, Sepp Hochreiter
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On artificial tasks with delayed rewards, RUDDER is significantly faster than MC and exponentially faster than Monte Carlo Tree Search (MCTS), TD(λ), and reward shaping approaches. At Atari games, RUDDER on top of a Proximal Policy Optimization (PPO) baseline improves the scores, which is most prominent at games with delayed rewards. |
| Researcher Affiliation | Academia | Jose A. Arjona-Medina, Michael Gillhofer, Michael Widrich, Thomas Unterthiner, Johannes Brandstetter, Sepp Hochreiter; LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Austria; also at the Institute of Advanced Research in Artificial Intelligence (IARAI) |
| Pseudocode | No | The paper describes the RUDDER algorithm in prose but does not include a formal pseudocode block or algorithm listing (see the illustrative sketch after this table). |
| Open Source Code | Yes | Source code is available at https://github.com/ml-jku/rudder |
| Open Datasets | Yes | RUDDER is evaluated on three artificial tasks with delayed rewards. [...] Furthermore, we compare RUDDER with a Proximal Policy Optimization (PPO) baseline on 52 Atari games of the Arcade Learning Environment (ALE) [9] and OpenAI Gym [13]. |
| Dataset Splits | No | The paper does not specify exact training, validation, and test split percentages or sample counts for any of the datasets/environments used. It only mentions 'Training episodes' and that results are 'averaged over 3 different random seeds'. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., Python, TensorFlow, PyTorch versions). |
| Experiment Setup | No | The paper mentions 'A coarse hyperparameter optimization is performed for the PPO baseline' and 'RUDDER uses the same architectures, losses, and hyperparameters, which were optimized for the baseline.' It also states 'Training episodes end with losing a life or at maximal 108K frames'. However, it does not explicitly list the specific values of hyperparameters (e.g., learning rate, batch size, network architecture details) or provide comprehensive system-level training settings. |
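Since the paper provides no pseudocode, the following is a minimal, hypothetical sketch of RUDDER's central idea as described in the paper: train a sequence model (an LSTM) to predict the episode return, then redistribute the delayed reward across time steps as differences of consecutive return predictions. All names, network sizes, and the toy training loop below are illustrative assumptions; they are not taken from the paper or the ml-jku/rudder repository, and the paper's full method (e.g., auxiliary losses and contribution analysis) is not reproduced here.

```python
# Hypothetical sketch of return decomposition for delayed rewards.
# Assumption: a plain PyTorch LSTM stands in for the paper's return predictor.
import torch
import torch.nn as nn


class ReturnPredictor(nn.Module):
    """LSTM mapping a state-action sequence to a per-step return prediction."""

    def __init__(self, input_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, input_dim) -> (batch, time) return predictions
        h, _ = self.lstm(x)
        return self.head(h).squeeze(-1)


def redistribute_reward(predictions: torch.Tensor) -> torch.Tensor:
    """Redistributed reward at step t = prediction(t) - prediction(t-1)."""
    prev = torch.cat(
        [torch.zeros_like(predictions[:, :1]), predictions[:, :-1]], dim=1
    )
    return predictions - prev


if __name__ == "__main__":
    batch, T, dim = 4, 20, 8
    model = ReturnPredictor(dim)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Toy episodes: the episodic (delayed) return is only observed at the end.
    states = torch.randn(batch, T, dim)
    final_return = states[:, :, 0].sum(dim=1)  # arbitrary toy target

    for _ in range(200):
        pred = model(states)
        # Fit the last-step prediction to the observed episode return.
        loss = ((pred[:, -1] - final_return) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        dense_rewards = redistribute_reward(model(states))
    print(dense_rewards.shape)  # (batch, T): dense rewards replacing the delayed one
```

Under this sketch's assumptions, the redistributed rewards sum (per episode) to the predicted return, turning a single delayed reward into a dense per-step signal that a policy-gradient learner such as PPO could consume.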