RUDDER: Return Decomposition for Delayed Rewards
Authors: Jose A. Arjona-Medina, Michael Gillhofer, Michael Widrich, Thomas Unterthiner, Johannes Brandstetter, Sepp Hochreiter
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On artificial tasks with delayed rewards, RUDDER is significantly faster than MC and exponentially faster than Monte Carlo Tree Search (MCTS), TD(λ), and reward shaping approaches. At Atari games, RUDDER on top of a Proximal Policy Optimization (PPO) baseline improves the scores, which is most prominent at games with delayed rewards. |
| Researcher Affiliation | Academia | Jose A. Arjona-Medina, Michael Gillhofer, Michael Widrich, Thomas Unterthiner, Johannes Brandstetter, Sepp Hochreiter; LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Austria; also at the Institute of Advanced Research in Artificial Intelligence (IARAI) |
| Pseudocode | No | The paper describes the RUDDER algorithm in prose but does not include a formal pseudocode block or algorithm listing (see the illustrative sketch after this table). |
| Open Source Code | Yes | Source code is available at https://github.com/ml-jku/rudder |
| Open Datasets | Yes | RUDDER is evaluated on three artificial tasks with delayed rewards. [...] Furthermore, we compare RUDDER with a Proximal Policy Optimization (PPO) baseline on 52 Atari games of the Arcade Learning Environment (ALE) [9] and OpenAI Gym [13]. |
| Dataset Splits | No | The paper does not specify exact training, validation, and test split percentages or sample counts for any of the datasets/environments used. It only mentions 'Training episodes' and that results are 'averaged over 3 different random seeds'. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., Python, TensorFlow, PyTorch versions). |
| Experiment Setup | No | The paper mentions 'A coarse hyperparameter optimization is performed for the PPO baseline' and 'RUDDER uses the same architectures, losses, and hyperparameters, which were optimized for the baseline.' It also states 'Training episodes end with losing a life or at maximal 108K frames'. However, it does not explicitly list the specific values of hyperparameters (e.g., learning rate, batch size, network architecture details) or provide comprehensive system-level training settings. |
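Since the paper provides no pseudocode, the following is a minimal, hypothetical sketch of RUDDER's central idea as described in the paper: train a sequence model (an LSTM) to predict the episode return, then redistribute the delayed reward across time steps as differences of consecutive return predictions. All names, network sizes, and the toy training loop below are illustrative assumptions; they are not taken from the paper or the ml-jku/rudder repository, and the paper's full method (e.g., auxiliary losses and contribution analysis) is not reproduced here.

```python
# Hypothetical sketch of return decomposition for delayed rewards.
# Assumption: a plain PyTorch LSTM stands in for the paper's return predictor.
import torch
import torch.nn as nn


class ReturnPredictor(nn.Module):
    """LSTM mapping a state-action sequence to a per-step return prediction."""

    def __init__(self, input_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, input_dim) -> (batch, time) return predictions
        h, _ = self.lstm(x)
        return self.head(h).squeeze(-1)


def redistribute_reward(predictions: torch.Tensor) -> torch.Tensor:
    """Redistributed reward at step t = prediction(t) - prediction(t-1)."""
    prev = torch.cat(
        [torch.zeros_like(predictions[:, :1]), predictions[:, :-1]], dim=1
    )
    return predictions - prev


if __name__ == "__main__":
    batch, T, dim = 4, 20, 8
    model = ReturnPredictor(dim)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Toy episodes: the episodic (delayed) return is only observed at the end.
    states = torch.randn(batch, T, dim)
    final_return = states[:, :, 0].sum(dim=1)  # arbitrary toy target

    for _ in range(200):
        pred = model(states)
        # Fit the last-step prediction to the observed episode return.
        loss = ((pred[:, -1] - final_return) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        dense_rewards = redistribute_reward(model(states))
    print(dense_rewards.shape)  # (batch, T): dense rewards replacing the delayed one
```

Under this sketch's assumptions, the redistributed rewards sum (per episode) to the predicted return, turning a single delayed reward into a dense per-step signal that a policy-gradient learner such as PPO could consume.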