Adaptive Rational Activations to Boost Deep Reinforcement Learning
Authors: Quentin Delfosse, Patrick Schramowski, Martin Mundt, Alejandro Molina, Kristian Kersting
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We demonstrate that equipping popular algorithms with (joint) rational activations leads to consistent improvements on different games from the Atari Learning Environment benchmark, notably making DQN competitive to DDQN and Rainbow." and "We empirically demonstrate that rational activations bring significant improvements to DQN and Rainbow algorithms on Atari games and that our joint variant further increases performance." |
| Researcher Affiliation | Academia | 1 Computer Science Dept., TU Darmstadt; 2 German Center for Artificial Intelligence; 3 Hessian Center for Artificial Intelligence; 4 Centre for Cognitive Science, Darmstadt |
| Pseudocode | No | The paper describes methodologies and provides mathematical derivations, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Rational library: github.com/k4ntz/activation-functions; Experiments: github.com/ml-research/rational_rl. |
| Open Datasets | Yes | "15 different games of the Atari 2600 domain (Brockman et al., 2017)." and "For the classification experiments, we run on CIFAR10 and CIFAR100 (Krizhevsky et al., MIT License)" |
| Dataset Splits | No | The paper reports agent evaluations at fixed training steps, but does not explicitly describe a validation dataset split (e.g., percentages or counts for training, validation, and test sets). |
| Hardware Specification | Yes | "230,000 GPU hours, carried out on a DGX-2 Machine with Nvidia Tesla V100 with 32GB." and "A single run took more than 40 days on an NVIDIA Tesla V100 GPU." |
| Software Dependencies | No | The paper mentions the MushroomRL library, the Arcade Learning Environment, a CUDA-optimised implementation, and the Adam optimiser, but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | The target network is updated every 10K steps, with a replay buffer memory of initial size 50K, and maximum size 500K, except for Pong, for which all these values are divided by 10. The discount factor γ is set to 0.99 and the learning rate is 0.00025. We do not select the best policy among seeds between epochs. We use the simple ϵ-greedy exploration policy, with the ϵ decreasing linearly from 1 to 0.1 over 1M steps, and an ϵ of 0.05 is used for testing. |
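
The claims quoted in the Research Type row rest on replacing fixed activation functions with learnable rational (Padé-style) functions. Below is a minimal PyTorch sketch of such an activation, assuming the commonly used "safe" formulation R(x) = P(x) / (1 + |Q(x)|) with polynomial degrees (5, 4); it is an illustration only, not the authors' CUDA-optimised implementation released at github.com/k4ntz/activation-functions.

```python
import torch
import torch.nn as nn


class RationalActivation(nn.Module):
    """Learnable rational activation R(x) = P(x) / (1 + |Q(x)|).

    P(x) = a_0 + a_1 x + ... + a_m x^m and Q(x) = b_1 x + ... + b_n x^n,
    with all coefficients trained by backpropagation alongside the network
    weights. The absolute value keeps the denominator strictly positive
    ("safe" formulation). Degrees (m, n) = (5, 4) follow the common choice
    in the rational-activation literature.
    """

    def __init__(self, m: int = 5, n: int = 4):
        super().__init__()
        self.numerator = nn.Parameter(torch.randn(m + 1) * 0.1)
        self.denominator = nn.Parameter(torch.randn(n) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Evaluate numerator and denominator polynomials elementwise.
        p = sum(a * x ** j for j, a in enumerate(self.numerator))
        q = sum(b * x ** (k + 1) for k, b in enumerate(self.denominator))
        return p / (1.0 + q.abs())
```

As we read the paper, the "joint" variant shares a single set of rational parameters across layers; in a sketch like this, that amounts to reusing one RationalActivation instance wherever the network would otherwise hold one per layer.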
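
For reproduction purposes, the hyperparameters quoted in the Experiment Setup row map onto a small configuration and a linear ε-greedy schedule. The sketch below uses hypothetical names (DQN_CONFIG, epsilon) and simply restates the quoted values; it is not the authors' training code.

```python
# Values taken from the Experiment Setup row above; names are illustrative.
DQN_CONFIG = {
    "target_update_frequency": 10_000,   # target network sync interval (steps)
    "replay_initial_size": 50_000,       # replay buffer warm-up size
    "replay_max_size": 500_000,          # replay buffer capacity
    "gamma": 0.99,                       # discount factor
    "learning_rate": 0.00025,
    "epsilon_test": 0.05,                # ε used during evaluation
}
# Per the paper, the step/size values above are divided by 10 for Pong.


def epsilon(step: int, start: float = 1.0, end: float = 0.1,
            decay_steps: int = 1_000_000) -> float:
    """Linear ε-greedy schedule: ε decreases from 1.0 to 0.1 over 1M steps."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```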