Internally Rewarded Reinforcement Learning

Authors: Mengdi Li, Xufeng Zhao, Jae Hee Lee, Cornelius Weber, Stefan Wermter

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that the proposed reward function can consistently stabilize the training process by reducing the impact of reward noise, which leads to faster convergence and higher performance compared with baselines in diverse tasks.
Researcher Affiliation | Academia | Knowledge Technology Group, Department of Informatics, University of Hamburg, Hamburg, Germany.
Pseudocode | No | The paper describes algorithms and methods but does not include formal pseudocode blocks or figures labeled 'Algorithm'.
Open Source Code | Yes | Project page: https://ir-rl.github.io/
Open Datasets | Yes | We adopt the dataset configuration of Mnih et al. (2014), and use two basic models for this task: the recurrent attention model (RAM) (Mnih et al., 2014) and the dynamic-time recurrent attention model (DT-RAM) (Li et al., 2017). ... We use the same experimental setup and basic model on the four-room environment as in the work of the discriminator disagreement intrinsic reward (DISDAIN) (Strouse et al., 2022). ... The setup is based on the task of object existence prediction (Li et al., 2021).
Dataset Splits | Yes | We generate 60k Cluttered MNIST images, of which 90% are used for training and the rest for validation.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models or memory specifications used for the experiments.
Software Dependencies | No | The paper mentions software such as Adam, PPO, and REINFORCE, and refers to code repositories for implementations, but does not provide version numbers for these packages or for other dependencies (e.g., Python or PyTorch versions).
Experiment Setup | Yes | RAM models are trained using REINFORCE (Williams, 1992) and optimized by Adam (Kingma & Ba, 2015) for 1500 epochs with a batch size of 128 and a learning rate of 3e-4.
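To make the Dataset Splits row concrete, here is a minimal PyTorch sketch of a 90%/10% split over 60k examples. The tensor shapes, the 10-class labels, and the use of random_split are illustrative assumptions, not details taken from the paper; only the 60k count, the split ratio, and the batch size of 128 come from the table above.

import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Stand-in tensors for the 60k generated Cluttered MNIST images; the 60x60
# resolution and the 10-class labels are illustrative assumptions.
images = torch.randn(60_000, 1, 60, 60)
labels = torch.randint(0, 10, (60_000,))
dataset = TensorDataset(images, labels)

# 90% training / 10% validation, as stated in the Dataset Splits row.
n_train = int(0.9 * len(dataset))
train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])

# Batch size 128 matches the Experiment Setup row.
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
val_loader = DataLoader(val_set, batch_size=128)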
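The Experiment Setup row can likewise be read as the following sketch of a REINFORCE update optimized with Adam at a learning rate of 3e-4, a batch size of 128, and 1500 epochs. The toy observations, the linear policy head, and the correctness-based reward are placeholders for the recurrent attention model, which the paper trains with REINFORCE but which is not reproduced here.

import torch

# Sketch of a REINFORCE (Williams, 1992) update with the settings quoted in
# the Experiment Setup row: Adam, learning rate 3e-4, batch size 128,
# 1500 epochs. The 16-dimensional toy observations, 10-way policy head, and
# reward definition are illustrative assumptions.
policy = torch.nn.Linear(16, 10)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

for epoch in range(1500):
    obs = torch.randn(128, 16)                 # one toy batch per epoch
    target = torch.randint(0, 10, (128,))      # ground-truth labels
    dist = torch.distributions.Categorical(logits=policy(obs))
    pred = dist.sample()                       # stochastic prediction
    reward = (pred == target).float()          # 1 if correct, else 0
    # REINFORCE: ascend reward-weighted log-probabilities, implemented as
    # minimizing the negative score-function objective.
    loss = -(dist.log_prob(pred) * reward).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()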