RVI-SAC: Average Reward Off-Policy Deep Reinforcement Learning

Authors: Yukinari Hisaki, Isao Ono

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We apply our method to the Gymnasium's Mujoco tasks, a subset of locomotion tasks, and demonstrate that RVI-SAC shows competitive performance compared to existing methods. and 4. Experiment: In our benchmark experiments, we aim to verify two aspects: (1) A comparison of the performance between RVI-SAC, SAC (Haarnoja et al., 2018b) with various discount rates, and the existing off-policy average reward DRL method, ARO-DDPG (Saxena et al., 2023).
Researcher Affiliation | Academia | Yukinari Hisaki 1, Isao Ono 1; 1 Tokyo Institute of Technology, Yokohama, Kanagawa, Japan. Correspondence to: Yukinari Hisaki <hiskai.y@ic.c.titech.ac.jp>, Isao Ono <isao@c.titech.ac.jp>.
Pseudocode | Yes | Appendix B (Overall RVI-SAC algorithm and implementation) and Algorithm 1: RVI-SAC
Open Source Code | Yes | The source code for this experiment can be found on our GitHub repository at https://github.com/yhisaki/average-reward-drl.
Open Datasets | Yes | we conducted benchmark experiments using six tasks (Ant, HalfCheetah, Hopper, Walker2d, Humanoid, and Swimmer) implemented in the Gymnasium (Towers et al., 2023) and MuJoCo physical simulator (Todorov et al., 2012).
Dataset Splits | No | No specific dataset split information for training, validation, or testing was provided; the paper only makes general mentions of 'training' and 'evaluation'.
Hardware Specification | No | No specific hardware details (GPU/CPU models, memory, or cloud instance types) used for running the experiments were mentioned.
Software Dependencies | No | The paper mentions 'Gymnasium (Towers et al., 2023) and MuJoCo physical simulator (Todorov et al., 2012)' but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | Appendix E (Hyperparameter settings) and Table 1 (Hyperparameters of RVI-SAC and SAC): We summarize the hyperparameters used in RVI-SAC and SAC in Table 1. We used the same hyperparameters for ARO-DDPG as Saxena et al. (2023). [Table 1 lists: Discount Factor γ, Optimizer, Learning Rate, Batch Size |B|, Replay Buffer Size |D|, Critic Network, Actor Network, Activation Function, Target Smoothing Coefficient τ, Entropy Target H, Critic Network for Reset, Delayed f(Q) Update Parameter κ, Termination Frequency Target ϵ_reset]
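
For readers who want a concrete picture of the update referenced in the Pseudocode row, the following is a minimal sketch of an average-reward (RVI-style) soft Q target in PyTorch. It is not the paper's Algorithm 1: `policy`, `q_target_net`, `rho`, and `alpha` are hypothetical placeholders, and the paper's reset-cost critic and delayed f(Q) update are omitted.

```python
import torch

def rvi_soft_q_target(reward, next_obs, q_target_net, policy, rho, alpha):
    """Average-reward (RVI-style) soft Q target: subtract a reward-rate
    estimate rho instead of discounting with a factor gamma.

    `policy`, `q_target_net`, `rho`, and `alpha` are hypothetical
    placeholders, not objects from the paper's implementation.
    """
    with torch.no_grad():
        # Sample a next action and its log-probability from the current policy.
        next_action, next_log_prob = policy.sample(next_obs)
        # Soft value of the next state: Q(s', a') - alpha * log pi(a' | s').
        soft_next_value = q_target_net(next_obs, next_action) - alpha * next_log_prob
        # No discount factor; rho plays the role of f(Q) in RVI Q-learning.
        target = reward - rho + soft_next_value
    return target
```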
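
The Open Datasets row lists six Gymnasium/MuJoCo locomotion tasks. A minimal sketch of instantiating them with the Gymnasium API follows; the `-v4` version suffixes are an assumption, since the paper does not state which environment versions were used.

```python
import gymnasium as gym

# Six MuJoCo locomotion tasks named in the paper; the "-v4" suffixes are an
# assumption (the paper does not specify environment versions).
TASK_IDS = ["Ant-v4", "HalfCheetah-v4", "Hopper-v4",
            "Walker2d-v4", "Humanoid-v4", "Swimmer-v4"]

envs = {task_id: gym.make(task_id) for task_id in TASK_IDS}

# Reset one environment and take a random step to confirm it runs.
obs, info = envs["Ant-v4"].reset(seed=0)
action = envs["Ant-v4"].action_space.sample()
obs, reward, terminated, truncated, info = envs["Ant-v4"].step(action)
```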
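
The Experiment Setup row names the hyperparameters of the paper's Table 1 without their values. As a purely illustrative sketch of how such a configuration could be organized, the dataclass below mirrors those field names; no defaults are filled in, because the actual values must be taken from Table 1 or the linked GitHub repository.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RVISACConfig:
    """Hyperparameter names mirroring the paper's Table 1.

    All defaults are deliberately left unset: the actual values are given in
    Table 1 of the paper and in the authors' GitHub repository.
    """
    learning_rate: Optional[float] = None
    batch_size: Optional[int] = None
    replay_buffer_size: Optional[int] = None
    target_smoothing_tau: Optional[float] = None
    entropy_target: Optional[float] = None
    delayed_f_q_update_kappa: Optional[float] = None              # RVI-SAC-specific
    termination_frequency_target_eps_reset: Optional[float] = None  # RVI-SAC-specific
```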