RVI-SAC: Average Reward Off-Policy Deep Reinforcement Learning
Authors: Yukinari Hisaki, Isao Ono
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply our method to the Gymnasium's MuJoCo tasks, a subset of locomotion tasks, and demonstrate that RVI-SAC shows competitive performance compared to existing methods. and In our benchmark experiments, we aim to verify two aspects: (1) A comparison of the performance between RVI-SAC, SAC (Haarnoja et al., 2018b) with various discount rates, and the existing off-policy average reward DRL method, ARO-DDPG (Saxena et al., 2023). |
| Researcher Affiliation | Academia | Yukinari Hisaki, Isao Ono, Tokyo Institute of Technology, Yokohama, Kanagawa, Japan. Correspondence to: Yukinari Hisaki <hiskai.y@ic.c.titech.ac.jp>, Isao Ono <isao@c.titech.ac.jp>. |
| Pseudocode | Yes | Appendix B. Overall RVI-SAC algorithm and implementation and Algorithm 1 RVI-SAC (a hedged sketch of the average-reward target update appears after this table) |
| Open Source Code | Yes | The source code for this experiment can be found on our Git Hub repository at https://github.com/yhisaki/average-reward-drl. |
| Open Datasets | Yes | we conducted benchmark experiments using six tasks (Ant, HalfCheetah, Hopper, Walker2d, Humanoid, and Swimmer) implemented in the Gymnasium (Towers et al., 2023) and MuJoCo physical simulator (Todorov et al., 2012). (see the environment-setup sketch after this table) |
| Dataset Splits | No | No specific dataset split information for training, validation, or testing was provided, only general mentions of 'training' and 'evaluation'. |
| Hardware Specification | No | No specific hardware details (GPU/CPU models, memory, or cloud instance types) used for running experiments were mentioned. |
| Software Dependencies | No | The paper mentions 'Gymnasium (Towers et al., 2023) and MuJoCo physical simulator (Todorov et al., 2012)' but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | Appendix E. Hyperparameter settings and Table 1. Hyperparameters of RVI-SAC and SAC. We summarize the hyperparameters used in RVI-SAC and SAC in Table 1. We used the same hyperparameters for ARO-DDPG as Saxena et al. (2023). [Table lists: Discount Factor γ, Optimizer, Learning Rate, Batch Size \|B\|, Replay Buffer Size \|D\|, Critic Network, Actor Network, Activation Function, Target Smoothing Coefficient τ, Entropy Target H, Critic Network for Reset, Delayed f(Q) Update Parameter κ, Termination Frequency Target ϵ_reset] (a placeholder configuration sketch follows this table) |
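As context for the pseudocode row, the full procedure is given only in the paper's Appendix B (Algorithm 1). Below is a minimal, hedged sketch of a generic RVI-style average-reward soft Q target: instead of discounting by γ, an estimate of the average reward f(Q) is subtracted from the immediate reward. The helper interfaces (`policy.sample`, `q1_target`, `q2_target`) and the way `f_q` is computed are assumptions for illustration; the paper's actual implementation additionally uses a delayed f(Q) update (parameter κ) and a separate reset critic, which are not shown here.

```python
import torch

def rvi_soft_q_target(batch, policy, q1_target, q2_target, f_q, alpha):
    """Illustrative RVI-style average-reward soft Q target (not the paper's exact code).

    Discounting is replaced by subtracting f_q, an estimate of the average
    reward (gain). The policy/critic interfaces used here are assumed.
    """
    with torch.no_grad():
        # Sample the next action and its log-probability from the current policy.
        next_action, next_log_prob = policy.sample(batch["next_obs"])
        # Clipped double-Q trick, as in standard SAC.
        next_q = torch.min(
            q1_target(batch["next_obs"], next_action),
            q2_target(batch["next_obs"], next_action),
        )
        # Soft value of the next state: Q(s', a') - alpha * log pi(a'|s').
        soft_next_value = next_q - alpha * next_log_prob
        # Average-reward target: r - f(Q) + soft next-state value.
        return batch["reward"] - f_q + soft_next_value
```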
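For the "Open Datasets" row, the benchmark environments are standard Gymnasium MuJoCo locomotion tasks. The sketch below shows a minimal interaction loop; the `-v4` version suffix and the use of random actions are assumptions for illustration and are not stated in the paper.

```python
import gymnasium as gym

# The six locomotion tasks used in the paper's benchmark (Gymnasium MuJoCo).
# The "-v4" suffix is an assumption; the paper does not state the task version.
TASKS = ["Ant-v4", "HalfCheetah-v4", "Hopper-v4",
         "Walker2d-v4", "Humanoid-v4", "Swimmer-v4"]

for task in TASKS:
    env = gym.make(task)
    obs, info = env.reset(seed=0)
    # Roll out a few random steps just to show the interaction loop.
    for _ in range(5):
        action = env.action_space.sample()
        obs, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:
            obs, info = env.reset()
    env.close()
```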
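To make the "Experiment Setup" row concrete, the hyperparameter names listed from Table 1 can be collected into a single configuration mapping. The values below are placeholders based on common SAC defaults, not the values reported in the paper; consult Appendix E / Table 1 for the actual settings.

```python
# Hyperparameter names follow Table 1 of the paper; the values are
# placeholder SAC-style defaults, NOT the paper's reported settings.
RVI_SAC_CONFIG = {
    "optimizer": "Adam",                 # placeholder
    "learning_rate": 3e-4,               # placeholder
    "batch_size": 256,                   # placeholder
    "replay_buffer_size": 1_000_000,     # placeholder
    "critic_network": (256, 256),        # hidden layer sizes, placeholder
    "actor_network": (256, 256),         # hidden layer sizes, placeholder
    "activation": "ReLU",                # placeholder
    "target_smoothing_coef_tau": 0.005,  # placeholder
    "entropy_target_H": "-dim(action)",  # common SAC heuristic, placeholder
    "delayed_fQ_update_kappa": None,     # RVI-SAC specific; see Appendix E
    "termination_frequency_target_eps_reset": None,  # RVI-SAC specific; see Appendix E
}

# The SAC baseline additionally uses a discount factor gamma, which the paper
# sweeps over several values; RVI-SAC is average-reward and does not discount.
SAC_BASELINE_DISCOUNTS = [0.97, 0.99, 0.999]  # illustrative values only
```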