Off-Policy Average Reward Actor-Critic with Deterministic Policy Search
Authors: Naman Saxena, Subhojyoti Khastagir, Shishir Kolathaya, Shalabh Bhatnagar
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare the average reward performance of our proposed ARO-DDPG algorithm and observe better empirical performance compared to state-of-the-art on-policy average reward actor-critic algorithms over MuJoCo-based environments. |
| Researcher Affiliation | Academia | (1) Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India; (2) Robert Bosch Centre for Cyber-Physical Systems, Indian Institute of Science, Bangalore, India. Correspondence to: Naman Saxena <namansaxena@iisc.ac.in>. |
| Pseudocode | Yes | Algorithm 1 (Off-Policy) ARO-DDPG Practical Algorithm; Algorithm 2 On-policy AR-DPG with Linear FA; Algorithm 3 Off-policy AR-DPG with Linear FA; Algorithm 4 On-policy AR-DPG with Linear FA; Algorithm 5 Off-policy AR-DPG with Linear FA |
| Open Source Code | Yes | A PyTorch implementation of ARO-DDPG can be found at this URL: https://github.com/namansaxena9/ARODDPG |
| Open Datasets | Yes | We conducted experiments on six different environments using the DeepMind Control Suite (Tassa et al., 2018); an environment-loading sketch follows the table. |
| Dataset Splits | No | The paper discusses training and evaluation phases with specific episode lengths, but does not provide explicit dataset splits for training, validation, or testing, nor does it refer to standard predefined splits for the DeepMind Control Suite. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions a 'Pytorch implementation' and 'MuJoCo-based environments' but does not specify version numbers for these or other software libraries, which is necessary for reproducibility. |
| Experiment Setup | Yes | The paper includes a 'Hyperparameter' table detailing specific values for Buffer Size, Total Environment Steps, Batch size, Evaluation Frequency, Training Episode Length, Evaluation Episode Length, Activation Function, Learning rate (Actor, Differential Q-value function, Average reward parameter), No. of Hidden Layers, No. of Nodes in Hidden Layer, Update frequency, No. of Critic updates, No. of Actor updates, and Polyak averaging constant. A sketch showing how several of these enter one update step follows the table. |
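
Since the table only names the benchmark, here is a minimal sketch of loading a DeepMind Control Suite task with the `dm_control` package. The `cheetah`/`run` pair is an assumption chosen for illustration; the six environments actually used in the paper are not enumerated in this table.

```python
# Minimal sketch: loading one DeepMind Control Suite task via dm_control.
# The domain/task pair is illustrative; the paper's six environments are
# not enumerated in this table.
import numpy as np
from dm_control import suite

env = suite.load(domain_name="cheetah", task_name="run")
time_step = env.reset()
action_spec = env.action_spec()

while not time_step.last():
    # Uniform random action within the bounded action spec, as a
    # stand-in for the deterministic actor.
    action = np.random.uniform(action_spec.minimum,
                               action_spec.maximum,
                               size=action_spec.shape)
    time_step = env.step(action)
```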
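To make the roles of the listed hyperparameters concrete (learning rates for the actor, the differential Q-value function, and the average reward parameter; the Polyak averaging constant), below is a hedged sketch of one ARO-DDPG-style update step. It assumes the standard off-policy average-reward actor-critic structure the abstract describes; every function name, signature, and default value here is illustrative, not the authors' implementation (see the linked repository for that).

```python
# Hedged sketch of one ARO-DDPG-style update, assuming the standard
# average-reward off-policy actor-critic structure: a differential
# Q-critic, a learned average-reward scalar rho, a deterministic actor,
# and Polyak-averaged target networks. Names and defaults are
# illustrative, not the authors' exact implementation.
import torch
import torch.nn.functional as F

def aro_ddpg_update(batch, actor, critic, actor_target, critic_target,
                    rho, actor_opt, critic_opt, rho_lr=1e-3, tau=0.005):
    s, a, r, s2 = batch  # tensors sampled from the replay buffer

    with torch.no_grad():
        a2 = actor_target(s2)
        # Average-reward Bellman target: no discount factor; the learned
        # average reward rho is subtracted from the one-step reward.
        target = r - rho + critic_target(s2, a2)

    # Critic update on the differential Q-value function.
    q = critic(s, a)
    critic_loss = F.mse_loss(q, target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # The average-reward parameter tracks the mean TD error.
    with torch.no_grad():
        rho = rho + rho_lr * float((target - q).mean())

    # Deterministic policy gradient: ascend Q(s, mu(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Polyak averaging of the target networks.
    with torch.no_grad():
        for p, p_t in zip(critic.parameters(), critic_target.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)
        for p, p_t in zip(actor.parameters(), actor_target.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)

    return rho
```

The key departure from discounted DDPG is the critic target: `r - rho + Q_target(s', mu_target(s'))` replaces `r + gamma * Q_target(s', mu_target(s'))`, with `rho` adapted toward the long-run average reward rather than applying a discount factor.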