Off-Policy Average Reward Actor-Critic with Deterministic Policy Search

Authors: Naman Saxena, Subhojyoti Khastagir, Shishir Kolathaya, Shalabh Bhatnagar

ICML 2023

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare the average reward performance of our proposed ARO-DDPG algorithm and observe better empirical performance compared to state-of-the-art on-policy average reward actor-critic algorithms over MuJoCo-based environments. |
| Researcher Affiliation | Academia | Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India; Robert Bosch Centre for Cyber-Physical Systems, Indian Institute of Science, Bangalore, India. Correspondence to: Naman Saxena <namansaxena@iisc.ac.in>. |
| Pseudocode | Yes | Algorithm 1 (Off-Policy) ARO-DDPG Practical Algorithm; Algorithm 2 On-policy AR-DPG with Linear FA; Algorithm 3 Off-policy AR-DPG with Linear FA; Algorithm 4 On-policy AR-DPG with Linear FA; Algorithm 5 Off-policy AR-DPG with Linear FA |
| Open Source Code | Yes | PyTorch implementation of ARO-DDPG can be found at this URL: https://github.com/namansaxena9/ARODDPG |
| Open Datasets | Yes | We conducted experiments on six different environments using the DeepMind Control Suite (Tassa et al., 2018). |
| Dataset Splits | No | The paper discusses training and evaluation phases with specific episode lengths, but it does not provide explicit dataset splits for training, validation, or testing, nor does it refer to standard predefined splits for the DeepMind Control Suite. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions a PyTorch implementation and MuJoCo-based environments but does not specify version numbers for these or other software libraries, which is necessary for reproducibility. |
| Experiment Setup | Yes | The paper includes a hyperparameter table detailing specific values for buffer size, total environment steps, batch size, evaluation frequency, training episode length, evaluation episode length, activation function, learning rates (actor, differential Q-value function, average reward parameter), number of hidden layers, number of nodes per hidden layer, update frequency, number of critic updates, number of actor updates, and the Polyak averaging constant. |
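The Pseudocode row lists Algorithm 1, the practical (Off-Policy) ARO-DDPG algorithm. Below is a minimal sketch of a DDPG-style update with a differential Q-value critic and a running average-reward estimate, which is the style of update that algorithm describes; the network sizes, names, step sizes, and the exact form of the average-reward update here are illustrative assumptions, not the authors' implementation.

```python
# Sketch of one off-policy actor-critic update with an average-reward
# (differential) critic. All sizes and step sizes are placeholders.
import copy
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2  # placeholder dimensions

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                      nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
target_actor = copy.deepcopy(actor)
target_critic = copy.deepcopy(critic)
rho = torch.zeros(1)  # running estimate of the average reward

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
rho_lr, tau = 1e-3, 0.005  # placeholder step size and Polyak constant


def update(s, a, r, s2):
    """One update on a replay-buffer batch: s (B, obs), a (B, act), r (B, 1), s2 (B, obs)."""
    # Differential TD target: reward minus the average-reward estimate plus
    # the target critic's value at the next state under the target policy.
    with torch.no_grad():
        a2 = target_actor(s2)
        target = r - rho + target_critic(torch.cat([s2, a2], dim=-1))

    q = critic(torch.cat([s, a], dim=-1))
    critic_loss = ((q - target) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Track the average reward with the mean TD error (one common choice).
    with torch.no_grad():
        rho.add_(rho_lr * (target - q).mean())

    # Deterministic policy gradient: push actions toward higher Q-values.
    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Polyak averaging of the target networks.
    with torch.no_grad():
        for p, tp in zip(critic.parameters(), target_critic.parameters()):
            tp.mul_(1 - tau).add_(tau * p)
        for p, tp in zip(actor.parameters(), target_actor.parameters()):
            tp.mul_(1 - tau).add_(tau * p)
```

A full reproduction would additionally need exploration noise, a replay buffer, and the update frequencies and critic/actor update counts from the paper's hyperparameter table.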
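The Open Datasets row refers to the DeepMind Control Suite. A short sketch of loading one suite task and stepping it with random actions, assuming the dm_control package, is shown below; the cheetah/run pair is only an illustrative choice and not necessarily one of the six environments used in the paper.

```python
# Load a DeepMind Control Suite task and run one episode with random actions.
import numpy as np
from dm_control import suite

env = suite.load(domain_name="cheetah", task_name="run")
action_spec = env.action_spec()

time_step = env.reset()
while not time_step.last():
    action = np.random.uniform(action_spec.minimum, action_spec.maximum,
                               size=action_spec.shape)
    time_step = env.step(action)
    # time_step.reward and time_step.observation hold the transition data.
```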
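Because the Software Dependencies row notes that no library versions are given, a reproduction would need to record the versions it actually installs. A small sketch of logging them via importlib.metadata follows; the package names are the usual PyPI distribution names and may differ from what the authors used.

```python
# Record installed versions of the key dependencies for a reproduction run.
from importlib.metadata import version, PackageNotFoundError

for pkg in ["torch", "dm-control", "mujoco", "numpy"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")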
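The Experiment Setup row lists the fields of the paper's hyperparameter table. A sketch of how those fields might be mirrored as a config object in a reproduction is given below; the field names follow the table's entries, but every value is a generic placeholder, not a value reported in the paper.

```python
# Placeholder config mirroring the fields of the paper's hyperparameter table.
from dataclasses import dataclass


@dataclass
class ARODDPGConfig:
    buffer_size: int = 1_000_000        # placeholder
    total_env_steps: int = 1_000_000    # placeholder
    batch_size: int = 256               # placeholder
    eval_frequency: int = 10_000        # placeholder, in environment steps
    train_episode_length: int = 1_000   # placeholder
    eval_episode_length: int = 1_000    # placeholder
    activation: str = "relu"            # placeholder
    actor_lr: float = 1e-4              # placeholder
    critic_lr: float = 1e-3             # placeholder (differential Q-value function)
    avg_reward_lr: float = 1e-3         # placeholder (average reward parameter)
    num_hidden_layers: int = 2          # placeholder
    hidden_layer_size: int = 256        # placeholder
    update_frequency: int = 1           # placeholder
    critic_updates_per_step: int = 1    # placeholder
    actor_updates_per_step: int = 1     # placeholder
    polyak: float = 0.005               # placeholder (Polyak averaging constant)
```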