Actor-Critic based Improper Reinforcement Learning
Authors: Mohammadi Zaki, Avi Mohan, Aditya Gopalan, Shie Mannor
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical results on (i) the standard control-theoretic benchmark of stabilizing a cartpole and (ii) a constrained queueing task show that the improper policy optimization algorithm can stabilize the system even when the base policies at its disposal are unstable; the authors state that they "corroborate our theory using extensive simulation studies." |
| Researcher Affiliation | Collaboration | (1) Department of ECE, IISc, Bangalore, India; (2) Boston University, Massachusetts, USA; (3) Faculty of Electrical Engineering, Technion, Haifa, Israel and NVIDIA Research, Israel. |
| Pseudocode | Yes | Algorithm 1 Soft Max PG, Algorithm 2 Actor-Critic based Improper RL (ACIL), Algorithm 3 Critic-TD Subroutine, Algorithm 4 Projection-free Policy Gradient (for MABs), Algorithm 5 Softmax PG with Gradient Estimation (SPGE), Algorithm 6 Grad Est (subroutine for SPGE). An illustrative sketch of the softmax mixture idea these algorithms build on is given below the table. |
| Open Source Code | No | The paper does not provide any statements about releasing source code, nor does it include links to a code repository for the described methodology. |
| Open Datasets | No | The paper refers to standard control benchmarks like the 'Cartpole system' and 'constrained queueing task' for its simulations, but it does not specify or provide access information (links, DOIs, or formal citations) for any publicly available datasets used for training. |
| Dataset Splits | No | The paper discusses simulation studies and algorithmic parameters like 'batch-sizes' but does not specify explicit training, validation, or test dataset splits. |
| Hardware Specification | No | The paper mentions 'extensive simulation studies' but does not provide any specific hardware details such as GPU or CPU models, or cloud infrastructure used for running these simulations. |
| Software Dependencies | No | The paper mentions 'Open AI gym' for Cartpole experiments but does not provide specific version numbers for any software dependencies, such as libraries, frameworks, or solvers. |
| Experiment Setup | Yes | In the simulations, the learning rate is set to 10^-4, #runs = 10, #rollouts = 10, rollout length l_t = 30, discount factor γ = 0.9, and α = 1/#runs. All simulations are run for 20 trials and the reported results are averaged over them; queue sizes are capped at 1000. For the queueing-theoretic simulations of Algorithm 2 (ACIL), α = 10^-4 and β = 10^-3, the feature map is the identity φ(s) = s, where s is the current state of the system, an N-length vector whose ith entry is the length of the ith queue; λ = 0.1, and the remaining parameters are B = 50, H = 30, and Tc = 20 (see the configuration sketch below the table). |
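
For reference, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration. This is a minimal sketch: the dictionary keys are hypothetical names chosen for readability and are not taken from the paper, while the values are the ones reported above.

```python
# Hyperparameters reported for the simulations (key names are hypothetical).
CARTPOLE_PG_CONFIG = {
    "learning_rate": 1e-4,
    "num_runs": 10,
    "num_rollouts": 10,
    "rollout_length": 30,      # l_t = 30
    "discount_gamma": 0.9,
    "mixture_step_alpha": 0.1, # alpha = 1 / #runs = 1/10
    "num_trials": 20,          # results averaged over 20 trials
}

QUEUEING_ACIL_CONFIG = {
    "actor_step_alpha": 1e-4,
    "critic_step_beta": 1e-3,
    "feature_map": "identity",  # phi(s) = s, s = vector of the N queue lengths
    "lambda_": 0.1,
    "batch_size_B": 50,
    "horizon_H": 30,
    "critic_steps_Tc": 20,
    "queue_cap": 1000,          # queue sizes capped at 1000
}
```
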
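The sketch below illustrates the softmax-based improper mixture that Algorithms 1–2 build on: a small set of fixed base controllers is combined through softmax weights, and the weights are updated with a REINFORCE-style policy-gradient step. It is only a minimal sketch under assumptions, not the paper's exact procedure; the names `run_episode`, `base_policies`, and the per-step sampling of a controller are illustrative choices, and `env` is assumed to expose `reset() -> state` and `step(action) -> (state, reward, done)`.

```python
import numpy as np

def softmax(theta):
    """Numerically stable softmax over the mixture parameters."""
    z = theta - np.max(theta)
    e = np.exp(z)
    return e / e.sum()

def run_episode(env, base_policies, weights, horizon, gamma):
    """Roll out one episode, sampling a base controller at every step
    according to the softmax mixture weights. Returns the discounted
    return and the per-controller selection counts."""
    state = env.reset()
    G, counts = 0.0, np.zeros(len(base_policies))
    for t in range(horizon):
        k = np.random.choice(len(base_policies), p=weights)
        counts[k] += 1
        action = base_policies[k](state)
        state, reward, done = env.step(action)
        G += (gamma ** t) * reward
        if done:
            break
    return G, counts

def softmax_pg(env, base_policies, iters=200, rollouts=10,
               horizon=30, gamma=0.9, lr=1e-4):
    """REINFORCE-style update of the mixture parameters theta.
    Since grad log pi_theta(k) = e_k - softmax(theta), the episode-level
    score is (counts - num_steps * weights), scaled by the return G."""
    theta = np.zeros(len(base_policies))
    for _ in range(iters):
        grad = np.zeros_like(theta)
        weights = softmax(theta)
        for _ in range(rollouts):
            G, counts = run_episode(env, base_policies, weights, horizon, gamma)
            grad += G * (counts - counts.sum() * weights)
        theta += lr * grad / rollouts
    return softmax(theta)  # learned mixture over the base controllers
```
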