Actor-Critic based Improper Reinforcement Learning
Authors: Mohammadi Zaki, Avi Mohan, Aditya Gopalan, Shie Mannor
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical results on (i) the standard control-theoretic benchmark of stabilizing a cartpole and (ii) a constrained queueing task show that the improper policy optimization algorithm can stabilize the system even when the base policies at its disposal are unstable; the authors state that they "corroborate our theory using extensive simulation studies." |
| Researcher Affiliation | Collaboration | (1) Department of ECE, IISc, Bangalore, India; (2) Boston University, Massachusetts, USA; (3) Faculty of Electrical Engineering, Technion, Haifa, Israel and NVIDIA Research, Israel. |
| Pseudocode | Yes | Algorithm 1 Soft Max PG, Algorithm 2 Actor-Critic based Improper RL (ACIL), Algorithm 3 Critic-TD Subroutine, Algorithm 4 Projection-free Policy Gradient (for MABs), Algorithm 5 Softmax PG with Gradient Estimation (SPGE), Algorithm 6 Grad Est (subroutine for SPGE). An illustrative sketch of the softmax mixture idea these algorithms build on is given below the table. |
| Open Source Code | No | The paper does not provide any statements about releasing source code, nor does it include links to a code repository for the described methodology. |
| Open Datasets | No | The paper refers to standard control benchmarks like the 'Cartpole system' and 'constrained queueing task' for its simulations, but it does not specify or provide access information (links, DOIs, or formal citations) for any publicly available datasets used for training. |
| Dataset Splits | No | The paper discusses simulation studies and algorithmic parameters like 'batch-sizes' but does not specify explicit training, validation, or test dataset splits. |
| Hardware Specification | No | The paper mentions 'extensive simulation studies' but does not provide any specific hardware details such as GPU or CPU models, or cloud infrastructure used for running these simulations. |
| Software Dependencies | No | The paper mentions 'Open AI gym' for Cartpole experiments but does not provide specific version numbers for any software dependencies, such as libraries, frameworks, or solvers. |
| Experiment Setup | Yes | In the simulations, the learning rate is set to 10^-4, #runs = 10, #rollouts = 10, rollout length l_t = 30, discount factor γ = 0.9, and α = 1/#runs. All simulations are run for 20 trials and the reported results are averaged over them; queue sizes are capped at 1000. For the queueing-theoretic simulations of Algorithm 2 (ACIL), α = 10^-4 and β = 10^-3, the feature map is the identity φ(s) = s, where s is the current state of the system, an N-length vector whose ith entry is the length of the ith queue; λ = 0.1, and the remaining parameters are B = 50, H = 30, and Tc = 20 (see the configuration sketch below the table). |
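
For reference, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration. This is a minimal sketch: the dictionary keys are hypothetical names chosen for readability and are not taken from the paper, while the values are the ones reported above.

```python
# Hyperparameters reported for the simulations (key names are hypothetical).
CARTPOLE_PG_CONFIG = {
    "learning_rate": 1e-4,
    "num_runs": 10,
    "num_rollouts": 10,
    "rollout_length": 30,      # l_t = 30
    "discount_gamma": 0.9,
    "mixture_step_alpha": 0.1, # alpha = 1 / #runs = 1/10
    "num_trials": 20,          # results averaged over 20 trials
}

QUEUEING_ACIL_CONFIG = {
    "actor_step_alpha": 1e-4,
    "critic_step_beta": 1e-3,
    "feature_map": "identity",  # phi(s) = s, s = vector of the N queue lengths
    "lambda_": 0.1,
    "batch_size_B": 50,
    "horizon_H": 30,
    "critic_steps_Tc": 20,
    "queue_cap": 1000,          # queue sizes capped at 1000
}
```
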
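The sketch below illustrates the softmax-based improper mixture that Algorithms 1–2 build on: a small set of fixed base controllers is combined through softmax weights, and the weights are updated with a REINFORCE-style policy-gradient step. It is only a minimal sketch under assumptions, not the paper's exact procedure; the names `run_episode`, `base_policies`, and the per-step sampling of a controller are illustrative choices, and `env` is assumed to expose `reset() -> state` and `step(action) -> (state, reward, done)`.

```python
import numpy as np

def softmax(theta):
    """Numerically stable softmax over the mixture parameters."""
    z = theta - np.max(theta)
    e = np.exp(z)
    return e / e.sum()

def run_episode(env, base_policies, weights, horizon, gamma):
    """Roll out one episode, sampling a base controller at every step
    according to the softmax mixture weights. Returns the discounted
    return and the per-controller selection counts."""
    state = env.reset()
    G, counts = 0.0, np.zeros(len(base_policies))
    for t in range(horizon):
        k = np.random.choice(len(base_policies), p=weights)
        counts[k] += 1
        action = base_policies[k](state)
        state, reward, done = env.step(action)
        G += (gamma ** t) * reward
        if done:
            break
    return G, counts

def softmax_pg(env, base_policies, iters=200, rollouts=10,
               horizon=30, gamma=0.9, lr=1e-4):
    """REINFORCE-style update of the mixture parameters theta.
    Since grad log pi_theta(k) = e_k - softmax(theta), the episode-level
    score is (counts - num_steps * weights), scaled by the return G."""
    theta = np.zeros(len(base_policies))
    for _ in range(iters):
        grad = np.zeros_like(theta)
        weights = softmax(theta)
        for _ in range(rollouts):
            G, counts = run_episode(env, base_policies, weights, horizon, gamma)
            grad += G * (counts - counts.sum() * weights)
        theta += lr * grad / rollouts
    return softmax(theta)  # learned mixture over the base controllers
```
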