Sparse Reinforcement Learning via Convex Optimization

Authors: Zhiwei Qin, Weichang Li, Firdaus Janoos

ICML 2014 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the performance of these algorithms through several classical examples. We demonstrate through experiment results that regularized PBR minimization might be preferable over regularized minimization of the fixed-point slack for value function approximation. Section 5. Experiments: We compared the proposed ADMM algorithm for BPDN-based sparse reinforcement learning with LSTD and DLSTD (Geist et al., 2012) on two test problems, which are described in the next two sections. Figures 1 and 2 show the comparison among BPDN, DLSTD, and LSTD on the approximation error (RMSE) of the Q value function and the simulated reward of the learned policies for two cases of irrelevant features. The average CPU times for BPDN on one instance of value approximation are 11.4s for Nnoise = 500 and 15.7s for Nnoise = 1000, and those for D-LSTD are 6.6s and 31.7s respectively.
Researcher Affiliation | Collaboration | Zhiwei (Tony) Qin (TQIN@WALMARTLABS.COM), @Walmart Labs, 850 Cherry Ave, San Bruno, CA 94066; Weichang Li and Firdaus Janoos (WEICHANG.LI, FIRDAUS.JANOOS@EXXONMOBIL.COM), Exxon Mobil Corporate Strategic Research, 1545 Rt. 22 East, Annandale, NJ 08801. The work was completed when the first author was a summer intern with Exxon Mobil Corporate Strategic Research in 2012, working towards a Ph.D. degree at Columbia University, New York.
Pseudocode | Yes | Algorithm 1 ADMM-BPDN and Algorithm 2 RDA with LMS update are provided in Section 4. (A hedged ADMM sketch appears after this table.)
Open Source Code | No | The paper does not state that the authors release open-source code for their methodology. It mentions: “Special thanks to Bo Liu and Ji Liu for sharing the sample ROTD code”, which indicates they used someone else’s code, not that they provided their own.
Open Datasets | Yes | Section 5.1.1, CHAIN WALK: This is the classical 20-state test example from (Lagoudakis & Parr, 2003). (A hedged simulator sketch appears after this table.)
Dataset Splits | No | The paper mentions collecting training samples (e.g., 1000 samples gathered off-policy over 100 episodes of 10 steps each) but does not describe any explicit train/validation/test splits.
Hardware Specification | No | The paper mentions “average CPU times” but does not specify the CPU model, GPU, or any other detailed hardware specifications used for the experiments.
Software Dependencies | No | The paper mentions specific software like the “Mosek LP solver” and references algorithms (LARS, FISTA, FPC-BB) but does not provide version numbers for any software dependencies used in the experiments.
Experiment Setup | Yes | For approximating the optimal value function, we collected 1000 training samples off-policy by following a random policy for 100 episodes, 10 steps each. For the Star example, we set ρ = 0 to verify the convergence of the PBR to zero, and hence, the correctness of the algorithm. (A sketch of this collection loop appears below.)
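
The Pseudocode row above refers to Algorithm 1 (ADMM-BPDN). The paper's exact splitting and stopping criteria are not reproduced here; the following is a minimal sketch of ADMM applied to an l1-regularized least-squares (LASSO-style) surrogate of BPDN, min_x 0.5*||Ax - b||^2 + lambda*||x||_1, where the variable names, the penalty parameter rho, and the iteration count are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Elementwise soft-thresholding, the proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def admm_l1_least_squares(A, b, lam, rho=1.0, n_iter=200):
    """ADMM sketch for min_x 0.5*||A x - b||_2^2 + lam*||x||_1.

    Not the paper's Algorithm 1 (which addresses the constrained BPDN form),
    but the same x/z splitting with a scaled dual variable u.
    """
    n = A.shape[1]
    x = np.zeros(n)
    z = np.zeros(n)
    u = np.zeros(n)
    # Cache the Cholesky factor used in every x-update.
    AtA = A.T @ A
    Atb = A.T @ b
    L = np.linalg.cholesky(AtA + rho * np.eye(n))
    for _ in range(n_iter):
        # x-update: ridge-regularized least-squares step.
        rhs = Atb + rho * (z - u)
        x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))
        # z-update: proximal (soft-thresholding) step enforcing sparsity.
        z = soft_threshold(x + u, lam / rho)
        # Scaled dual update.
        u = u + x - z
    return z
```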
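The Open Datasets row cites the 20-state chain walk of Lagoudakis & Parr (2003). A minimal simulator is sketched below; the specific dynamics (actions succeed with probability 0.9 and a reward of 1 is received in the two boundary states) are assumptions drawn from common descriptions of the benchmark, not from the paper itself.

```python
import numpy as np

class ChainWalk:
    """Sketch of a chain-walk MDP in the style of Lagoudakis & Parr (2003).

    Assumed dynamics (not taken from the paper): actions 0 (left) and 1 (right)
    move in the intended direction with probability 0.9 and the opposite
    direction otherwise; a reward of 1 is received in the two boundary states.
    """

    def __init__(self, n_states=20, slip=0.1, seed=0):
        self.n_states = n_states
        self.slip = slip
        self.rng = np.random.default_rng(seed)
        self.state = int(self.rng.integers(n_states))

    def reset(self):
        self.state = int(self.rng.integers(self.n_states))
        return self.state

    def step(self, action):
        direction = 1 if action == 1 else -1
        if self.rng.random() < self.slip:
            direction = -direction
        self.state = int(np.clip(self.state + direction, 0, self.n_states - 1))
        reward = 1.0 if self.state in (0, self.n_states - 1) else 0.0
        return self.state, reward
```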
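The Experiment Setup row describes collecting 1000 off-policy training samples by following a random policy for 100 episodes of 10 steps each. A sketch of that collection loop, reusing the hypothetical ChainWalk class above, follows; the (state, action, reward, next_state) tuple layout is an assumption for illustration.

```python
import numpy as np

def collect_offpolicy_samples(env, n_episodes=100, steps_per_episode=10,
                              n_actions=2, seed=0):
    """Collect (state, action, reward, next_state) tuples under a uniform random policy.

    With 100 episodes of 10 steps each this yields the 1000 training samples
    mentioned in the Experiment Setup row.
    """
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_episodes):
        s = env.reset()
        for _ in range(steps_per_episode):
            a = int(rng.integers(n_actions))  # uniform random behavior policy
            s_next, r = env.step(a)
            samples.append((s, a, r, s_next))
            s = s_next
    return samples

# Example usage (hypothetical):
# env = ChainWalk(n_states=20)
# data = collect_offpolicy_samples(env)  # 1000 transitions
```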