Sparse Reinforcement Learning via Convex Optimization

Authors: Zhiwei Qin, Weichang Li, Firdaus Janoos

ICML 2014 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the performance of these algorithms through several classical examples. We demonstrate through experiment results that regularized PBR minimization might be preferable over regularized minimization of the fixed-point slack for value function approximation. Section 5. Experiments: We compared the proposed ADMM algorithm for BPDN-based sparse reinforcement learning with LSTD and DLSTD (Geist et al., 2012) on two test problems, which are described in the next two sections. Figures 1 and 2 show the comparison among BPDN, DLSTD, and LSTD on the approximation error (RMSE) of the Q value function and the simulated reward of the learned policies for two cases of irrelevant features. The average CPU times for BPDN on one instance of value approximation are 11.4s for Nnoise = 500 and 15.7s for Nnoise = 1000, and those for D-LSTD are 6.6s and 31.7s respectively.
Researcher Affiliation | Collaboration | Zhiwei (Tony) Qin (TQIN@WALMARTLABS.COM), @Walmart Labs, 850 Cherry Ave, San Bruno, CA 94066; Weichang Li and Firdaus Janoos (WEICHANG.LI, FIRDAUS.JANOOS@EXXONMOBIL.COM), Exxon Mobil Corporate Strategic Research, 1545 Rt. 22 East, Annandale, NJ 08801. The work was completed when the first author was a summer intern with Exxon Mobil Corporate Strategic Research in 2012, working towards a Ph.D. degree at Columbia University, New York.
Pseudocode | Yes | Algorithm 1 ADMM-BPDN and Algorithm 2 RDA with LMS update are provided in Section 4. (A hedged ADMM sketch appears after this table.)
Open Source Code | No | The paper does not state that the authors release open-source code for their methodology. It mentions: “Special thanks to Bo Liu and Ji Liu for sharing the sample ROTD code”, which indicates they used someone else’s code, not that they provided their own.
Open Datasets | Yes | Section 5.1.1, CHAIN WALK: This is the classical 20-state test example from (Lagoudakis & Parr, 2003). (A hedged simulator sketch appears after this table.)
Dataset Splits | No | The paper mentions collecting training samples (e.g., 1000 samples gathered off-policy over 100 episodes of 10 steps each) but does not describe any explicit train/validation/test splits.
Hardware Specification | No | The paper mentions “average CPU times” but does not specify the CPU model, GPU, or any other detailed hardware specifications used for the experiments.
Software Dependencies | No | The paper mentions specific software like the “Mosek LP solver” and references algorithms (LARS, FISTA, FPC-BB) but does not provide version numbers for any software dependencies used in the experiments.
Experiment Setup | Yes | For approximating the optimal value function, we collected 1000 training samples off-policy by following a random policy for 100 episodes, 10 steps each. For the Star example, we set ρ = 0 to verify the convergence of the PBR to zero, and hence, the correctness of the algorithm. (A sketch of this collection loop appears below.)
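
The Pseudocode row above refers to Algorithm 1 (ADMM-BPDN). The paper's exact splitting and stopping criteria are not reproduced here; the following is a minimal sketch of ADMM applied to an l1-regularized least-squares (LASSO-style) surrogate of BPDN, min_x 0.5*||Ax - b||^2 + lambda*||x||_1, where the variable names, the penalty parameter rho, and the iteration count are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Elementwise soft-thresholding, the proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def admm_l1_least_squares(A, b, lam, rho=1.0, n_iter=200):
    """ADMM sketch for min_x 0.5*||A x - b||_2^2 + lam*||x||_1.

    Not the paper's Algorithm 1 (which addresses the constrained BPDN form),
    but the same x/z splitting with a scaled dual variable u.
    """
    n = A.shape[1]
    x = np.zeros(n)
    z = np.zeros(n)
    u = np.zeros(n)
    # Cache the Cholesky factor used in every x-update.
    AtA = A.T @ A
    Atb = A.T @ b
    L = np.linalg.cholesky(AtA + rho * np.eye(n))
    for _ in range(n_iter):
        # x-update: ridge-regularized least-squares step.
        rhs = Atb + rho * (z - u)
        x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))
        # z-update: proximal (soft-thresholding) step enforcing sparsity.
        z = soft_threshold(x + u, lam / rho)
        # Scaled dual update.
        u = u + x - z
    return z
```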
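The Open Datasets row cites the 20-state chain walk of Lagoudakis & Parr (2003). A minimal simulator is sketched below; the specific dynamics (actions succeed with probability 0.9 and a reward of 1 is received in the two boundary states) are assumptions drawn from common descriptions of the benchmark, not from the paper itself.

```python
import numpy as np

class ChainWalk:
    """Sketch of a chain-walk MDP in the style of Lagoudakis & Parr (2003).

    Assumed dynamics (not taken from the paper): actions 0 (left) and 1 (right)
    move in the intended direction with probability 0.9 and the opposite
    direction otherwise; a reward of 1 is received in the two boundary states.
    """

    def __init__(self, n_states=20, slip=0.1, seed=0):
        self.n_states = n_states
        self.slip = slip
        self.rng = np.random.default_rng(seed)
        self.state = int(self.rng.integers(n_states))

    def reset(self):
        self.state = int(self.rng.integers(self.n_states))
        return self.state

    def step(self, action):
        direction = 1 if action == 1 else -1
        if self.rng.random() < self.slip:
            direction = -direction
        self.state = int(np.clip(self.state + direction, 0, self.n_states - 1))
        reward = 1.0 if self.state in (0, self.n_states - 1) else 0.0
        return self.state, reward
```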
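The Experiment Setup row describes collecting 1000 off-policy training samples by following a random policy for 100 episodes of 10 steps each. A sketch of that collection loop, reusing the hypothetical ChainWalk class above, follows; the (state, action, reward, next_state) tuple layout is an assumption for illustration.

```python
import numpy as np

def collect_offpolicy_samples(env, n_episodes=100, steps_per_episode=10,
                              n_actions=2, seed=0):
    """Collect (state, action, reward, next_state) tuples under a uniform random policy.

    With 100 episodes of 10 steps each this yields the 1000 training samples
    mentioned in the Experiment Setup row.
    """
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_episodes):
        s = env.reset()
        for _ in range(steps_per_episode):
            a = int(rng.integers(n_actions))  # uniform random behavior policy
            s_next, r = env.step(a)
            samples.append((s, a, r, s_next))
            s = s_next
    return samples

# Example usage (hypothetical):
# env = ChainWalk(n_states=20)
# data = collect_offpolicy_samples(env)  # 1000 transitions
```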