Exponential Family Model-Based Reinforcement Learning via Score Matching

Authors: Gene Li, Junbo Li, Anmol Kabra, Nati Srebro, Zhaoran Wang, Zhuoran Yang

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose an optimistic model-based algorithm, dubbed SMRL, for finite-horizon episodic reinforcement learning (RL) when the transition model is specified by exponential family distributions with d parameters and the reward is bounded and known. SMRL achieves Õ(d√(H³T)) online regret... We demonstrate end-to-end benefits of using score matching in a (highly stylized) synthetic MDP; see Figure 1. (A sketch of score-matching estimation for such a model appears after this table.)
Researcher Affiliation | Academia | Gene Li, Toyota Technological Institute at Chicago, gene@ttic.edu; Junbo Li, UC Santa Cruz, jli753@ucsc.edu; Anmol Kabra, Toyota Technological Institute at Chicago, anmol@ttic.edu; Nathan Srebro, Toyota Technological Institute at Chicago, nati@ttic.edu; Zhaoran Wang, Northwestern University, zhaoranwang@gmail.com; Zhuoran Yang, Yale University, zhuoran.yang@yale.edu
Pseudocode | Yes | Algorithm 1 Score Matching for RL (SMRL)
Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
Open Datasets | No | We demonstrate end-to-end benefits of using score matching in a (highly stylized) synthetic MDP; see Figure 1.
Dataset Splits | No | No specific training, validation, or test dataset splits are mentioned. The paper uses a synthetic MDP setup rather than a pre-existing dataset with standard splits.
Hardware Specification | No | Experiments ran on a laptop; no detailed hardware specification is provided.
Software Dependencies | No | No specific software dependencies with version numbers are provided.
Experiment Setup | Yes | Figure 1: Comparing SM vs fitting an LDS for a synthetic MDP, with S = ℝ, A = {+1, −1}, H = 10, initial state distribution Unif([−1, +1]), P(s′|s, a) = exp(−|s′|^1.7/1.7) exp(sin(4s′)(s + a)), and r(s, a) = exp(−10(s − π/8)²) + exp(−10(s + 3π/8)²)... To enable fair comparison, we fix a simple random sampling shooting planner [39] and evaluate three model estimation procedures. (A simulation sketch of this setup appears after this table.)
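
SMRL estimates an exponential family transition model by score matching, which turns parameter estimation into a ridge-regression-style closed form. The sketch below is a minimal illustration, not the authors' code: it assumes a one-dimensional conditional family p(s′|s, a) ∝ exp(θ·ψ(s′, s, a) − |s′|^1.7/1.7) with a hand-picked feature map ψ matching the synthetic MDP quoted above; the feature map, the regularizer lam, and the function names are assumptions made for illustration.

```python
import numpy as np

# Illustrative sketch (not the paper's code): score matching for a 1-D
# conditional exponential family
#   p(s' | s, a)  proportional to  exp(theta * psi(s', s, a) - |s'|^1.7 / 1.7).
# The Hyvarinen score-matching objective is quadratic in theta for this family,
# so the estimate has a ridge-regression-style closed form.

def dpsi(sp, s, a):
    """d/ds' of the assumed feature map psi(s', s, a) = sin(4 s') * (s + a)."""
    return np.array([4.0 * np.cos(4.0 * sp) * (s + a)])

def d2psi(sp, s, a):
    """d^2/ds'^2 of the assumed feature map."""
    return np.array([-16.0 * np.sin(4.0 * sp) * (s + a)])

def base_score(sp):
    """d/ds' of the log base measure -|s'|^1.7 / 1.7."""
    return -np.sign(sp) * np.abs(sp) ** 0.7

def score_matching_estimate(transitions, lam=1e-3):
    """Estimate theta from (s, a, s') triples by minimizing the empirical
    score-matching loss
        (1/n) sum_i [ 0.5 * (theta^T dpsi_i + base_score_i)^2 + theta^T d2psi_i ]
    plus a ridge penalty, which gives theta_hat = -(A + lam * I)^{-1} c."""
    d = 1
    A, c, n = np.zeros((d, d)), np.zeros(d), 0
    for s, a, sp in transitions:
        g = dpsi(sp, s, a)
        A += np.outer(g, g)                        # curvature term: mean of dpsi dpsi^T
        c += base_score(sp) * g + d2psi(sp, s, a)  # linear term: mean of b' dpsi + d2psi
        n += 1
    return -np.linalg.solve(A / n + lam * np.eye(d), c / n)
```

Under the assumed feature map, the synthetic MDP quoted above corresponds to θ = 1, so on transitions sampled from that MDP the estimate should approach 1 as the sample size grows.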
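The Experiment Setup row describes a one-dimensional synthetic MDP paired with a random sampling shooting planner. Below is a minimal simulation sketch under stated assumptions: next states are drawn approximately by discretizing the unnormalized transition density on a grid, and the planner scores randomly drawn action sequences by Monte Carlo rollouts of a supplied dynamics model. The grid range, candidate and rollout counts, and the use of the true dynamics as the planner's model are illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 10                                  # horizon, as in the quoted setup
ACTIONS = np.array([+1.0, -1.0])        # A = {+1, -1}

def reward(s, a):
    # r(s, a) = exp(-10 (s - pi/8)^2) + exp(-10 (s + 3 pi/8)^2)
    return np.exp(-10.0 * (s - np.pi / 8) ** 2) + np.exp(-10.0 * (s + 3 * np.pi / 8) ** 2)

def sample_next_state(s, a, grid=np.linspace(-3.0, 3.0, 2001)):
    # Unnormalized density exp(-|s'|^1.7 / 1.7 + sin(4 s') (s + a)),
    # sampled approximately on a fixed grid (illustrative choice).
    logp = -np.abs(grid) ** 1.7 / 1.7 + np.sin(4.0 * grid) * (s + a)
    p = np.exp(logp - logp.max())
    return rng.choice(grid, p=p / p.sum())

def random_shooting(s, step_fn, n_candidates=100, n_rollouts=3):
    """Score random length-H action sequences by Monte Carlo rollouts of
    step_fn (a dynamics model) and return the first action of the best one."""
    best_ret, best_a0 = -np.inf, ACTIONS[0]
    for _ in range(n_candidates):
        seq = rng.choice(ACTIONS, size=H)
        ret = 0.0
        for _ in range(n_rollouts):
            cur = s
            for a in seq:
                ret += reward(cur, a)
                cur = step_fn(cur, a)
        ret /= n_rollouts
        if ret > best_ret:
            best_ret, best_a0 = ret, seq[0]
    return best_a0

# One episode, using the true dynamics as the planner's model for brevity.
s, total = rng.uniform(-1.0, 1.0), 0.0  # initial state ~ Unif([-1, +1])
for _ in range(H):
    a = random_shooting(s, sample_next_state)
    total += reward(s, a)
    s = sample_next_state(s, a)
print("episode return:", total)
```

In the paper's comparison the same fixed planner would instead roll out learned models (for example the score-matching estimate versus an LDS fit), so that differences in return reflect model quality rather than the planner.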