Exponential Family Model-Based Reinforcement Learning via Score Matching
Authors: Gene Li, Junbo Li, Anmol Kabra, Nati Srebro, Zhaoran Wang, Zhuoran Yang
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose an optimistic model-based algorithm, dubbed SMRL, for finite-horizon episodic reinforcement learning (RL) when the transition model is specified by exponential family distributions with d parameters and the reward is bounded and known. SMRL achieves Õ(d√(H³T)) online regret... We demonstrate end-to-end benefits of using score matching in a (highly stylized) synthetic MDP; see Figure 1. |
| Researcher Affiliation | Academia | Gene Li Toyota Technological Institute at Chicago gene@ttic.edu Junbo Li UC Santa Cruz jli753@ucsc.edu Anmol Kabra Toyota Technological Institute at Chicago anmol@ttic.edu Nathan Srebro Toyota Technological Institute at Chicago nati@ttic.edu Zhaoran Wang Northwestern University zhaoranwang@gmail.com Zhuoran Yang Yale University zhuoran.yang@yale.edu |
| Pseudocode | Yes | Algorithm 1 Score Matching for RL (SMRL) |
| Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] |
| Open Datasets | No | We demonstrate end-to-end benefits of using score matching in a (highly stylized) synthetic MDP; see Figure 1. |
| Dataset Splits | No | No specific training, validation, or test dataset splits are mentioned. The paper uses a synthetic MDP setup rather than a pre-existing dataset with standard splits. |
| Hardware Specification | No | The paper states only that experiments were run on a laptop; no further hardware details are given. |
| Software Dependencies | No | No specific software dependencies with version numbers are provided. |
| Experiment Setup | Yes | Figure 1: Comparing SM vs fitting an LDS for a synthetic MDP, with S = ℝ, A = {+1, −1}, H = 10, initial state distribution Unif([−1, +1]), P(s′\|s, a) ∝ exp(−\|s′\|^1.7/1.7) exp(sin(4s′)(s + a)), and r(s, a) = exp(−10(s − π/8)²) + exp(−10(s + 3π/8)²)... To enable fair comparison, we fix a simple random sampling shooting planner [39] and evaluate three model estimation procedures. |
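The paper's core estimation tool is score matching for exponential family transition models. The sketch below is not the authors' SMRL algorithm; it only illustrates the underlying Hyvärinen score matching idea in one dimension, for a density p_θ(x) ∝ exp(θ·ψ(x)), where the objective is quadratic in θ and admits a closed-form minimizer. The sufficient statistic ψ and the Gaussian sanity check are assumptions chosen for illustration.

```python
import numpy as np

def score_matching_fit(x, psi_prime, psi_dblprime):
    """Closed-form Hyvarinen score matching for a 1-D exponential family
    p_theta(x) ∝ exp(theta . psi(x)).

    The score matching objective E[0.5 * (theta^T psi'(x))^2 + theta^T psi''(x)]
    is quadratic in theta, so the minimizer is
        theta = -E[psi'(x) psi'(x)^T]^{-1} E[psi''(x)].
    No normalizing constant is ever computed.
    """
    P1 = psi_prime(x)        # shape (n, d): first derivatives of psi
    P2 = psi_dblprime(x)     # shape (n, d): second derivatives of psi
    A = P1.T @ P1 / len(x)   # empirical E[psi' psi'^T]
    b = P2.mean(axis=0)      # empirical E[psi'']
    return -np.linalg.solve(A, b)

# Sanity check on a Gaussian N(mu, 1): with psi(x) = (x, -x^2/2),
# the true parameter is theta = (mu, 1).
rng = np.random.default_rng(0)
x = rng.normal(1.5, 1.0, size=50_000)
psi_p = lambda x: np.stack([np.ones_like(x), -x], axis=1)
psi_pp = lambda x: np.stack([np.zeros_like(x), -np.ones_like(x)], axis=1)
theta = score_matching_fit(x, psi_p, psi_pp)  # close to (1.5, 1.0)
```

The appeal for RL, as the paper argues, is exactly this closed-form, normalization-free fit: the partition function of the transition model never needs to be evaluated.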
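The synthetic MDP from Figure 1 can be reconstructed from the quoted setup. The sketch below is an illustrative reimplementation, not the authors' code: the grid-discretized sampler for the continuous 1-D transition density and the grid bounds are assumptions made to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_density(s_next, s, a):
    # Unnormalized log transition density from the Figure 1 setup:
    #   P(s'|s, a) ∝ exp(-|s'|^1.7 / 1.7) * exp(sin(4 s') * (s + a))
    return -np.abs(s_next) ** 1.7 / 1.7 + np.sin(4 * s_next) * (s + a)

def reward(s, a):
    # r(s, a) = exp(-10 (s - pi/8)^2) + exp(-10 (s + 3 pi/8)^2)
    return np.exp(-10 * (s - np.pi / 8) ** 2) + np.exp(-10 * (s + 3 * np.pi / 8) ** 2)

def sample_next_state(s, a, grid=np.linspace(-4.0, 4.0, 2001)):
    # Approximate sampling: normalize the density on a fine grid over
    # a truncated state space (an assumption for this sketch).
    logp = log_density(grid, s, a)
    p = np.exp(logp - logp.max())
    p /= p.sum()
    return rng.choice(grid, p=p)

# Roll out one H = 10 episode with uniformly random actions from A = {+1, -1}.
H = 10
s = rng.uniform(-1.0, 1.0)        # initial state ~ Unif([-1, +1])
total = 0.0
for _ in range(H):
    a = rng.choice([-1.0, +1.0])
    total += reward(s, a)
    s = sample_next_state(s, a)
```

Note the two reward bumps near s = π/8 and s = −3π/8 mean a planner only benefits from a model that captures the multimodal sin(4s′) structure of the transitions, which is the point of the SM-vs-LDS comparison.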