Exponential Family Model-Based Reinforcement Learning via Score Matching

Authors: Gene Li, Junbo Li, Anmol Kabra, Nati Srebro, Zhaoran Wang, Zhuoran Yang

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose an optimistic model-based algorithm, dubbed SMRL, for finite-horizon episodic reinforcement learning (RL) when the transition model is specified by exponential family distributions with d parameters and the reward is bounded and known. SMRL achieves Õ(d√(H³T)) online regret... We demonstrate end-to-end benefits of using score matching in a (highly stylized) synthetic MDP; see Figure 1. (A sketch of score-matching estimation for such a model appears after this table.)
Researcher Affiliation | Academia | Gene Li, Toyota Technological Institute at Chicago, gene@ttic.edu; Junbo Li, UC Santa Cruz, jli753@ucsc.edu; Anmol Kabra, Toyota Technological Institute at Chicago, anmol@ttic.edu; Nathan Srebro, Toyota Technological Institute at Chicago, nati@ttic.edu; Zhaoran Wang, Northwestern University, zhaoranwang@gmail.com; Zhuoran Yang, Yale University, zhuoran.yang@yale.edu
Pseudocode | Yes | Algorithm 1 Score Matching for RL (SMRL)
Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
Open Datasets | No | We demonstrate end-to-end benefits of using score matching in a (highly stylized) synthetic MDP; see Figure 1.
Dataset Splits | No | No specific training, validation, or test dataset splits are mentioned. The paper uses a synthetic MDP setup rather than a pre-existing dataset with standard splits.
Hardware Specification | No | Experiments ran on a laptop; no detailed hardware specification is provided.
Software Dependencies | No | No specific software dependencies with version numbers are provided.
Experiment Setup | Yes | Figure 1: Comparing SM vs fitting an LDS for a synthetic MDP, with S = ℝ, A = {+1, −1}, H = 10, initial state distribution Unif([−1, +1]), P(s′|s, a) = exp(−|s′|^1.7/1.7) exp(sin(4s′)(s + a)), and r(s, a) = exp(−10(s − π/8)²) + exp(−10(s + 3π/8)²)... To enable fair comparison, we fix a simple random sampling shooting planner [39] and evaluate three model estimation procedures. (A simulation sketch of this setup appears after this table.)
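
SMRL estimates an exponential family transition model by score matching, which turns parameter estimation into a ridge-regression-style closed form. The sketch below is a minimal illustration, not the authors' code: it assumes a one-dimensional conditional family p(s′|s, a) ∝ exp(θ·ψ(s′, s, a) − |s′|^1.7/1.7) with a hand-picked feature map ψ matching the synthetic MDP quoted above; the feature map, the regularizer lam, and the function names are assumptions made for illustration.

```python
import numpy as np

# Illustrative sketch (not the paper's code): score matching for a 1-D
# conditional exponential family
#   p(s' | s, a)  proportional to  exp(theta * psi(s', s, a) - |s'|^1.7 / 1.7).
# The Hyvarinen score-matching objective is quadratic in theta for this family,
# so the estimate has a ridge-regression-style closed form.

def dpsi(sp, s, a):
    """d/ds' of the assumed feature map psi(s', s, a) = sin(4 s') * (s + a)."""
    return np.array([4.0 * np.cos(4.0 * sp) * (s + a)])

def d2psi(sp, s, a):
    """d^2/ds'^2 of the assumed feature map."""
    return np.array([-16.0 * np.sin(4.0 * sp) * (s + a)])

def base_score(sp):
    """d/ds' of the log base measure -|s'|^1.7 / 1.7."""
    return -np.sign(sp) * np.abs(sp) ** 0.7

def score_matching_estimate(transitions, lam=1e-3):
    """Estimate theta from (s, a, s') triples by minimizing the empirical
    score-matching loss
        (1/n) sum_i [ 0.5 * (theta^T dpsi_i + base_score_i)^2 + theta^T d2psi_i ]
    plus a ridge penalty, which gives theta_hat = -(A + lam * I)^{-1} c."""
    d = 1
    A, c, n = np.zeros((d, d)), np.zeros(d), 0
    for s, a, sp in transitions:
        g = dpsi(sp, s, a)
        A += np.outer(g, g)                        # curvature term: mean of dpsi dpsi^T
        c += base_score(sp) * g + d2psi(sp, s, a)  # linear term: mean of b' dpsi + d2psi
        n += 1
    return -np.linalg.solve(A / n + lam * np.eye(d), c / n)
```

Under the assumed feature map, the synthetic MDP quoted above corresponds to θ = 1, so on transitions sampled from that MDP the estimate should approach 1 as the sample size grows.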
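The Experiment Setup row describes a one-dimensional synthetic MDP paired with a random sampling shooting planner. Below is a minimal simulation sketch under stated assumptions: next states are drawn approximately by discretizing the unnormalized transition density on a grid, and the planner scores randomly drawn action sequences by Monte Carlo rollouts of a supplied dynamics model. The grid range, candidate and rollout counts, and the use of the true dynamics as the planner's model are illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 10                                  # horizon, as in the quoted setup
ACTIONS = np.array([+1.0, -1.0])        # A = {+1, -1}

def reward(s, a):
    # r(s, a) = exp(-10 (s - pi/8)^2) + exp(-10 (s + 3 pi/8)^2)
    return np.exp(-10.0 * (s - np.pi / 8) ** 2) + np.exp(-10.0 * (s + 3 * np.pi / 8) ** 2)

def sample_next_state(s, a, grid=np.linspace(-3.0, 3.0, 2001)):
    # Unnormalized density exp(-|s'|^1.7 / 1.7 + sin(4 s') (s + a)),
    # sampled approximately on a fixed grid (illustrative choice).
    logp = -np.abs(grid) ** 1.7 / 1.7 + np.sin(4.0 * grid) * (s + a)
    p = np.exp(logp - logp.max())
    return rng.choice(grid, p=p / p.sum())

def random_shooting(s, step_fn, n_candidates=100, n_rollouts=3):
    """Score random length-H action sequences by Monte Carlo rollouts of
    step_fn (a dynamics model) and return the first action of the best one."""
    best_ret, best_a0 = -np.inf, ACTIONS[0]
    for _ in range(n_candidates):
        seq = rng.choice(ACTIONS, size=H)
        ret = 0.0
        for _ in range(n_rollouts):
            cur = s
            for a in seq:
                ret += reward(cur, a)
                cur = step_fn(cur, a)
        ret /= n_rollouts
        if ret > best_ret:
            best_ret, best_a0 = ret, seq[0]
    return best_a0

# One episode, using the true dynamics as the planner's model for brevity.
s, total = rng.uniform(-1.0, 1.0), 0.0  # initial state ~ Unif([-1, +1])
for _ in range(H):
    a = random_shooting(s, sample_next_state)
    total += reward(s, a)
    s = sample_next_state(s, a)
print("episode return:", total)
```

In the paper's comparison the same fixed planner would instead roll out learned models (for example the score-matching estimate versus an LDS fit), so that differences in return reflect model quality rather than the planner.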