Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Exponential Family Model-Based Reinforcement Learning via Score Matching
Authors: Gene Li, Junbo Li, Anmol Kabra, Nati Srebro, Zhaoran Wang, Zhuoran Yang
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose an optimistic model-based algorithm, dubbed SMRL, for finite-horizon episodic reinforcement learning (RL) when the transition model is specified by exponential family distributions with $d$ parameters and the reward is bounded and known. SMRL achieves $\tilde{O}(d\sqrt{H^3 T})$ online regret... We demonstrate end-to-end benefits of using score matching in a (highly stylized) synthetic MDP; see Figure 1. |
| Researcher Affiliation | Academia | Gene Li (Toyota Technological Institute at Chicago); Junbo Li (UC Santa Cruz); Anmol Kabra (Toyota Technological Institute at Chicago); Nathan Srebro (Toyota Technological Institute at Chicago); Zhaoran Wang (Northwestern University); Zhuoran Yang (Yale University) |
| Pseudocode | Yes | Algorithm 1 Score Matching for RL (SMRL) |
| Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] |
| Open Datasets | No | We demonstrate end-to-end benefits of using score matching in a (highly stylized) synthetic MDP; see Figure 1. |
| Dataset Splits | No | No specific training, validation, or test dataset splits are mentioned. The paper uses a synthetic MDP setup rather than a pre-existing dataset with standard splits. |
| Hardware Specification | No | Experiments ran on laptop. |
| Software Dependencies | No | No specific software dependencies with version numbers are provided. |
| Experiment Setup | Yes | Figure 1: Comparing SM vs fitting an LDS for a synthetic MDP, with $\mathcal{S} = \mathbb{R}$, $\mathcal{A} = \{+1, -1\}$, $H = 10$, initial state distribution $\mathrm{Unif}([-1, +1])$, $P(s' \mid s, a) = \exp(-s'^{1.7}/1.7) \cdot \exp(\sin(4s')(s + a))$, and $r(s, a) = \exp(-10(s - \pi/8)^2) + \exp(-10(s + 3\pi/8)^2)$... To enable fair comparison, we fix a simple random sampling shooting planner [39] and evaluate three model estimation procedures. |
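
The synthetic MDP quoted in the Experiment Setup row can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code. It treats the transition expression as an unnormalized density, uses $|s'|$ inside the power so that it is defined for negative states, and samples next states from a truncated grid. The names `GRID`, `sample_next_state`, `random_shooting_action`, and `run_episode` are hypothetical. The planner here queries the true model, whereas the paper plans with estimated models.

```python
import numpy as np

ACTIONS = np.array([+1.0, -1.0])      # A = {+1, -1}
HORIZON = 10                          # H = 10
GRID = np.linspace(-8.0, 8.0, 2001)   # assumption: truncate the real line for sampling


def reward(s):
    """r(s, a) = exp(-10 (s - pi/8)^2) + exp(-10 (s + 3 pi/8)^2); it does not depend on a."""
    return np.exp(-10.0 * (s - np.pi / 8) ** 2) + np.exp(-10.0 * (s + 3 * np.pi / 8) ** 2)


def log_unnormalized_density(s_next, s, a):
    """Log of exp(-|s'|^1.7 / 1.7) * exp(sin(4 s') (s + a)).

    Assumption: |s'| replaces s' so the fractional power is real-valued.
    """
    return -np.abs(s_next) ** 1.7 / 1.7 + np.sin(4.0 * s_next) * (s + a)


def sample_next_state(s, a, rng):
    """Draw s' from the density above, normalized numerically over GRID."""
    logp = log_unnormalized_density(GRID, s, a)
    p = np.exp(logp - logp.max())     # subtract the max for numerical stability
    p /= p.sum()
    return rng.choice(GRID, p=p)


def random_shooting_action(s, steps_left, rng, n_candidates=64):
    """Return the first action of the best of n_candidates random action sequences."""
    best_return, best_action = -np.inf, ACTIONS[0]
    for _ in range(n_candidates):
        seq = rng.choice(ACTIONS, size=steps_left)
        state, total = s, 0.0
        for a in seq:
            total += reward(state)
            state = sample_next_state(state, a, rng)
        if total > best_return:
            best_return, best_action = total, seq[0]
    return best_action


def run_episode(rng):
    """Roll out one episode of length HORIZON from s_1 ~ Unif([-1, +1])."""
    s = rng.uniform(-1.0, 1.0)
    total = 0.0
    for h in range(HORIZON):
        a = random_shooting_action(s, HORIZON - h, rng)
        total += reward(s)
        s = sample_next_state(s, a, rng)
    return total
```

Calling `run_episode(np.random.default_rng(0))` returns the total reward of one episode. Replacing `sample_next_state` inside the planner with a fitted model would be one way to compare estimation procedures under a fixed planner, as the quoted setup describes.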