Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Exponential Family Model-Based Reinforcement Learning via Score Matching

Authors: Gene Li, Junbo Li, Anmol Kabra, Nati Srebro, Zhaoran Wang, Zhuoran Yang

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We propose an optimistic model-based algorithm, dubbed SMRL, for finite-horizon episodic reinforcement learning (RL) when the transition model is specified by exponential family distributions with d parameters and the reward is bounded and known. SMRL achieves Õ(d√(H³T)) online regret... We demonstrate end-to-end benefits of using score matching in a (highly stylized) synthetic MDP; see Figure 1."
Researcher Affiliation | Academia | Gene Li (Toyota Technological Institute at Chicago), Junbo Li (UC Santa Cruz), Anmol Kabra (Toyota Technological Institute at Chicago), Nathan Srebro (Toyota Technological Institute at Chicago), Zhaoran Wang (Northwestern University), Zhuoran Yang (Yale University)
Pseudocode | Yes | "Algorithm 1: Score Matching for RL (SMRL)"
Open Source Code | Yes | "Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]"
Open Datasets | No | "We demonstrate end-to-end benefits of using score matching in a (highly stylized) synthetic MDP; see Figure 1."
Dataset Splits | No | No specific training, validation, or test dataset splits are mentioned. The paper uses a synthetic MDP setup rather than a pre-existing dataset with standard splits.
Hardware Specification | No | "Experiments ran on laptop."
Software Dependencies | No | No specific software dependencies with version numbers are provided.
Experiment Setup | Yes | "Figure 1: Comparing SM vs. fitting an LDS for a synthetic MDP, with S = ℝ, A = {+1, −1}, H = 10, initial state distribution Unif([−1, +1]), P(s′ | s, a) ∝ exp(−|s′|^1.7 / 1.7) exp(sin(4s′)(s + a)), and r(s, a) = exp(−10(s − π/8)²) + exp(−10(s + 3π/8)²)... To enable fair comparison, we fix a simple random sampling shooting planner [39] and evaluate three model estimation procedures."
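The synthetic MDP quoted in the Experiment Setup row can be simulated directly. The sketch below is a hypothetical reconstruction, not the authors' code: all names are illustrative, the absolute value in |s′|^1.7 and the minus signs are assumed (extraction lost the sign characters), and next states are sampled by discretizing the unnormalized transition density on a grid rather than by any method from the paper.

```python
import numpy as np

def reward(s):
    # r(s, a) = exp(-10 (s - pi/8)^2) + exp(-10 (s + 3*pi/8)^2);
    # as written in the paper's setup it does not depend on the action.
    return np.exp(-10 * (s - np.pi / 8) ** 2) + np.exp(-10 * (s + 3 * np.pi / 8) ** 2)

def transition_density_unnorm(s_next, s, a):
    # Unnormalized exponential-family transition density:
    # p(s' | s, a) ∝ exp(-|s'|^1.7 / 1.7) * exp(sin(4 s') * (s + a))
    return np.exp(-np.abs(s_next) ** 1.7 / 1.7) * np.exp(np.sin(4 * s_next) * (s + a))

def sample_next_state(s, a, rng, grid=np.linspace(-5.0, 5.0, 2001)):
    # Sample s' by normalizing the density over a finite grid
    # (a discretization assumption made here for illustration only).
    w = transition_density_unnorm(grid, s, a)
    return rng.choice(grid, p=w / w.sum())

# Roll out one H = 10 episode under a uniformly random policy.
rng = np.random.default_rng(0)
s = rng.uniform(-1.0, 1.0)        # initial state ~ Unif([-1, +1])
traj_reward = 0.0
for _ in range(10):               # horizon H = 10
    a = rng.choice([+1, -1])
    traj_reward += reward(s)
    s = sample_next_state(s, a, rng)
```

Because both reward bumps sit near the origin while the transition density concentrates mass around s′ = 0, random rollouts of this kind give a simple baseline against which the paper's three model estimation procedures can be compared under the fixed shooting planner.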