Preference-based Reinforcement Learning with Finite-Time Guarantees
Authors: Yichong Xu, Ruosong Wang, Lin Yang, Aarti Singh, Artur Dubrawski
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We performed experiments in synthetic environments to compare PEPS with previous baselines. We consider two environments. Grid World: we implemented a simple Grid World on a 4×4 grid; the agent goes from the upper-left corner to the lower-right corner and can choose to go right or go down at each step. We randomly put a reward of 1/3 on three blocks in the grid, and the maximal total reward is 2/3. Random MDP: we followed the method in [25] but adapted it to our setting, considering an MDP with 20 states and 5 steps, with 4 states in each step. The transitions are sampled from a Dirichlet prior (with parameters all set to 0.1) and the rewards are sampled from an exponential prior with scale parameter λ = 5. The rewards are then shifted and normalized so that the minimum reward is 0 and the mean reward is 1. Figure 1 reports the experiment results comparing PEPS to the baselines (DPS & EPMC). (A sketch of the environment generation appears after this table.) |
| Researcher Affiliation | Collaboration | Yichong Xu1, Ruosong Wang2, Lin F. Yang3, Aarti Singh2, Artur Dubrawski2; 1Microsoft, 2Carnegie Mellon University, 3University of California, Los Angeles |
| Pseudocode | Yes | Algorithm 1 PPS: Preference-based Policy Search and Algorithm 2 PEPS: Preference-based Exploration and Policy Search |
| Open Source Code | No | The paper does not contain any explicit statement about releasing the source code for their proposed methodology, nor does it provide a direct link to a code repository. |
| Open Datasets | No | Grid World: We implemented a simple Grid World on a 4×4 grid. The agent goes from the upper-left corner to the lower-right corner and can choose to go right or go down at each step. We randomly put a reward of 1/3 on three blocks in the grid, and the maximal total reward is 2/3. Random MDP: We followed the method in [25] but adapted it to our setting. We consider an MDP with 20 states and 5 steps, with 4 states in each step. The transitions are sampled from a Dirichlet prior (with parameters all set to 0.1) and the rewards are sampled from an exponential prior with scale parameter λ = 5. The paper describes how the environments were implemented or generated, but does not provide access information (link, DOI, formal citation) to a publicly available dataset. |
| Dataset Splits | No | The paper describes experiments in simulated environments and reinforcement learning settings, which typically do not involve explicit training/validation/test dataset splits in the same way as supervised learning tasks. There is no mention of specific percentages, counts, or predefined splits for reproduction. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory, or cluster specifications) used to run the experiments. |
| Software Dependencies | No | The paper does not mention any specific software dependencies with version numbers (e.g., programming languages, libraries, frameworks, or solvers) that would be needed to replicate the experiments. |
| Experiment Setup | Yes | For both environments, we varied the budget N ∈ [2S, 8S], where S is the number of non-terminating states. The comparisons are generated following the Bradley-Terry-Luce model [9]: φ(τ1, τ2) = 1 / (1 + exp(−(r(τ1) − r(τ2))/c)), with c being either 0.001 or 1. In the first setting of c, the preferences are very close to deterministic, while comparisons between equal rewards are uniformly random; in the latter setting, the preferences are close to linear in the reward difference. We repeated each experiment 32 times and computed the mean and standard deviation. (A sketch of the BTL comparison oracle appears after this table.) |
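
Below is a minimal sketch of how the two synthetic environments described above could be generated, assuming Python with NumPy. The function names, the number of actions in the random MDP, and the reward-block placement routine are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch of the two synthetic environments (assumptions noted in comments).
import numpy as np


def make_grid_world(size=4, n_reward_blocks=3, block_reward=1.0 / 3, rng=None):
    """Grid World: the agent moves from the upper-left to the lower-right corner,
    choosing right or down at each step. Three randomly chosen blocks carry a
    reward of 1/3 each. (The paper states the placement makes the maximal total
    reward 2/3; this simple random placement does not enforce that.)"""
    rng = np.random.default_rng(rng)
    rewards = np.zeros((size, size))
    # Pick distinct cells, excluding the start cell at (0, 0).
    cells = rng.choice(size * size - 1, size=n_reward_blocks, replace=False) + 1
    rewards[cells // size, cells % size] = block_reward
    return rewards


def make_random_mdp(horizon=5, states_per_step=4, n_actions=2,
                    dirichlet_alpha=0.1, exp_scale=5.0, rng=None):
    """Random MDP: 20 states arranged as 5 steps with 4 states each.
    Transitions ~ Dirichlet(0.1, ..., 0.1); rewards ~ Exponential(scale=5),
    then shifted and normalized so the minimum reward is 0 and the mean is 1.
    The number of actions is an illustrative assumption."""
    rng = np.random.default_rng(rng)
    # P[h, s, a] is a distribution over the next layer's states.
    P = rng.dirichlet(np.full(states_per_step, dirichlet_alpha),
                      size=(horizon, states_per_step, n_actions))
    R = rng.exponential(scale=exp_scale, size=(horizon, states_per_step, n_actions))
    R = R - R.min()   # minimum reward is 0
    R = R / R.mean()  # mean reward is 1
    return P, R
```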
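
The comparison feedback in the experiment setup follows the Bradley-Terry-Luce model quoted above. A minimal sketch of such an oracle, assuming NumPy and an illustrative interface (the function name and arguments are not from the paper):

```python
# Minimal sketch of a Bradley-Terry-Luce comparison oracle (illustrative interface).
import numpy as np


def btl_compare(r_tau1, r_tau2, c=1.0, rng=None):
    """Return 1 if trajectory tau1 is preferred over tau2, else 0.
    phi(tau1, tau2) = 1 / (1 + exp(-(r(tau1) - r(tau2)) / c)).
    With c = 0.001 the preference is nearly deterministic (equal rewards are
    compared uniformly at random); with c = 1 it is close to linear in the
    reward difference."""
    rng = np.random.default_rng(rng)
    # Clip the exponent to avoid overflow warnings when c is very small.
    z = np.clip(-(r_tau1 - r_tau2) / c, -700.0, 700.0)
    phi = 1.0 / (1.0 + np.exp(z))
    return int(rng.random() < phi)
```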