Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Preference-based Reinforcement Learning with Finite-Time Guarantees
Authors: Yichong Xu, Ruosong Wang, Lin Yang, Aarti Singh, Artur Dubrawski
NeurIPS 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We performed experiments in synthetic environments to compare PEPS with previous baselines. We consider two environments. Grid World: We implemented a simple Grid World on a 4×4 grid. The agent goes from the upper left corner to the lower right corner and can choose to go right or go down at each step. We randomly put a reward of 1/3 on three blocks in the grid, and the maximal total reward is 2/3. Random MDP: We followed the method in [25] but adapted it to our setting. We consider an MDP with 20 states and 5 steps, with 4 states in each step. The transitions are sampled from a Dirichlet prior (with parameters all set to 0.1) and the rewards are sampled from an exponential prior with scale parameter λ = 5. The rewards are then shifted and normalized so that the minimum reward is 0 and the mean reward is 1. (Figure 1: Experiment results comparing PEPS to baselines, DPS & EPMC.) |
| Researcher Affiliation | Collaboration | Yichong Xu (1), Ruosong Wang (2), Lin F. Yang (3), Aarti Singh (2), Artur Dubrawski (2) — (1) Microsoft, (2) Carnegie Mellon University, (3) University of California, Los Angeles |
| Pseudocode | Yes | Algorithm 1 PPS: Preference-based Policy Search and Algorithm 2 PEPS: Preference-based Exploration and Policy Search |
| Open Source Code | No | The paper does not contain any explicit statement about releasing the source code for their proposed methodology, nor does it provide a direct link to a code repository. |
| Open Datasets | No | Grid World: We implemented a simple Grid World on a 4×4 grid. The agent goes from the upper left corner to the lower right corner and can choose to go right or go down at each step. We randomly put a reward of 1/3 on three blocks in the grid, and the maximal total reward is 2/3. Random MDP: We followed the method in [25] but adapted it to our setting. We consider an MDP with 20 states and 5 steps, with 4 states in each step. The transitions are sampled from a Dirichlet prior (with parameters all set to 0.1) and the rewards are sampled from an exponential prior with scale parameter λ = 5. The paper describes how the environments were implemented or generated, but does not provide access information (link, DOI, formal citation) to a publicly available dataset. |
| Dataset Splits | No | The paper describes experiments in simulated environments and reinforcement learning settings, which typically do not involve explicit training/validation/test dataset splits in the same way as supervised learning tasks. There is no mention of specific percentages, counts, or predefined splits for reproduction. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory, or cluster specifications) used to run the experiments. |
| Software Dependencies | No | The paper does not mention any specific software dependencies with version numbers (e.g., programming languages, libraries, frameworks, or solvers) that would be needed to replicate the experiments. |
| Experiment Setup | Yes | For both environments, we varied the budget N ∈ [2S, 8S], where S is the number of non-terminating states. The comparisons are generated following the Bradley-Terry-Luce model [9]: φ(τ1, τ2) = 1 / (1 + exp(−(r(τ1) − r(τ2))/c)), with c being either 0.001 or 1. In the first setting of c, the preferences are very close to deterministic, while comparisons between equal rewards are uniformly random; in the latter setting, the preferences are close to linear in the reward difference. We repeated each experiment 32 times and computed the mean and standard deviation. |
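The environment generation and comparison oracle described above can be sketched in a few lines. This is a hypothetical reconstruction from the table's quoted text, not the authors' code: the shapes, variable names, and the clipping used for numerical stability are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random MDP as described: 20 states over 5 steps (4 states per step),
# Dirichlet(0.1) transition priors, exponential rewards with scale λ = 5,
# then shifted/normalized so min reward is 0 and mean reward is 1.
n_steps, states_per_step = 5, 4
# transitions[h, s] is a distribution over the next step's states
transitions = rng.dirichlet(alpha=[0.1] * states_per_step,
                            size=(n_steps - 1, states_per_step))
rewards = rng.exponential(scale=5.0, size=(n_steps, states_per_step))
rewards = rewards - rewards.min()   # shift: minimum reward becomes 0
rewards = rewards / rewards.mean()  # normalize: mean reward becomes 1

def btl_prefers_first(r1, r2, c, rng):
    """Bradley-Terry-Luce comparison oracle: returns True if trajectory 1
    (total reward r1) is preferred over trajectory 2 (total reward r2).
    The argument of exp is clipped to avoid overflow for tiny c."""
    z = np.clip(-(r1 - r2) / c, -500.0, 500.0)
    p = 1.0 / (1.0 + np.exp(z))  # φ(τ1, τ2)
    return rng.random() < p
```

With c = 0.001 the oracle is nearly deterministic whenever the rewards differ, while ties are decided by a fair coin flip; with c = 1 the preference probability varies roughly linearly with the reward gap, matching the two regimes described in the setup.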