Preference-based Reinforcement Learning with Finite-Time Guarantees

Authors: Yichong Xu, Ruosong Wang, Lin Yang, Aarti Singh, Artur Dubrawski

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We performed experiments in synthetic environments to compare PEPS with previous baselines. We consider two environments. Grid World: we implemented a simple Grid World on a 4×4 grid. The agent travels from the upper-left corner to the lower-right corner and can choose to go right or down at each step. We randomly place a reward of 1/3 on three cells of the grid, and the maximal total reward is 2/3. Random MDP: we followed the method in [25] but adapted it to our setting. We consider an MDP with 20 states and 5 steps, with 4 states in each step. The transitions are sampled from a Dirichlet prior (with all parameters set to 0.1) and the rewards are sampled from an exponential prior with scale parameter λ = 5. The rewards are then shifted and normalized so that the minimum reward is 0 and the mean reward is 1. See also Figure 1: experiment results comparing PEPS to the baselines (DPS and EPMC). (Sketches of both environments appear after this table.)
Researcher Affiliation | Collaboration | Yichong Xu (Microsoft), Ruosong Wang (Carnegie Mellon University), Lin F. Yang (University of California, Los Angeles), Aarti Singh (Carnegie Mellon University), Artur Dubrawski (Carnegie Mellon University)
Pseudocode | Yes | Algorithm 1, PPS: Preference-based Policy Search, and Algorithm 2, PEPS: Preference-based Exploration and Policy Search
Open Source Code | No | The paper does not contain any explicit statement about releasing the source code for the proposed method, nor does it provide a direct link to a code repository.
Open Datasets | No | The paper describes how the two environments (the 4×4 Grid World and the Random MDP sampled from Dirichlet and exponential priors, as quoted under Research Type above) were implemented or generated, but it does not provide access information (link, DOI, or formal citation) for a publicly available dataset.
Dataset Splits | No | The paper describes experiments in simulated reinforcement learning environments, which typically do not involve explicit training/validation/test splits in the way supervised learning tasks do. There is no mention of specific percentages, counts, or predefined splits to reproduce.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory, or cluster specifications) used to run the experiments.
Software Dependencies | No | The paper does not mention any specific software dependencies with version numbers (e.g., programming languages, libraries, frameworks, or solvers) that would be needed to replicate the experiments.
Experiment Setup | Yes | For both environments, we varied the budget N ∈ [2S, 8S], where S is the number of non-terminating states. The comparisons are generated following the Bradley-Terry-Luce model [9]: φ(τ1, τ2) = 1 / (1 + exp(−(r(τ1) − r(τ2))/c)), with c set to either 0.001 or 1. In the first setting, the preferences are nearly deterministic, while comparisons between equal rewards are uniformly random; in the latter setting, the preferences are close to linear in the reward difference. We repeated each experiment 32 times and computed the mean and standard deviation. (A sketch of this comparison oracle appears after this table.)
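
The Grid World quoted under Research Type above is simple enough to reconstruct. Below is a minimal sketch of one possible implementation, assuming a fixed-length episode of right/down moves; the class and method names, the forced move at the boundary, and the rejection-sampling rule used to keep the maximal total reward at 2/3 are our reading of the description, not code from the paper.

```python
import numpy as np


def on_common_path(cells):
    """True if all cells lie on a single right/down path, i.e. they form a
    chain in the componentwise order (after a lexicographic sort, the column
    indices must be nondecreasing)."""
    cells = sorted(cells)
    return all(a[1] <= b[1] for a, b in zip(cells, cells[1:]))


class GridWorld:
    """4x4 Grid World as described in the paper: start at the upper-left
    corner, move right or down at each step, and three cells carry a reward
    of 1/3.  The paper states the maximal total reward is 2/3, so we re-sample
    placements until no single path covers all three reward cells (our
    interpretation of that statement)."""

    def __init__(self, size=4, seed=0):
        rng = np.random.default_rng(seed)
        self.size = size
        cells = [(i, j) for i in range(size) for j in range(size) if (i, j) != (0, 0)]
        while True:
            picked = [cells[k] for k in rng.choice(len(cells), size=3, replace=False)]
            if not on_common_path(picked):
                break
        self.reward_cells = set(picked)
        self.pos = (0, 0)

    def reset(self):
        self.pos = (0, 0)
        return self.pos

    def step(self, action):
        """action: 0 = right, 1 = down; at the boundary the only legal move is taken."""
        i, j = self.pos
        if j == self.size - 1:        # right edge: forced down
            i += 1
        elif i == self.size - 1:      # bottom edge: forced right
            j += 1
        elif action == 0:
            j += 1
        else:
            i += 1
        self.pos = (i, j)
        reward = 1.0 / 3.0 if self.pos in self.reward_cells else 0.0
        done = self.pos == (self.size - 1, self.size - 1)
        return self.pos, reward, done
```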
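
The Random MDP generation (Dirichlet transitions, exponential rewards shifted and normalized) can be sketched in the same spirit. The number of actions is not stated in the excerpt, so the `n_actions=4` default below is an assumption, and all names are ours.

```python
import numpy as np


def sample_random_mdp(n_steps=5, states_per_step=4, n_actions=4, seed=0):
    """Sketch of the random tabular MDP described in the paper: 5 steps with
    4 states per step (20 states total), transition rows drawn from a
    Dirichlet(0.1, ..., 0.1) prior, rewards drawn from an exponential prior
    with scale 5 and then shifted/normalized so that the minimum reward is 0
    and the mean reward is 1.  n_actions is an assumption (not given in the text)."""
    rng = np.random.default_rng(seed)

    # P[h, s, a] is a probability distribution over the states of step h + 1.
    alpha = 0.1 * np.ones(states_per_step)
    P = rng.dirichlet(alpha, size=(n_steps - 1, states_per_step, n_actions))

    # R[h, s, a] is the immediate reward at step h in state s under action a.
    R = rng.exponential(scale=5.0, size=(n_steps, states_per_step, n_actions))
    R = R - R.min()      # shift: minimum reward becomes 0
    R = R / R.mean()     # normalize: mean reward becomes 1
    return P, R
```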
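
Finally, the Bradley-Terry-Luce comparison oracle from the Experiment Setup row can be simulated as follows. This is a hedged sketch with our own function and argument names; the logistic is written via tanh purely for numerical stability with the small c = 0.001 setting.

```python
import numpy as np


def prefer_first(return_1, return_2, c=1.0, rng=None):
    """Return True if trajectory tau_1 beats tau_2 under the Bradley-Terry-Luce
    model phi(tau_1, tau_2) = 1 / (1 + exp(-(r(tau_1) - r(tau_2)) / c)).
    With c = 0.001 the preference is nearly deterministic (ties are broken
    uniformly at random); with c = 1 it is roughly linear in the reward gap."""
    rng = np.random.default_rng() if rng is None else rng
    # 1 / (1 + exp(-x)) == 0.5 * (1 + tanh(x / 2)); tanh avoids overflow.
    phi = 0.5 * (1.0 + np.tanh((return_1 - return_2) / (2.0 * c)))
    return rng.random() < phi
```

Feeding this oracle the total returns of two rollouts from either environment emulates the noisy preference feedback the algorithms receive in place of numeric rewards.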