Preferences Implicit in the State of the World

Authors: Rohin Shah, Dmitrii Krasheninnikov, Jordan Alexander, Pieter Abbeel, Anca Dragan

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 5 (Evaluation): Evaluation of RLSP is non-trivial. ... We created a suite of environments with a true reward R_true, a specified reward R_spec, Alice's first state s_{-T}, and the robot's initial state s_0, where R_spec ignores some aspect(s) of R_true. ... We inspect the inferred reward qualitatively and measure the expected amount of true reward obtained when planning with θ_final, as a fraction of the expected true reward from the optimal policy. ... Table 1: Performance of algorithms on environments designed to test particular properties.
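The metric quoted above (true reward obtained when planning with the inferred parameters, as a fraction of the optimal policy's true reward) can be summarized in a short sketch. This is a hypothetical illustration, not the paper's code: the `env` rollout interface and the `policy_from_reward` planner are assumed placeholders.

```python
import numpy as np

def normalized_true_return(env, policy_from_reward, theta_final, theta_true,
                           horizon, n_rollouts=100):
    """Sketch of the evaluation metric: expected true reward under the policy
    planned with theta_final, divided by the expected true reward of the
    policy planned with the true reward (i.e., the optimal policy)."""
    pi_final = policy_from_reward(env, theta_final, horizon)  # assumed planner
    pi_opt = policy_from_reward(env, theta_true, horizon)

    def expected_return(policy):
        returns = []
        for _ in range(n_rollouts):
            s, total = env.reset(), 0.0
            for _ in range(horizon):
                s, r_true, done = env.step(policy(s))  # env returns R_true
                total += r_true
                if done:
                    break
            returns.append(total)
        return float(np.mean(returns))

    return expected_return(pi_final) / expected_return(pi_opt)
```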
Researcher Affiliation | Academia | Rohin Shah (UC Berkeley), Dmitrii Krasheninnikov (University of Amsterdam), Jordan Alexander (Stanford University), Pieter Abbeel (UC Berkeley), Anca D. Dragan (UC Berkeley)
Pseudocode | Yes | Algorithm 1: MCMC sampling from the one-state IRL posterior
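The row above only names Algorithm 1 as MCMC sampling from the one-state IRL posterior, i.e. p(θ | s_obs) ∝ p(s_obs | θ) p(θ) with a Gaussian prior over θ. The sketch below is a generic Metropolis-Hastings loop under that reading; `log_p_state_given_theta` is an assumed callable, and the paper's actual proposal and prior details may differ.

```python
import numpy as np

def mcmc_one_state_irl(log_p_state_given_theta, s_obs, theta_init,
                       n_samples=10_000, proposal_std=0.1, prior_std=1.0):
    """Metropolis-Hastings sampling from p(theta | s_obs) proportional to
    p(s_obs | theta) * N(theta; 0, prior_std^2 I). Hypothetical sketch."""
    def log_posterior(theta):
        log_prior = -0.5 * np.sum(theta ** 2) / prior_std ** 2
        return log_p_state_given_theta(theta, s_obs) + log_prior

    theta = np.asarray(theta_init, dtype=float)
    log_post = log_posterior(theta)
    samples = []
    for _ in range(n_samples):
        # Symmetric Gaussian proposal around the current sample.
        proposal = theta + np.random.normal(scale=proposal_std, size=theta.shape)
        log_post_prop = log_posterior(proposal)
        # Accept with probability min(1, p(proposal) / p(theta)).
        if np.log(np.random.rand()) < log_post_prop - log_post:
            theta, log_post = proposal, log_post_prop
        samples.append(theta.copy())
    return np.array(samples)
```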
Open Source Code | Yes | Our code can be found at https://github.com/HumanCompatibleAI/rlsp.
Open Datasets | No | The paper describes custom-built 'proof-of-concept environments' such as 'Room with vase', 'Toy train', and 'Apple collection', but does not provide access information (links, DOIs, or formal citations) for publicly available datasets used for training or evaluation.
Dataset Splits | No | The paper does not specify training, validation, and test splits (e.g., percentages, sample counts, or an explicit splitting methodology) for its experiments.
Hardware Specification | No | The paper does not report the hardware (e.g., GPU or CPU models, processor types, or memory) used to run its experiments.
Software Dependencies | No | The paper does not list software dependencies or library versions (e.g., Python, PyTorch, TensorFlow, or specific RL frameworks with version numbers) used for the implementation.
Experiment Setup | Yes | We tune the hyperparameter λ controlling the tradeoff between R_spec and the human reward for all algorithms, including baselines. We use a Gaussian prior over the reward parameters. We vary the value of T assumed by RLSP, and report the true return achieved by π_RLSP obtained using the inferred reward and a fixed horizon for the robot to act. For the Bayesian method, we vary the standard deviation σ of the Gaussian prior over θ_Alice that is centered at θ_spec. ... We vary the parameter that controls the tradeoff and report the true reward obtained by π_RLSP, as a fraction of the expected true reward under the optimal policy.
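As a rough illustration of the sweep described in that row, the sketch below grids over λ (the tradeoff between R_spec and the inferred human reward) and the assumed horizon T, reporting the normalized true return for each setting. Every helper here (run_rlsp, plan_policy, true_return_fraction, r_spec, features) is an assumed interface for illustration, not the released rlsp API; a sweep over the Bayesian baseline's prior σ would follow the same pattern.

```python
import itertools
import numpy as np

def sweep_tradeoff(env, run_rlsp, plan_policy, true_return_fraction,
                   r_spec, features, observed_state, robot_horizon,
                   lambdas=(0.1, 0.3, 1.0, 3.0), horizons_T=(5, 10, 20)):
    """Hypothetical hyperparameter grid. Assumed interfaces:
      run_rlsp(env, s_obs, horizon) -> inferred reward parameters theta
      plan_policy(env, reward_fn, horizon) -> policy
      true_return_fraction(env, policy) -> true return / optimal true return
    """
    results = {}
    for lam, T in itertools.product(lambdas, horizons_T):
        theta_rlsp = run_rlsp(env, observed_state, T)

        # Trade off the specified reward against the inferred reward via lambda.
        def combined_reward(s, lam=lam, theta=theta_rlsp):
            return r_spec(s) + lam * np.dot(theta, features(s))

        pi = plan_policy(env, combined_reward, robot_horizon)
        results[(lam, T)] = true_return_fraction(env, pi)
    return results
```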