Preferences Implicit in the State of the World
Authors: Rohin Shah, Dmitrii Krasheninnikov, Jordan Alexander, Pieter Abbeel, Anca Dragan
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 EVALUATION Evaluation of RLSP is non-trivial. ... We created a suite of environments with a true reward R_true, a specified reward R_spec, Alice’s first state s_T, and the robot’s initial state s_0, where R_spec ignores some aspect(s) of R_true. ... We inspect the inferred reward qualitatively and measure the expected amount of true reward obtained when planning with θ_final, as a fraction of the expected true reward from the optimal policy. ... Table 1: Performance of algorithms on environments designed to test particular properties. |
| Researcher Affiliation | Academia | Rohin Shah (UC Berkeley), Dmitrii Krasheninnikov (University of Amsterdam), Jordan Alexander (Stanford University), Pieter Abbeel (UC Berkeley), Anca D. Dragan (UC Berkeley) |
| Pseudocode | Yes | Algorithm 1 MCMC sampling from the one state IRL posterior (an illustrative sketch of this sampler appears after this table) |
| Open Source Code | Yes | Our code can be found at https://github.com/HumanCompatibleAI/rlsp. |
| Open Datasets | No | The paper describes custom-built 'proof-of-concept environments' such as 'Room with vase', 'Toy train', and 'Apple collection', but does not provide explicit access information (links, DOIs, or formal citations) to publicly available datasets used for training or evaluation. |
| Dataset Splits | No | The paper does not provide specific details on training, validation, and test dataset splits (e.g., percentages, sample counts, or explicit splitting methodology) for its experiments. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU or CPU models, processor types, or memory used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies or library versions (e.g., Python, PyTorch, TensorFlow, or specific RL frameworks with version numbers) used for implementation. |
| Experiment Setup | Yes | We tune the hyperparameter λ controlling the tradeoff between R_spec and the human reward for all algorithms, including baselines. We use a Gaussian prior over the reward parameters. We vary the value of T assumed by RLSP, and report the true return achieved by π_RLSP obtained using the inferred reward and a fixed horizon for the robot to act. For the Bayesian method, we vary the standard deviation σ of the Gaussian prior over θ_Alice that is centered at θ_spec. ... We vary the parameter that controls the tradeoff and report the true reward obtained by π_RLSP, as a fraction of the expected true reward under the optimal policy. (See the evaluation sketch after this table.) |
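
The pseudocode row above refers to the paper's Algorithm 1, which samples reward parameters from a posterior conditioned on the single observed state s_T. Below is a minimal, hedged sketch of that idea using Metropolis-Hastings with the Gaussian prior mentioned in the quoted excerpts. The likelihood p(s_T | θ), which the paper derives from a model of Alice acting for T steps, is abstracted into a user-supplied `log_likelihood` callable; this is an illustrative outline under those assumptions, not the authors' implementation.

```python
import numpy as np


def sample_posterior(log_likelihood, theta_dim, n_samples=5000,
                     prior_std=1.0, proposal_std=0.1, seed=0):
    """Draw reward-parameter samples theta ~ p(theta | s_T) via Metropolis-Hastings.

    log_likelihood: callable theta -> log p(s_T | theta). Assumed to be supplied
        by the user (the paper computes it from a model of Alice's behavior).
    prior: isotropic Gaussian N(0, prior_std^2 I), matching the quoted setup.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(theta_dim)  # start the chain at the prior mean

    def log_post(t):
        log_prior = -0.5 * np.sum(t ** 2) / prior_std ** 2
        return log_prior + log_likelihood(t)

    current = log_post(theta)
    samples = []
    for _ in range(n_samples):
        # Symmetric Gaussian proposal around the current parameters.
        proposal = theta + rng.normal(0.0, proposal_std, size=theta_dim)
        candidate = log_post(proposal)
        # Metropolis acceptance: accept with probability min(1, p(prop)/p(curr)).
        if np.log(rng.uniform()) < candidate - current:
            theta, current = proposal, candidate
        samples.append(theta.copy())
    return np.array(samples)
```

In practice one would discard an initial burn-in segment and tune `proposal_std` for a reasonable acceptance rate; those details are not specified in the quoted excerpts.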
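The experiment-setup row describes tuning a tradeoff weight λ between R_spec and the inferred human reward, then reporting the true return of the resulting policy as a fraction of the optimal policy's true return. The sketch below assumes hypothetical `plan` and `evaluate_true_return` helpers and an additive combination of the two rewards; it illustrates the evaluation protocol described in the quotes rather than reproducing the paper's code.

```python
import numpy as np


def sweep_tradeoff(plan, evaluate_true_return, r_spec, r_inferred,
                   lambdas, optimal_true_return):
    """For each lambda, plan with R_spec + lambda * R_inferred and report the
    expected true return as a fraction of the optimal policy's true return.

    plan: callable mapping a per-state reward vector to a policy (placeholder).
    evaluate_true_return: callable mapping a policy to its expected return
        under R_true (placeholder).
    """
    fractions = {}
    for lam in lambdas:
        combined = r_spec + lam * r_inferred  # additive tradeoff assumed here
        policy = plan(combined)
        fractions[lam] = evaluate_true_return(policy) / optimal_true_return
    return fractions


# Example: sweep a log-spaced grid of tradeoff values.
# fractions = sweep_tradeoff(plan, evaluate_true_return, r_spec, r_inferred,
#                            lambdas=np.logspace(-2, 2, num=9),
#                            optimal_true_return=v_opt)
```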