Interactively Learning Preference Constraints in Linear Bandits

Authors: David Lindner, Sebastian Tschiatschek, Katja Hofmann, Andreas Krause

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform three experiments. First, in Section 4.1, we consider synthetic CBAI instances to evaluate ACOL and compare it to natural baselines. Additionally, we investigate the effect of various heuristic modifications to the algorithm. Second, in Section 4.2, we compare ACOL to algorithms that safely minimize regret. And, third, in Section 4.3, we consider learning constraints that represent human preferences in a simulated driving scenario.
Researcher Affiliation | Collaboration | David Lindner¹, Sebastian Tschiatschek², Katja Hofmann³, Andreas Krause¹. ¹ETH Zurich, Switzerland; ²University of Vienna, Austria; ³Microsoft Research Cambridge, UK. Correspondence to: David Lindner <david.lindner@inf.ethz.ch>.
Pseudocode | Yes | Algorithm 1: Adaptive Constraint Learning (ACOL). Algorithm 2: Greedy Adaptive Constraint Learning (GACOL). Algorithm 3: Round-based algorithm with a generic allocation λ and hyperparameter v ∈ (1, 2). Algorithm 4: Cross-entropy method for (constrained) reinforcement learning.
Open Source Code | Yes | We provide more details on the experiments in Appendix C and we provide the full source code to reproduce our experiments (https://github.com/lasgroup/adaptive-constraint-learning). For all experiments we use a significance of δ = 0.05 and, if not stated differently, observations have Gaussian noise with σ = 0.05.
Open Datasets | Yes | As an example of this, we consider a driving simulator, which Sadigh et al. (2017) originally introduced to study learning reward functions to represent human preferences about driving behavior.
Dataset Splits | No | The paper does not specify exact percentages or sample counts for training, validation, or test splits. It mentions using synthetic instances and an existing driving simulator, but not how data within these were partitioned for different phases.
Hardware Specification | No | The paper does not specify any hardware details (e.g., CPU, GPU models, or memory) used for running the experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python version, library versions) needed for reproducibility.
Experiment Setup | Yes | For all experiments we use a significance of δ = 0.05 and, if not stated differently, observations have Gaussian noise with σ = 0.05. Our Driver environment uses a fixed time horizon T = 20, and policies are represented simply as sequences of 20 actions because the environment is deterministic. Algorithm 4 (cross-entropy method) lists n_iter, n_samp, and n_elite as inputs, indicating these are configurable parameters for its training.
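
The paper's Algorithm 4 is a cross-entropy method for (constrained) reinforcement learning with n_iter, n_samp, and n_elite as inputs. The following is a minimal, illustrative sketch of a standard cross-entropy method over fixed-horizon action sequences (horizon T = 20, matching the Driver setup quoted above); the Gaussian sampling distribution, the action dimensionality, the evaluate_return interface, and all default values are assumptions for illustration, not taken from the authors' pseudocode or released code.

import numpy as np

def cross_entropy_method(evaluate_return, horizon=20, action_dim=2,
                         n_iter=50, n_samp=100, n_elite=10, seed=0):
    # Minimal cross-entropy method over fixed-length action sequences.
    # evaluate_return maps a (horizon, action_dim) action sequence to a scalar
    # return; for constrained RL it could already fold in a penalty for
    # constraint violations. All names and defaults here are illustrative.
    rng = np.random.default_rng(seed)
    mean = np.zeros(horizon * action_dim)  # Gaussian sampling distribution
    std = np.ones(horizon * action_dim)    # over the flattened action sequence
    for _ in range(n_iter):
        # Sample n_samp candidate action sequences from the current distribution.
        samples = rng.normal(mean, std, size=(n_samp, horizon * action_dim))
        returns = np.array(
            [evaluate_return(s.reshape(horizon, action_dim)) for s in samples]
        )
        # Keep the n_elite highest-return candidates and refit the distribution.
        elite = samples[np.argsort(returns)[-n_elite:]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6     # avoid collapsing to zero variance
    return mean.reshape(horizon, action_dim)

# Toy usage: find the action sequence maximizing a simple quadratic return.
best_plan = cross_entropy_method(lambda seq: -np.square(seq).sum())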