Model-Free Preference-Based Reinforcement Learning
Authors: Christian Wirth, Johannes Fürnkranz, Gerhard Neumann
AAAI 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, the learner does not have access to the true reward signal but is given preference feedback based on the undiscounted sum of rewards. All reported results are averaged over 20 trials, except the Acrobot domain where we had to reduce to 10. |
| Researcher Affiliation | Academia | Christian Wirth, Johannes Fürnkranz, Gerhard Neumann, Technische Universität Darmstadt, Germany |
| Pseudocode | No | The paper describes algorithmic steps in text but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper mentions a third-party tool's website ('http://www.gurobi.com/') but does not state that the authors' own source code for their methodology is publicly available. |
| Open Datasets | Yes | As a first testing domain, we use the Gridworld defined by Akrour et al. (2014). The second task is the bicycle balance task (Lagoudakis and Parr 2003). The third domain is the inverted pendulum swing-up task. The last domain is the acrobot task (Sutton and Barto 1998). |
| Dataset Splits | No | The paper mentions minimizing 'training set error' for hyperparameter tuning but does not specify any explicit validation set splits or methodology (e.g., percentages, cross-validation). |
| Hardware Specification | No | The paper does not provide specific hardware details such as CPU/GPU models, memory, or cloud instance types used for running experiments. |
| Software Dependencies | No | The paper mentions 'Gurobi Solver' but does not specify its version number or other software dependencies with specific versions. |
| Experiment Setup | Yes | We always collect 10 trajectories per iteration and use a discount factor of γ = 0.98. We request 1 preference per iteration. For the elliptic slice sampling, we utilize 100k samples each for burn-in and evaluation. The ϵ bound of AC-REPS, the sigmoid shape parameter m, and the variance of the prior σ are manually tuned on the preference-based tasks. |
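
For context on the setup row, below is a minimal sketch of one elliptical slice sampling transition (Murray, Adams, and MacKay 2010), the sampler the paper references, paired with a hypothetical sigmoid preference log-likelihood and a driver loop mirroring the reported budget of 100k burn-in plus 100k evaluation samples. The function names, the feature-difference encoding of preferences, and the placeholder data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def elliptical_slice_step(f, log_likelihood, sigma, rng):
    """One elliptical slice sampling transition (Murray, Adams & MacKay, 2010)
    for a posterior with Gaussian prior N(0, sigma^2 I) and any likelihood."""
    nu = rng.normal(0.0, sigma, size=f.shape)          # auxiliary prior draw
    log_y = log_likelihood(f) + np.log(rng.uniform())  # slice threshold
    theta = rng.uniform(0.0, 2.0 * np.pi)
    theta_min, theta_max = theta - 2.0 * np.pi, theta
    while True:
        f_new = f * np.cos(theta) + nu * np.sin(theta)  # point on the ellipse
        if log_likelihood(f_new) > log_y:
            return f_new
        # Shrink the angle bracket toward theta = 0 and retry;
        # theta = 0 recovers f itself, so the loop always terminates.
        if theta < 0.0:
            theta_min = theta
        else:
            theta_max = theta
        theta = rng.uniform(theta_min, theta_max)

def sigmoid_pref_loglik(w, diffs, m):
    """Hypothetical preference log-likelihood: each row of `diffs` is the
    feature difference between the preferred and the dominated trajectory,
    and m plays the role of the sigmoid shape parameter from the paper.
    -logaddexp(0, -x) == log(sigmoid(x)), computed stably."""
    return np.sum(-np.logaddexp(0.0, -m * (diffs @ w)))

# Illustrative driver matching the reported budget:
# 100k burn-in transitions, then 100k kept samples.
rng = np.random.default_rng(0)
diffs = rng.normal(size=(50, 10))  # placeholder preference data
w = np.zeros(10)
loglik = lambda w: sigmoid_pref_loglik(w, diffs, m=1.0)
samples = []
for i in range(200_000):
    w = elliptical_slice_step(w, loglik, sigma=1.0, rng=rng)
    if i >= 100_000:
        samples.append(w)
```

Elliptical slice sampling needs no step-size tuning, which is consistent with the paper tuning only the prior variance σ and the sigmoid shape m rather than any sampler-specific parameters.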