Reward learning from human preferences and demonstrations in Atari
Authors: Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, Dario Amodei
NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method on the Arcade Learning Environment (Bellemare et al., 2013) because Atari games are RL problems difficult enough to benefit from nonlinear function approximation and currently among the most diverse environments for RL. Moreover, Atari games provide well-specified true reward functions, which allows us to objectively evaluate the performance of our method and to do more rapid experimentation with synthetic (simulated) human preferences based on the game reward. |
| Researcher Affiliation | Industry | Borja Ibarz DeepMind bibarz@google.com Jan Leike DeepMind leike@google.com Tobias Pohlen DeepMind pohlen@google.com Geoffrey Irving OpenAI irving@openai.com Shane Legg DeepMind legg@google.com Dario Amodei OpenAI damodei@openai.com |
| Pseudocode | Yes | Algorithm 1 Training protocol 1: The expert provides a set of demonstrations. 2: Pretrain the policy on the demonstrations using behavioral cloning with loss JE. 3: Run the policy in the environment and store the resulting initial trajectories. 4: Sample pairs of clips (short trajectory segments) from the initial trajectories. 5: The annotator labels the pairs of clips, which get added to an annotation buffer. 6: (Optionally) automatically generate annotated pairs of clips from the demonstrations and add them to the annotation buffer. 7: Train the reward model from the annotation buffer. 8: Pretrain the policy on the demonstrations, with rewards from the reward model. 9: for M iterations do 10: Train the policy in the environment for N steps with reward from the reward model. 11: Select pairs of clips from the resulting trajectories. 12: The annotator labels the pairs of clips, which get added to the annotation buffer. 13: Train the reward model for k batches from the annotation buffer. 14: end for (A Python sketch of this protocol follows the table.) |
| Open Source Code | No | The paper does not provide an explicit statement or link to its open-source code for the methodology described. |
| Open Datasets | Yes | We evaluate our method on the Arcade Learning Environment (Bellemare et al., 2013) |
| Dataset Splits | No | The paper does not explicitly provide details about training, validation, or test dataset splits. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions algorithms and frameworks like DQfD, DQN, and A3C, but does not provide specific version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | The training objective for the agent’s policy is the cost function J(Q) = J_PDDQn(Q) + λ2 J_E(Q) + λ3 J_L2(Q). The hyperparameters λ2 and λ3 are scalar constants. The agent’s behavior is ε-greedy with respect to the action-value function Q(o, a; θ). ... Since the training set is relatively small (a few thousand pairs of clips) we incorporate a number of modifications to prevent overfitting: adaptive regularization, Gaussian noise on the input, L2 regularization on the output (details in Appendix A). Finally, since the reward model is trained only on comparisons, its scale is arbitrary, and we normalize it every 100,000 agent steps to be zero-mean and have standard deviation 0.05 over the annotation buffer A. ... In each experimental setup (except for imitation learning) we compare four feedback schedules. The full schedule consists of 6800 labels (500 initial and 6300 spread along the training protocol). The other three schedules reduce the total amount of feedback by a factor of 2, 4 and 6 respectively (see details in Appendix A). (A sketch of the reward-model normalization follows the table.) |
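The training protocol in Algorithm 1 translates naturally into a training loop. The following is a minimal Python sketch, assuming placeholder interfaces (`expert`, `annotator`, `policy`, `reward_model`, `sample_clip_pairs`, `auto_label_from_demonstrations`) that are not part of the paper; it only mirrors the ordering of the quoted steps, not the authors' implementation.

```python
# Hypothetical sketch of Algorithm 1; every name here is a placeholder
# interface, not code from the paper.

def run_training_protocol(expert, annotator, env, policy, reward_model,
                          num_iterations, steps_per_iteration,
                          reward_batches_per_iteration, use_auto_labels=False):
    # Steps 1-2: expert demonstrations, then behavioral-cloning pretraining (loss J_E).
    demonstrations = expert.provide_demonstrations()
    policy.pretrain_behavioral_cloning(demonstrations)

    # Steps 3-5: roll out the policy, sample pairs of clips, collect human labels.
    initial_trajectories = policy.run(env, steps_per_iteration)
    annotation_buffer = annotator.label(sample_clip_pairs(initial_trajectories))

    # Step 6 (optional): auto-generate labeled clip pairs from the demonstrations.
    if use_auto_labels:
        annotation_buffer += auto_label_from_demonstrations(demonstrations)

    # Steps 7-8: fit the reward model, then pretrain the policy on the
    # demonstrations using rewards predicted by the reward model.
    reward_model.fit(annotation_buffer)
    policy.pretrain_on_demonstrations(demonstrations, reward_fn=reward_model)

    # Steps 9-14: alternate policy training and reward-model updates.
    for _ in range(num_iterations):
        trajectories = policy.train(env, steps_per_iteration, reward_fn=reward_model)
        annotation_buffer += annotator.label(sample_clip_pairs(trajectories))
        reward_model.fit(annotation_buffer, num_batches=reward_batches_per_iteration)
```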
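The reward-model normalization described in the experiment setup is concrete enough to show directly. Below is a small, self-contained sketch, assuming the model's predicted rewards over the clips in the annotation buffer are available as a NumPy array; the target standard deviation of 0.05 comes from the quoted text, while the function name and return convention are assumptions.

```python
import numpy as np

def normalize_reward_predictions(reward_values, target_std=0.05, eps=1e-8):
    """Shift and rescale reward-model outputs to zero mean and a fixed std.

    `reward_values`: predicted rewards for the clips currently in the
    annotation buffer (the paper normalizes every 100,000 agent steps).
    Returns the affine parameters (shift, scale) of the normalization.
    """
    mean = reward_values.mean()
    std = reward_values.std()
    scale = target_std / (std + eps)   # rescale to the target standard deviation
    shift = -mean * scale              # then shift to zero mean
    return shift, scale

# Usage: compute the transform on the buffer, then apply it to predictions.
buffer_rewards = np.random.randn(1000)          # stand-in for reward-model outputs
shift, scale = normalize_reward_predictions(buffer_rewards)
normalized = buffer_rewards * scale + shift
print(normalized.mean(), normalized.std())      # ~0.0 and ~0.05
```

Returning an affine (shift, scale) pair, rather than only the normalized values, reflects that the same transform would presumably be applied to rewards fed to the agent between normalization points; this is a design guess, not something the paper specifies.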