Unpacking Reward Shaping: Understanding the Benefits of Reward Engineering on Sample Complexity

Authors: Abhishek Gupta, Aldo Pacchiano, Yuexiang Zhai, Sham Kakade, Sergey Levine

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We confirm that these results hold in practice in an experimental evaluation, providing an insight into the mechanisms through which reward shaping can significantly improve the complexity of reinforcement learning while retaining asymptotic performance." and "7 Numerical Simulations: To show the practical relevance of our analysis on reward shaping we perform some numerical simulations on a family of maze environments with tabular state-action representations, as shown in Fig. 3."
Researcher Affiliation | Collaboration | Abhishek Gupta, University of Washington (abhgupta@cs.washington.edu); Aldo Pacchiano, Microsoft Research, NYC (apacchiano@microsoft.com); Yuexiang Zhai, UC Berkeley, EECS (simonzhai@berkeley.edu); Sham M. Kakade, Harvard University (sham@seas.harvard.edu); Sergey Levine, UC Berkeley, EECS (svlevine@eecs.berkeley.edu)
Pseudocode | Yes | Algorithm 1 (UCBVI Shaped) and Algorithm 2 (Value Iteration with Projection)
Open Source Code | Yes | "Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See Appendix A for link to URL and run instructions in the README in the github repo."
Open Datasets | Yes | "We used three open source domains and collected our own data on these domains."
Dataset Splits | No | The paper describes the experimental setup and numerical simulations but does not explicitly detail training, validation, and test dataset splits with percentages or counts.
Hardware Specification | Yes | "We perform our experiments on a linux machine running Ubuntu 20.04.2 LTS with 64GB of RAM and Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz."
Software Dependencies | Yes | "The code is written in Python 3.8.10, and uses standard libraries such as numpy 1.21.6 and matplotlib 3.5.2."
Experiment Setup | Yes | "Ṽ is constructed by scaling the optimal value function V* by per-state scaling factors sampled independently within the range b. ... with various levels of imperfect shaping applied by varying b = {1.5, 1.9}."
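The construction quoted in the Experiment Setup row can be sketched in a few lines of numpy (matching the paper's stated Python/numpy stack). This is a minimal illustration, not the authors' code: the paper only says per-state factors are "sampled independently within the range b", so the uniform distribution on [1, b] and the function name below are assumptions.

```python
import numpy as np

def shape_value_function(v_star, b, rng=None):
    """Build an imperfectly shaped value estimate from the optimal V*.

    Each state's value is multiplied by an independent scaling factor.
    Assumption: factors are drawn uniformly from [1, b]; the source only
    states they are "sampled independently within the range b".
    """
    rng = np.random.default_rng() if rng is None else rng
    scales = rng.uniform(1.0, b, size=v_star.shape)  # one factor per state
    return scales * v_star

# Example at the two imperfection levels b = {1.5, 1.9} used in the paper.
v_star = np.array([1.0, 0.5, 0.25, 0.0])  # toy tabular optimal values
for b in (1.5, 1.9):
    v_tilde = shape_value_function(v_star, b)
    # With factors >= 1 and V* >= 0, shaping only over-estimates V*.
    assert np.all(v_tilde >= v_star)
```

Varying b then controls how far the shaping term Ṽ can drift from V*, which is the knob the numerical simulations sweep.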