Behavior Alignment via Reward Function Optimization

Authors: Dhawal Gupta, Yash Chandak, Scott Jordan, Philip S. Thomas, Bruno C. da Silva

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method's efficacy on a diverse set of tasks, from small-scale experiments to high-dimensional control challenges. ... 6 Empirical Analyses ... Table 1: Summary of the performance of various reward combination methods and types of r_aux
Researcher Affiliation | Academia | Dhawal Gupta, University of Massachusetts; Yash Chandak, Stanford University; Scott M. Jordan, University of Alberta; Philip S. Thomas, University of Massachusetts; Bruno Castro da Silva, University of Massachusetts
Pseudocode | Yes | C.4 Pseudo Code (Algorithm 5) ... Algorithm 5: BARFI: Behavior Alignment Reward Function's Implicit optimization (an illustrative bi-level sketch is given after this table)
Open Source Code | No | The paper does not provide an explicit statement or a link to its own open-source code for the described methodology.
Open Datasets | Yes | Mountain Car (MC) [58], ... Cart Pole (CP) [16] ... Half Cheetah-v4 from the MuJoCo (MJ) suite of OpenAI Gym [9]
Dataset Splits | No | The paper describes data collection via agent interaction and uses 'batches of trajectories' for policy updates, but it does not specify explicit training, validation, and test dataset splits in the traditional sense, as is common in offline learning.
Hardware Specification | Yes | Experiments were run on a compute cluster whose CPU class is Intel Xeon Gold 6240 @ 2.60 GHz.
Software Dependencies | No | The paper mentions using 'PyTorch [46]' and 'OpenAI Gym [9]', but does not provide explicit version numbers for these or other software dependencies used in their implementation.
Experiment Setup | Yes | Table 3: Hyper-parameters for Grid World; Table 4: Hyper-parameters for Mountain Car; Table 5: Hyper-parameters for Cart Pole; Table 6: Hyper-parameters for MuJoCo
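
To make the bi-level structure referenced by the Pseudocode row (Algorithm 5, BARFI) concrete, below is a minimal sketch, not the paper's pseudocode: it assumes a toy two-armed bandit, a softmax policy, and a central finite-difference outer update in place of the implicit-gradient machinery described in the paper's appendix. All names (R_PRIMARY, R_AUX, inner_policy_optimization, etc.) are illustrative, not taken from the paper's code.

```python
# Hedged sketch of a bi-level reward-optimization loop in the spirit of BARFI.
# NOT the authors' Algorithm 5: the bandit, the exact softmax policy gradient,
# and the finite-difference outer update are illustrative assumptions.
import numpy as np

# Toy two-armed bandit: a primary reward and a misaligned auxiliary reward.
R_PRIMARY = np.array([1.0, 0.0])   # the designer's true objective prefers arm 0
R_AUX     = np.array([0.0, 1.0])   # the auxiliary shaping signal prefers arm 1

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def inner_policy_optimization(w, steps=50, lr=0.5):
    """Inner loop: policy gradient on the combined reward r_primary + w * r_aux."""
    theta = np.zeros(2)
    r_combined = R_PRIMARY + w * R_AUX
    for _ in range(steps):
        probs = softmax(theta)
        # Exact gradient of E_pi[r_combined] for a softmax policy over two arms.
        theta += lr * probs * (r_combined - probs @ r_combined)
    return theta

def primary_return(theta):
    """Outer objective: expected primary reward of the policy from the inner loop."""
    return float(softmax(theta) @ R_PRIMARY)

# Outer loop: adjust the reward-combination weight w so that the policy trained on
# the combined reward does well on the primary objective. A central finite
# difference stands in for the paper's implicit-gradient machinery.
w, outer_lr, eps = 1.5, 3.0, 0.25
for step in range(10):
    j_plus = primary_return(inner_policy_optimization(w + eps))
    j_minus = primary_return(inner_policy_optimization(w - eps))
    w += outer_lr * (j_plus - j_minus) / (2 * eps)
    print(f"outer step {step}: w = {w:+.3f}, primary return = "
          f"{primary_return(inner_policy_optimization(w)):.3f}")
```

In this sketch the outer loop drives the auxiliary weight w down once the policy it induces hurts the primary return; the paper's method instead differentiates through the inner optimization implicitly rather than via finite differences.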