STARC: A General Framework For Quantifying Differences Between Reward Functions

Authors: Joar Max Viktor Skalse, Lucy Farnik, Sumeet Ramesh Motwani, Erik Jenner, Adam Gleave, Alessandro Abate

ICLR 2024

Reproducibility assessment (each item lists the variable, the result, and the supporting LLM response):
- Research Type: Experimental. "Finally, we evaluate our metrics empirically, to demonstrate their practical efficacy. STARC metrics can be used to make both theoretical and empirical analysis of reward learning algorithms both easier and more principled." (A sketch of the canonicalise-normalise-compare construction behind STARC metrics appears after this list.)
- Researcher Affiliation: Collaboration. Joar Skalse (Department of Computer Science, Oxford University; Future of Humanity Institute), joar.skalse@cs.ox.ac.uk; Lucy Farnik (University of Bristol; Bristol AI Safety Centre), lucy.farnik@bristol.ac.uk; Sumeet Ramesh Motwani (Berkeley Artificial Intelligence Research, University of California, Berkeley), motwani@berkeley.edu; Erik Jenner (Berkeley Artificial Intelligence Research, University of California, Berkeley), jenner@berkeley.edu; Adam Gleave (FAR AI, Inc.), adam@far.ai; Alessandro Abate (Department of Computer Science, Oxford University), aabate@cs.ox.ac.uk.
- Pseudocode: No. The paper does not contain any explicitly labeled "Pseudocode" or "Algorithm" blocks.
- Open Source Code: No. The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available.
- Open Datasets: No. The paper describes generating its own Markov Decision Processes (MDPs) and reward functions rather than using or providing access to a pre-existing, publicly available dataset. For example, Section G.1 states: "The transition distribution τ(s, a, s′) was generated as follows:..." and "In the random generation stage, we choose two random rewards R1, R2 using the following procedure:" (A hypothetical generation sketch appears after this list.)
- Dataset Splits: No. The paper describes generating environments and reward functions for its experiments, but it does not specify explicit training/validation/test splits for a fixed dataset, such as percentages, absolute counts, or citations to standard splits.
- Hardware Specification: Yes. "We used the Balrog GPU cluster at UC Berkeley, which consists of 8 A100 GPUs, each with 40 GB memory, along with 96 CPU cores."
- Software Dependencies: No. The paper mentions methods such as SARSA and the AdamW optimizer, but it does not provide version numbers for these or for other key software components, which would be needed for a reproducible description.
- Experiment Setup: No. While the paper provides details on environment parameters (e.g., discount factor, state/action generation) and reward function construction, it lacks specific training hyperparameters (e.g., learning rates, batch sizes, number of epochs) for the learning algorithms used, such as SARSA for approximating Vπ. (An illustrative value-estimation sketch with placeholder hyperparameters appears after this list.)
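
The paper itself releases no code, but the STARC construction it evaluates follows a canonicalise-normalise-compare recipe: remove potential shaping from each reward, scale the result to unit norm, and measure the distance between the two standardised rewards. Below is a minimal illustrative sketch for tabular MDPs (referenced from the Research Type item above). The value-function canonicalisation against a fixed policy, the Euclidean norm, and all function names are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def value_function(R, P, pi, gamma):
    """Exact V^pi for a tabular MDP: solve (I - gamma * P_pi) V = r_pi.

    R: (S, A, S) rewards, P: (S, A, S) transitions, pi: (S, A) policy."""
    n_states = R.shape[0]
    r_pi = np.einsum("sa,sat,sat->s", pi, P, R)   # expected one-step reward
    P_pi = np.einsum("sa,sat->st", pi, P)         # state-to-state transitions
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

def starc_distance(R1, R2, P, pi, gamma):
    """Canonicalise, normalise, and compare two reward tensors."""
    def canonicalise(R):
        V = value_function(R, P, pi, gamma)
        # Subtract potential shaping with potential V^pi, so that the
        # canonicalised reward's value function under pi is zero.
        return R + gamma * V[None, None, :] - V[:, None, None]

    def normalise(C):
        norm = np.linalg.norm(C)
        return C / norm if norm > 0 else C

    return np.linalg.norm(normalise(canonicalise(R1)) - normalise(canonicalise(R2)))
```

Because canonicalisation removes potential shaping and normalisation removes positive rescaling, two rewards that induce the same ordering of policies should map to (nearly) the same standardised point, which is the property the STARC framework formalises.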
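
Since the paper's exact generation procedure is elided in the Open Datasets quote above, the following stand-in shows one common way to generate random tabular MDPs and reward pairs. The Dirichlet transitions, Gaussian rewards, and all parameter values here are hypothetical, not the actual procedure from Section G.1.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mdp_with_reward_pair(n_states, n_actions):
    """Hypothetical stand-in for the paper's (elided) generation procedure.

    tau[s, a] is a Dirichlet-sampled distribution over successor states;
    R1 and R2 are i.i.d. Gaussian reward tensors of shape (S, A, S)."""
    tau = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    R1 = rng.normal(size=(n_states, n_actions, n_states))
    R2 = rng.normal(size=(n_states, n_actions, n_states))
    return tau, R1, R2
```

A generated triple can then be fed, together with a fixed policy and discount factor, to starc_distance from the previous sketch.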
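
Finally, the Experiment Setup item notes that hyperparameters for the value-estimation step are not reported. The sketch below uses tabular TD(0), a close relative of the SARSA-style updates the paper mentions, to estimate Vπ under a fixed policy; the learning rate, episode count, and horizon are placeholder values chosen for illustration, i.e., precisely the kind of detail the paper omits.

```python
import numpy as np

rng = np.random.default_rng(0)

def td0_value_estimate(tau, R, pi, gamma, episodes=1000, horizon=100, alpha=0.1):
    """Estimate V^pi with tabular TD(0).

    tau: (S, A, S) transition probabilities, R: (S, A, S) rewards,
    pi: (S, A) policy. All hyperparameter defaults are placeholders."""
    n_states = tau.shape[0]
    V = np.zeros(n_states)
    for _ in range(episodes):
        s = rng.integers(n_states)  # start each rollout from a random state
        for _ in range(horizon):
            a = rng.choice(pi.shape[1], p=pi[s])
            s_next = rng.choice(n_states, p=tau[s, a])
            # One-step bootstrapped TD(0) update toward r + gamma * V(s').
            V[s] += alpha * (R[s, a, s_next] + gamma * V[s_next] - V[s])
            s = s_next
    return V
```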