STARC: A General Framework For Quantifying Differences Between Reward Functions
Authors: Joar Max Viktor Skalse, Lucy Farnik, Sumeet Ramesh Motwani, Erik Jenner, Adam Gleave, Alessandro Abate
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we evaluate our metrics empirically, to demonstrate their practical efficacy. STARC metrics can be used to make both theoretical and empirical analysis of reward learning algorithms easier and more principled. (A hedged sketch of what a STARC metric computes appears after the table.) |
| Researcher Affiliation | Collaboration | Joar Skalse (Department of Computer Science & Future of Humanity Institute, Oxford University, joar.skalse@cs.ox.ac.uk); Lucy Farnik (University of Bristol, Bristol AI Safety Centre, lucy.farnik@bristol.ac.uk); Sumeet Ramesh Motwani (Berkeley Artificial Intelligence Research, University of California, Berkeley, motwani@berkeley.edu); Erik Jenner (Berkeley Artificial Intelligence Research, University of California, Berkeley, jenner@berkeley.edu); Adam Gleave (FAR AI, Inc., adam@far.ai); Alessandro Abate (Department of Computer Science, Oxford University, aabate@cs.ox.ac.uk) |
| Pseudocode | No | The paper does not contain any explicitly labeled "Pseudocode" or "Algorithm" blocks. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | The paper describes generating its own Markov Decision Processes (MDPs) and reward functions, rather than using or providing access to a pre-existing, publicly available dataset. For example, Section G.1 states: "The transition distribution τ(s, a, s′) was generated as follows:..." and "In the random generation stage, we choose two random rewards R1, R2 using the following procedure:" (An illustrative sketch of random MDP and reward generation appears after the table.) |
| Dataset Splits | No | The paper describes generating environments and reward functions for experiments, but it does not specify explicit training/validation/test dataset splits for a fixed dataset, such as percentages, absolute counts, or citations to standard splits. |
| Hardware Specification | Yes | We used the Balrog GPU cluster at UC Berkeley, which consists of 8 A100 GPUs, each with 40 GB memory, along with 96 CPU cores. |
| Software Dependencies | No | The paper mentions software components such as SARSA and the AdamW optimizer, but it does not provide specific version numbers for these or other key dependencies, which would be needed for a reproducible description. |
| Experiment Setup | No | While the paper provides details on environment parameters (e.g., discount factor, state/action generation) and reward function construction, it lacks specific training hyperparameters (e.g., learning rates, batch sizes, number of epochs) for the learning algorithms used (e.g., SARSA for approximating V^π). |
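
For context on what a STARC metric computes, a minimal tabular sketch follows. It assumes finite state and action spaces, canonicalises each reward with an advantage-style transformation (one member of the family of canonicalisation functions the paper considers, invariant to potential shaping), and then normalises before taking an L2 distance. The function names and the specific choices of canonicalisation and norm are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def canonicalise(R, T, V, gamma):
    """Advantage-style canonicalisation (illustrative choice):
    E_{s'~T(s,a,.)}[R(s,a,s') + gamma*V(s')] - V(s),
    which is unchanged if R is modified by potential shaping.

    R: (S, A, S) reward array; T: (S, A, S) transition probabilities;
    V: (S,) value function of a fixed policy; gamma: discount factor."""
    expected = np.einsum("sap,sap->sa", T, R + gamma * V[None, None, :])
    return expected - V[:, None]

def starc_distance(R1, R2, T, V1, V2, gamma):
    """STARC-style pseudometric: canonicalise each reward, scale the
    result to unit L2 norm, and return the L2 distance between the two
    normalised representatives (so the output lies in [0, 2])."""
    def standardise(R, V):
        c = canonicalise(R, T, V, gamma).ravel()
        n = np.linalg.norm(c)
        return c / n if n > 0 else c
    return np.linalg.norm(standardise(R1, V1) - standardise(R2, V2))
```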
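
The paper's own generation procedure (Section G.1) is only partially quoted above, so the sketch below shows one common way to generate random tabular MDPs and reward pairs and feed them to `starc_distance` from the previous block. The Dirichlet transition prior, Gaussian rewards, and uniform evaluation policy are assumptions made for illustration, not the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 32, 4, 0.95

# Random transition distribution tau(s, a, s'): draw each next-state
# distribution from a flat Dirichlet prior (illustrative assumption).
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

# Two random reward functions R1, R2 over (s, a, s') transitions,
# drawn i.i.d. Gaussian (again an illustrative assumption).
R1 = rng.normal(size=(n_states, n_actions, n_states))
R2 = rng.normal(size=(n_states, n_actions, n_states))

def policy_value(R, T, gamma):
    """Exact value function of the uniformly random policy under R,
    solved from the linear Bellman equation V = r_pi + gamma * P_pi V."""
    P_pi = T.mean(axis=1)                               # (S, S)
    r_pi = np.einsum("sap,sap->sa", T, R).mean(axis=1)  # (S,)
    return np.linalg.solve(np.eye(len(P_pi)) - gamma * P_pi, r_pi)

V1, V2 = policy_value(R1, T, gamma), policy_value(R2, T, gamma)
print(starc_distance(R1, R2, T, V1, V2, gamma))
```

In a small tabular environment like this one, V can be solved exactly; the paper's experiments instead approximate it (e.g., with SARSA), which is why the missing training hyperparameters noted in the last table row matter for reproducibility.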