Expectation Alignment: Handling Reward Misspecification in the Presence of Expectation Mismatch

Authors: Malek Mechergui, Sarath Sreedharan

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate our method on a set of standard Markov Decision Process (MDP) benchmarks." and "We empirically demonstrate how the method compares against baseline methods for handling reward uncertainty in benchmark domains."
Researcher Affiliation | Academia | Malek Mechergui and Sarath Sreedharan, Colorado State University, Fort Collins, 80523; {Malek.Mechergui, Sarath.Sreedharan}@colostate.edu
Pseudocode | Yes | "Algorithm 1 provides the pseudo-code for the query procedure."
Open Source Code | Yes | "The code for our experiments can be found at https://github.com/Malek-Mechergui/codeMDP", "We have included a zip of the code along with instructions. There was no dataset.", and "We wrote all the codes included, and they will be released with an open-source license."
Open Datasets | Yes | "Most of these are standard benchmark tasks taken from the Simple RL library [Abel, 2019]." (a minimal usage sketch follows the table)
Dataset Splits | No | The paper does not explicitly provide training, validation, or test dataset splits in terms of percentages or sample counts. It describes using "five random instantiations of each grid size" for evaluation.
Hardware Specification | Yes | "All experiments were run on Alma Linux 8.9 with 32GB RAM and 16 Intel(R) Xeon(R) 2.60GHz CPUs."
Software Dependencies | No | "We used CPLEX [Bliek1ú et al., 2014] as our LP solver (no-cost edition)." The paper mentions CPLEX but does not provide a specific version number.
Experiment Setup | Yes | "All the baselines were run with a time-bound of 30 minutes per problem.", "We have specified the solver used. Given these are just using LP formulations of MDPs, we didn't have any hyperparameters to select.", and "For each of the tasks, the expectation set consists of reaching the goal state and avoiding some random states in the environment. The human models were generated by modifying the original task slightly." (an illustrative LP sketch follows the table)
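
The benchmark tasks cited in the Open Datasets row come from the Simple RL library. The sketch below is an illustration only, not the authors' exact configuration: it instantiates a small GridWorld task from simple_rl and runs a baseline agent on it. The grid dimensions, goal and lava locations, and the choice of agent are assumptions made for the example.

```python
# Minimal sketch (illustrative assumptions, not the paper's exact configuration):
# instantiating a standard GridWorld benchmark task from the simple_rl library
# and running a baseline agent on it.
from simple_rl.tasks import GridWorldMDP
from simple_rl.agents import QLearningAgent
from simple_rl.run_experiments import run_agents_on_mdp

# 4x3 grid: start at (1, 1), goal at (4, 3); a lava cell stands in for a state to avoid.
mdp = GridWorldMDP(width=4, height=3, init_loc=(1, 1),
                   goal_locs=[(4, 3)], lava_locs=[(4, 2)], gamma=0.95)

agent = QLearningAgent(actions=mdp.get_actions())
run_agents_on_mdp([agent], mdp, instances=5, episodes=50, steps=100)
```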
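
The Experiment Setup row states that the methods rely on LP formulations of MDPs solved with CPLEX. The following sketch shows the standard primal linear program for a tabular MDP, using scipy.optimize.linprog purely as an illustrative stand-in for CPLEX; the transition tensor, reward matrix, and discount factor are toy values invented for the example, not values from the paper.

```python
# Minimal sketch (not the authors' code): solving a tabular MDP via its
# standard primal LP. The paper uses CPLEX; scipy's linprog is shown here
# only as an illustrative stand-in.
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(P, R, gamma=0.95):
    """P: (S, A, S) transition tensor, R: (S, A) rewards. Returns optimal state values V."""
    S, A = R.shape
    # One constraint per (s, a): V(s) - gamma * sum_s' P(s'|s,a) V(s') >= R(s, a),
    # rewritten as (gamma * P[s, a] - e_s) @ V <= -R(s, a) for linprog's A_ub form.
    A_ub = np.zeros((S * A, S))
    b_ub = np.zeros(S * A)
    for s in range(S):
        for a in range(A):
            row = gamma * P[s, a].copy()
            row[s] -= 1.0
            A_ub[s * A + a] = row
            b_ub[s * A + a] = -R[s, a]
    # Objective: minimize the sum of state values; values are unbounded in sign.
    res = linprog(c=np.ones(S), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * S)
    return res.x

# Toy two-state, two-action MDP (made-up values).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 0.0], [1.0, 1.0]])
print(solve_mdp_lp(P, R))
```

The LP has one variable per state and one constraint per state-action pair, which is consistent with the paper's remark that the LP-based setup required no hyperparameter selection.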