Pragmatic Feature Preferences: Learning Reward-Relevant Preferences from Human Input
Authors: Andi Peng, Yuying Sun, Tianmin Shu, David Abel
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on linear bandit settings in both vision- and language-based domains. Results support the efficiency of our approach in quickly converging to accurate rewards with fewer comparisons vs. example-only labels. Finally, we validate the real-world applicability with a behavioral experiment on a mushroom foraging task. |
| Researcher Affiliation | Collaboration | 1Massachusetts Institute of Technology, 2Boston University, 3Johns Hopkins University, 4Google DeepMind. |
| Pseudocode | Yes | Algorithm 1 Pragmatic Feature Preference Augmentation |
| Open Source Code | Yes | Code available at github.com/andipeng/feature-preference |
| Open Datasets | Yes | The original dataset can be found at github.com/jlin816/rewards-from-language. |
| Dataset Splits | No | The paper does not explicitly provide specific percentages or counts for training, validation, or test dataset splits. It mentions 'training reward models' but lacks the detailed split information required for reproducibility. |
| Hardware Specification | No | The paper does not specify any hardware components (e.g., CPU, GPU models, memory) used for running the experiments or training the models. |
| Software Dependencies | No | The paper mentions implementing reward models as 'linear networks' and prompting 'GPT-4', but it does not specify any other software libraries, frameworks, or their version numbers necessary for replication. |
| Experiment Setup | Yes | We implement all reward models as linear networks (single layer, no activations). Each feature predictor in the joint model is trained independently without sharing parameters, and their resulting outputs are concatenated and fed through a final layer for reward prediction. We swept possible β values and found 0.5 consistently achieved the best performance. |
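To make the Experiment Setup row above concrete, below is a minimal sketch of such a joint linear reward model. It assumes PyTorch; the dimensions, optimizer settings, synthetic data, and per-feature Bradley-Terry-style preference losses are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointRewardModel(nn.Module):
    """Independent single-layer linear feature predictors (no activations, no shared
    parameters); their concatenated outputs feed a final linear layer that predicts reward."""

    def __init__(self, state_dim: int, num_features: int):
        super().__init__()
        self.feature_predictors = nn.ModuleList(
            [nn.Linear(state_dim, 1) for _ in range(num_features)]
        )
        self.reward_head = nn.Linear(num_features, 1)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Predict each feature independently, then combine for the scalar reward.
        feature_preds = torch.cat([p(state) for p in self.feature_predictors], dim=-1)
        return self.reward_head(feature_preds)


# Illustrative training step on synthetic (preferred, non-preferred) pairs:
# each feature predictor is updated on its own preference loss, then the final
# layer is fit on the frozen, concatenated feature predictions.
model = JointRewardModel(state_dim=16, num_features=4)
s_pref, s_rej = torch.randn(32, 16), torch.randn(32, 16)

for predictor in model.feature_predictors:
    opt = torch.optim.Adam(predictor.parameters(), lr=1e-2)
    opt.zero_grad()
    logits = predictor(s_pref) - predictor(s_rej)  # feature-level preference margin
    F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits)).backward()
    opt.step()

head_opt = torch.optim.Adam(model.reward_head.parameters(), lr=1e-2)
head_opt.zero_grad()
with torch.no_grad():  # keep the feature predictors fixed for this update
    feats_pref = torch.cat([p(s_pref) for p in model.feature_predictors], dim=-1)
    feats_rej = torch.cat([p(s_rej) for p in model.feature_predictors], dim=-1)
reward_logits = model.reward_head(feats_pref) - model.reward_head(feats_rej)
F.binary_cross_entropy_with_logits(reward_logits, torch.ones_like(reward_logits)).backward()
head_opt.step()
```

The detached feature predictions in the final-layer update mirror the quoted statement that feature predictors are trained independently without sharing parameters; the swept β value mentioned in the quote is not modeled in this sketch.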