Hindsight PRIORs for Reward Learning from Human Preferences

Authors: Mudit Verma, Katherine Metcalf

Venue: ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate the benefits of Hindsight PRIOR on the DeepMind Control (DMC) Suite locomotion (Tunyasuvunakool et al., 2020) and Meta-World control (Yu et al., 2020) tasks, compare against baselines (Lee et al., 2021a; Park et al., 2022; Liu et al., 2022; Liang et al., 2022), and ablate over Hindsight PRIOR's contributions."
Researcher Affiliation | Collaboration | Mudit Verma (Arizona State University, Tempe, AZ 85281; muditverma@asu.edu) and Katherine Metcalf (Apple Inc., Cupertino, CA 95014; kmetcalf@apple.com)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. (A hedged sketch of a compatible training objective follows this table.)
Open Source Code | No | The paper does not state that source code for the described methodology is open source, and provides no link to a public release.
Open Datasets | Yes | "We evaluate the benefits of Hindsight PRIOR on the DeepMind Control (DMC) Suite locomotion (Tunyasuvunakool et al., 2020) and Meta-World control (Yu et al., 2020) tasks." (A loading sketch for both benchmarks follows this table.)
Dataset Splits | No | The paper does not provide dataset split information (exact percentages, sample counts, or splitting methodology) for training, validation, and testing.
Hardware Specification | No | The paper does not specify the GPU models, CPU models, or other hardware used to run its experiments.
Software Dependencies | No | The paper names software components such as Python and the Adam optimizer, but gives no version numbers for them or for any other dependencies.
Experiment Setup | Yes | Table 2: training hyper-parameters for SAC (Haarnoja et al., 2018). Table 3: PEBBLE hyper-parameters (Lee et al., 2021a). Table 4: Hindsight PRIOR hyper-parameters. Table 5: world-model hyper-parameters in Hindsight PRIOR. (A placeholder config layout follows this table.)
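
Since the paper ships no pseudocode, the following is a minimal sketch of what a PEBBLE-style preference-learning objective with a hindsight prior could look like. It is an assumption-laden reconstruction, not the authors' implementation: the Bradley-Terry segment loss follows the cited baseline (Lee et al., 2021a), while `hindsight_prior_loss`, its `importance` weights, and the `prior_weight` mixing coefficient are hypothetical stand-ins for the paper's world-model-derived prior.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, seg_a, seg_b, prefs):
    """Bradley-Terry loss over segment pairs (PEBBLE-style, Lee et al., 2021a).

    seg_a, seg_b: (batch, T, obs_dim) observation segments.
    prefs: (batch,) labels in {0, 1}; 1 means seg_a was preferred.
    """
    ret_a = reward_model(seg_a).sum(dim=1).squeeze(-1)  # predicted segment returns
    ret_b = reward_model(seg_b).sum(dim=1).squeeze(-1)
    logits = ret_a - ret_b  # P(seg_a preferred) = sigmoid(ret_a - ret_b)
    return F.binary_cross_entropy_with_logits(logits, prefs.float())

def hindsight_prior_loss(reward_model, seg, importance, pred_return):
    """Hypothetical auxiliary loss: push per-state rewards toward a
    redistribution of a predicted segment return, weighted by per-state
    importance (e.g., world-model attention). A stand-in, not the paper's loss.

    importance: (batch, T), nonnegative, summing to 1 over T.
    pred_return: (batch,) predicted return for each segment.
    """
    rewards = reward_model(seg).squeeze(-1)          # (batch, T) per-state rewards
    target = importance * pred_return.unsqueeze(-1)  # each state's return share
    return F.mse_loss(rewards, target)

# Toy usage: a small MLP reward model over random segments.
obs_dim, T, batch = 8, 50, 16
reward_model = torch.nn.Sequential(
    torch.nn.Linear(obs_dim, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)
seg_a, seg_b = torch.randn(batch, T, obs_dim), torch.randn(batch, T, obs_dim)
prefs = torch.randint(0, 2, (batch,))
importance = torch.softmax(torch.randn(batch, T), dim=-1)
pred_return = torch.randn(batch)

prior_weight = 0.1  # hypothetical mixing coefficient
loss = preference_loss(reward_model, seg_a, seg_b, prefs) \
       + prior_weight * hindsight_prior_loss(reward_model, seg_a, importance, pred_return)
loss.backward()
```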
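
Both benchmarks named in the Open Datasets row are openly released Python packages. Below is a minimal loading sketch, assuming the `dm_control` and `metaworld` packages as distributed by their authors; the specific tasks ("walker-walk", "pick-place-v2") are illustrative picks rather than the paper's evaluation set, and Meta-World's API has shifted across releases.

```python
import random

# DeepMind Control Suite locomotion tasks (Tunyasuvunakool et al., 2020).
from dm_control import suite

dmc_env = suite.load(domain_name="walker", task_name="walk")
timestep = dmc_env.reset()
print(dmc_env.action_spec())  # bounded continuous action space

# Meta-World control tasks (Yu et al., 2020); API per the open-source release.
import metaworld

ml1 = metaworld.ML1("pick-place-v2")           # one task, many goal variations
mw_env = ml1.train_classes["pick-place-v2"]()  # construct the environment
mw_env.set_task(random.choice(ml1.train_tasks))
obs = mw_env.reset()
```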
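
Finally, the Experiment Setup row points to the paper's hyper-parameter tables without reproducing their contents. A reimplementation would gather those tables into a single config; the sketch below shows one possible layout, in which every concrete number is a generic SAC default from Haarnoja et al. (2018) used as a placeholder, and `None` marks values that must be transcribed from the paper's Tables 2-5.

```python
# Placeholder experiment config. Numeric values are generic SAC defaults,
# NOT the paper's settings; None entries must be filled from Tables 2-5.
config = {
    "sac": {                       # Table 2 (Haarnoja et al., 2018)
        "discount": 0.99,
        "learning_rate": 3e-4,
        "target_smoothing_tau": 0.005,
        "batch_size": 256,
    },
    "pebble": {                    # Table 3 (Lee et al., 2021a)
        "segment_length": None,
        "total_feedback": None,
    },
    "hindsight_prior": {           # Table 4
        "prior_loss_weight": None,
    },
    "world_model": {               # Table 5
        "architecture": None,
        "learning_rate": None,
    },
}
```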