Sequential Preference Ranking for Efficient Reinforcement Learning from Human Feedback
Authors: Minyoung Hwang, Gunmin Lee, Hogun Kee, Chan Woo Kim, Kyungjae Lee, Songhwai Oh
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of the proposed method in both locomotion tasks from Deepmind Control Suite (DMControl) [3] and manipulation tasks from Meta-World [4]. |
| Researcher Affiliation | Academia | 1 Electrical and Computer Engineering and ASRI, Seoul National University, Seoul 08826, Korea; 2 Department of Artificial Intelligence, Chung-Ang University, Seoul 06974, Korea |
| Pseudocode | Yes | The full procedure of our algorithm is provided in the supplementary material. |
| Open Source Code | Yes | Project page: https://rllab-snu.github.io/projects/SeqRank |
| Open Datasets | Yes | We evaluate our method on robotic locomotion tasks from Deep Mind Control Suite (DMControl) [3, 26] and robotic manipulation tasks from Meta-World [4]. |
| Dataset Splits | No | The paper does not explicitly provide details on train/validation/test dataset splits, specific percentages, or counts of samples for each split. |
| Hardware Specification | No | The paper mentions running experiments in a 'simulation environment' and a 'Mujoco simulator' and using a 'UR-5 robot' for real-world tasks, but does not provide specific hardware details like GPU/CPU models or memory specifications used for computation. |
| Software Dependencies | No | The paper mentions various software components such as SAC, MRN, PEBBLE, DMControl, Meta-World, and Mujoco simulator, but does not provide specific version numbers for any of these dependencies. |
| Experiment Setup | Yes | For each task, we train with 10 different seeds used in prior work [6, 10] and measure the average performance with a standard deviation. We use unsupervised pre-training [7] for 9,000 steps for all experiments. The trajectory encoder is implemented using a two-layer feed-forward neural network, where the input dimension is the combined size of the state and action spaces, and the output dimension is set to 256. The linear reward model then takes the encoded feature and passes it through a single fully connected neural network. |
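
The "Experiment Setup" row above describes the reward model in enough detail to sketch it: a two-layer feed-forward trajectory encoder over concatenated state-action inputs with a 256-dimensional output, followed by a single fully connected reward head. The snippet below is a minimal reconstruction assuming PyTorch; the class names, the ReLU activation, the hidden width, and the dummy input dimensions are illustrative assumptions, not taken from the authors' released code.

```python
# Hedged sketch of the architecture described in the Experiment Setup row.
# Names (TrajectoryEncoder, LinearRewardModel) and the dummy dimensions are
# assumptions for illustration only.
import torch
import torch.nn as nn


class TrajectoryEncoder(nn.Module):
    """Two-layer feed-forward encoder over concatenated (state, action) pairs."""

    def __init__(self, state_dim: int, action_dim: int, feature_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, feature_dim),
            nn.ReLU(),  # activation choice is an assumption
            nn.Linear(feature_dim, feature_dim),  # output dimension 256, per the paper
        )

    def forward(self, states: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([states, actions], dim=-1))


class LinearRewardModel(nn.Module):
    """Single fully connected layer mapping the encoded feature to a scalar reward."""

    def __init__(self, feature_dim: int = 256):
        super().__init__()
        self.head = nn.Linear(feature_dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.head(features)


# Usage on a dummy batch of (state, action) pairs; dimensions are placeholders.
encoder = TrajectoryEncoder(state_dim=24, action_dim=6)
reward_model = LinearRewardModel()
states, actions = torch.randn(32, 24), torch.randn(32, 6)
rewards = reward_model(encoder(states, actions))  # shape: (32, 1)
```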