Sequential Preference Ranking for Efficient Reinforcement Learning from Human Feedback
Authors: Minyoung Hwang, Gunmin Lee, Hogun Kee, Chan Woo Kim, Kyungjae Lee, Songhwai Oh
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of the proposed method in both locomotion tasks from Deepmind Control Suite (DMControl) [3] and manipulation tasks from Meta-World [4]. |
| Researcher Affiliation | Academia | 1 Electrical and Computer Engineering and ASRI, Seoul National University, Seoul 08826, Korea; 2 Department of Artificial Intelligence, Chung-Ang University, Seoul 06974, Korea |
| Pseudocode | Yes | The full procedure of our algorithm is provided in the supplementary material. |
| Open Source Code | Yes | Project page: https://rllab-snu.github.io/projects/SeqRank |
| Open Datasets | Yes | We evaluate our method on robotic locomotion tasks from Deep Mind Control Suite (DMControl) [3, 26] and robotic manipulation tasks from Meta-World [4]. |
| Dataset Splits | No | The paper does not explicitly provide details on train/validation/test dataset splits, specific percentages, or counts of samples for each split. |
| Hardware Specification | No | The paper mentions running experiments in a 'simulation environment' and a 'Mujoco simulator' and using a 'UR-5 robot' for real-world tasks, but does not provide specific hardware details like GPU/CPU models or memory specifications used for computation. |
| Software Dependencies | No | The paper mentions various software components such as SAC, MRN, PEBBLE, DMControl, Meta-World, and Mujoco simulator, but does not provide specific version numbers for any of these dependencies. |
| Experiment Setup | Yes | For each task, we train with 10 different seeds used in prior work [6, 10] and measure the average performance with a standard deviation. We use unsupervised pre-training [7] for 9,000 steps for all experiments. The trajectory encoder is implemented using a two-layer feed-forward neural network, where the input dimension is the combined size of the state and action spaces, and the output dimension is set to 256. The linear reward model then takes the encoded feature and passes it through a single fully connected neural network. |
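
The "Experiment Setup" row above describes the reward model in enough detail to sketch it: a two-layer feed-forward trajectory encoder over concatenated state-action inputs with a 256-dimensional output, followed by a single fully connected reward head. The snippet below is a minimal reconstruction assuming PyTorch; the class names, the ReLU activation, the hidden width, and the dummy input dimensions are illustrative assumptions, not taken from the authors' released code.

```python
# Hedged sketch of the architecture described in the Experiment Setup row.
# Names (TrajectoryEncoder, LinearRewardModel) and the dummy dimensions are
# assumptions for illustration only.
import torch
import torch.nn as nn


class TrajectoryEncoder(nn.Module):
    """Two-layer feed-forward encoder over concatenated (state, action) pairs."""

    def __init__(self, state_dim: int, action_dim: int, feature_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, feature_dim),
            nn.ReLU(),  # activation choice is an assumption
            nn.Linear(feature_dim, feature_dim),  # output dimension 256, per the paper
        )

    def forward(self, states: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([states, actions], dim=-1))


class LinearRewardModel(nn.Module):
    """Single fully connected layer mapping the encoded feature to a scalar reward."""

    def __init__(self, feature_dim: int = 256):
        super().__init__()
        self.head = nn.Linear(feature_dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.head(features)


# Usage on a dummy batch of (state, action) pairs; dimensions are placeholders.
encoder = TrajectoryEncoder(state_dim=24, action_dim=6)
reward_model = LinearRewardModel()
states, actions = torch.randn(32, 24), torch.randn(32, 6)
rewards = reward_model(encoder(states, actions))  # shape: (32, 1)
```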