On Softmax Direct Preference Optimization for Recommendation

Authors: Yuxin Chen, Junfei Tan, An Zhang, Zhengyi Yang, Leheng Sheng, Enzhi Zhang, Xiang Wang, Tat-Seng Chua

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Empirically, extensive experiments conducted on three real-world datasets demonstrate the superiority of S-DPO to effectively model user preference and further boost recommendation performance while providing better rewards for preferred items." |
| Researcher Affiliation | Academia | "(1) National University of Singapore, (2) University of Science and Technology of China, (3) Hokkaido University" |
| Pseudocode | No | "The paper does not contain any structured pseudocode or algorithm blocks." |
| Open Source Code | Yes | "Our codes are available at https://github.com/chenyuxin1999/S-DPO." |
| Open Datasets | Yes | "We conduct extensive experiments on three real-world benchmark datasets which differ in size and domain (MovieLens [50], Goodreads, and LastFM [51])." |
| Dataset Splits | Yes | "For all datasets, we organize sequences chronologically before dividing the data into training, validation, and testing sets in an 8:1:1 ratio to prevent any potential information leakage." (see the split sketch after the table) |
| Hardware Specification | Yes | "We implement all approaches with Python 3.9.7, PyTorch 2.2.2, and transformers 4.38.2 on 4 NVIDIA A100 GPUs." |
| Software Dependencies | Yes | "We implement all approaches with Python 3.9.7, PyTorch 2.2.2, and transformers 4.38.2 on 4 NVIDIA A100 GPUs." |
| Experiment Setup | Yes | "For optimization of all the traditional methods, the Adam optimizer is employed with a learning rate adjusted to 0.001, and a batch size configured at 256. All models undergo L2 regularization, with coefficients experimentally determined from [1e-3, 1e-4, 1e-5, 1e-6, 1e-7]. In all experiments involving large language models, we train each method for a maximum of 5 epochs using a batch size of 128 and select the checkpoint with the lowest loss on the validation set as the final checkpoint. A warm-up strategy is applied to the learning rate, starting at 5% of its maximum value, and gradually adjusting it through a cosine scheduler throughout the training process. For S-DPO and all of its ablation studies, we further conduct preference training for 3 epochs with a batch size of 128 and a learning rate of 1e-5. Setting the value of β to 1, we search the number of negative samples in [3, 5] for the main results." (see the loss sketch after the table) |
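The chronological 8:1:1 split quoted in the Dataset Splits row can be illustrated with a short sketch. This is not the authors' preprocessing code (that lives in the linked repository); it assumes each dataset is a list of timestamped interaction sequences, and the function and field names are hypothetical.

```python
# Illustrative sketch (not the authors' code): chronological 8:1:1 split
# as described in the "Dataset Splits" row. Assumes `sequences` is a list
# of (timestamp, user_sequence) pairs; names are hypothetical.

def chronological_split(sequences, ratios=(0.8, 0.1, 0.1)):
    """Sort interaction sequences by time, then cut into train/valid/test."""
    ordered = sorted(sequences, key=lambda s: s[0])  # oldest interactions first
    n = len(ordered)
    n_train = int(ratios[0] * n)
    n_valid = int(ratios[1] * n)
    train = ordered[:n_train]
    valid = ordered[n_train:n_train + n_valid]
    test = ordered[n_train + n_valid:]  # remaining ~10%, most recent data
    return train, valid, test
```

Splitting on time rather than at random is what prevents the information leakage the authors mention: the test set contains only interactions that occur after everything seen during training.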
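The Experiment Setup row mentions preference training with β = 1 and 3–5 negative samples per preferred item. The sketch below shows one way a DPO-style objective can be extended to multiple negatives by pooling them with a softmax (log-sum-exp), which is the idea the paper's title points to. It is an assumption-laden illustration, not the authors' implementation; the exact S-DPO loss should be taken from the paper and the released code.

```python
import torch
import torch.nn.functional as F

def sdpo_style_loss(policy_pos_logp, ref_pos_logp,
                    policy_neg_logps, ref_neg_logps, beta=1.0):
    """DPO-style loss with multiple negatives pooled via log-sum-exp (softmax).

    Illustrative shapes: policy_pos_logp, ref_pos_logp -> (batch,);
    policy_neg_logps, ref_neg_logps -> (batch, num_negatives).
    Inputs are summed log-probabilities of item titles under the policy
    and frozen reference language models.
    """
    pos_reward = beta * (policy_pos_logp - ref_pos_logp)       # (batch,)
    neg_rewards = beta * (policy_neg_logps - ref_neg_logps)    # (batch, K)
    # Margin of each negative over the positive, pooled across negatives.
    margins = neg_rewards - pos_reward.unsqueeze(-1)           # (batch, K)
    pooled = torch.logsumexp(margins, dim=-1)                  # (batch,)
    # Penalize cases where any negative outscores the positive.
    return -F.logsigmoid(-pooled).mean()
```

With a single negative (K = 1) the log-sum-exp reduces to the ordinary pairwise DPO loss, so this pooling is a strict generalization; the quoted setup of β = 1 and 3–5 negatives would correspond to `beta=1.0` and `num_negatives` in {3, ..., 5} here.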