On Softmax Direct Preference Optimization for Recommendation

Authors: Yuxin Chen, Junfei Tan, An Zhang, Zhengyi Yang, Leheng Sheng, Enzhi Zhang, Xiang Wang, Tat-Seng Chua

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Empirically, extensive experiments conducted on three real-world datasets demonstrate the superiority of S-DPO to effectively model user preference and further boost recommendation performance while providing better rewards for preferred items." |
| Researcher Affiliation | Academia | "(1) National University of Singapore, (2) University of Science and Technology of China, (3) Hokkaido University" |
| Pseudocode | No | "The paper does not contain any structured pseudocode or algorithm blocks." |
| Open Source Code | Yes | "Our codes are available at https://github.com/chenyuxin1999/S-DPO." |
| Open Datasets | Yes | "We conduct extensive experiments on three real-world benchmark datasets which differ in size and domain (MovieLens [50], Goodreads, and LastFM [51])." |
| Dataset Splits | Yes | "For all datasets, we organize sequences chronologically before dividing the data into training, validation, and testing sets in an 8:1:1 ratio to prevent any potential information leakage." (see the split sketch after the table) |
| Hardware Specification | Yes | "We implement all approaches with Python 3.9.7, PyTorch 2.2.2, and transformers 4.38.2 on 4 NVIDIA A100 GPUs." |
| Software Dependencies | Yes | "We implement all approaches with Python 3.9.7, PyTorch 2.2.2, and transformers 4.38.2 on 4 NVIDIA A100 GPUs." |
| Experiment Setup | Yes | "For optimization of all the traditional methods, the Adam optimizer is employed with a learning rate adjusted to 0.001, and a batch size configured at 256. All models undergo L2 regularization, with coefficients experimentally determined from [1e-3, 1e-4, 1e-5, 1e-6, 1e-7]. In all experiments involving large language models, we train each method for a maximum of 5 epochs using a batch size of 128 and select the checkpoint with the lowest loss on the validation set as the final checkpoint. A warm-up strategy is applied to the learning rate, starting at 5% of its maximum value, and gradually adjusting it through a cosine scheduler throughout the training process. For S-DPO and all of its ablation studies, we further conduct preference training for 3 epochs with a batch size of 128 and a learning rate of 1e-5. Setting the value of β to 1, we search the number of negative samples in [3, 5] for the main results." (see the loss sketch after the table) |
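The chronological 8:1:1 split quoted in the Dataset Splits row can be illustrated with a short sketch. This is not the authors' preprocessing code (that lives in the linked repository); it assumes each dataset is a list of timestamped interaction sequences, and the function and field names are hypothetical.

```python
# Illustrative sketch (not the authors' code): chronological 8:1:1 split
# as described in the "Dataset Splits" row. Assumes `sequences` is a list
# of (timestamp, user_sequence) pairs; names are hypothetical.

def chronological_split(sequences, ratios=(0.8, 0.1, 0.1)):
    """Sort interaction sequences by time, then cut into train/valid/test."""
    ordered = sorted(sequences, key=lambda s: s[0])  # oldest interactions first
    n = len(ordered)
    n_train = int(ratios[0] * n)
    n_valid = int(ratios[1] * n)
    train = ordered[:n_train]
    valid = ordered[n_train:n_train + n_valid]
    test = ordered[n_train + n_valid:]  # remaining ~10%, most recent data
    return train, valid, test
```

Splitting on time rather than at random is what prevents the information leakage the authors mention: the test set contains only interactions that occur after everything seen during training.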
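The Experiment Setup row mentions preference training with β = 1 and 3–5 negative samples per preferred item. The sketch below shows one way a DPO-style objective can be extended to multiple negatives by pooling them with a softmax (log-sum-exp), which is the idea the paper's title points to. It is an assumption-laden illustration, not the authors' implementation; the exact S-DPO loss should be taken from the paper and the released code.

```python
import torch
import torch.nn.functional as F

def sdpo_style_loss(policy_pos_logp, ref_pos_logp,
                    policy_neg_logps, ref_neg_logps, beta=1.0):
    """DPO-style loss with multiple negatives pooled via log-sum-exp (softmax).

    Illustrative shapes: policy_pos_logp, ref_pos_logp -> (batch,);
    policy_neg_logps, ref_neg_logps -> (batch, num_negatives).
    Inputs are summed log-probabilities of item titles under the policy
    and frozen reference language models.
    """
    pos_reward = beta * (policy_pos_logp - ref_pos_logp)       # (batch,)
    neg_rewards = beta * (policy_neg_logps - ref_neg_logps)    # (batch, K)
    # Margin of each negative over the positive, pooled across negatives.
    margins = neg_rewards - pos_reward.unsqueeze(-1)           # (batch, K)
    pooled = torch.logsumexp(margins, dim=-1)                  # (batch,)
    # Penalize cases where any negative outscores the positive.
    return -F.logsigmoid(-pooled).mean()
```

With a single negative (K = 1) the log-sum-exp reduces to the ordinary pairwise DPO loss, so this pooling is a strict generalization; the quoted setup of β = 1 and 3–5 negatives would correspond to `beta=1.0` and `num_negatives` in {3, ..., 5} here.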