On Softmax Direct Preference Optimization for Recommendation
Authors: Yuxin Chen, Junfei Tan, An Zhang, Zhengyi Yang, Leheng Sheng, Enzhi Zhang, Xiang Wang, Tat-Seng Chua
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, extensive experiments conducted on three real-world datasets demonstrate the superiority of S-DPO in effectively modeling user preference and further boosting recommendation performance while providing better rewards for preferred items. |
| Researcher Affiliation | Academia | (1) National University of Singapore, (2) University of Science and Technology of China, (3) Hokkaido University |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our codes are available at https://github.com/chenyuxin1999/S-DPO. |
| Open Datasets | Yes | We conduct extensive experiments on three real-world benchmark datasets which differ in size and domain (MovieLens [50], Goodreads, and LastFM [51]). |
| Dataset Splits | Yes | For all datasets, we organize sequences chronologically before dividing the data into training, validation, and testing sets in an 8:1:1 ratio to prevent any potential information leakage. |
| Hardware Specification | Yes | We implement all approaches with Python 3.9.7, PyTorch 2.2.2, and transformers 4.38.2 on 4 NVIDIA A100 GPUs. |
| Software Dependencies | Yes | We implement all approaches with Python 3.9.7, PyTorch 2.2.2, and transformers 4.38.2 on 4 NVIDIA A100 GPUs. |
| Experiment Setup | Yes | For optimization of all the traditional methods, the Adam optimizer is employed with a learning rate set to 0.001 and a batch size of 256. All models undergo L2 regularization, with coefficients experimentally determined from [1e-3, 1e-4, 1e-5, 1e-6, 1e-7]. In all experiments involving large language models, we train each method for a maximum of 5 epochs using a batch size of 128 and select the checkpoint with the lowest loss on the validation set as the final checkpoint. A warm-up strategy is applied to the learning rate, starting at 5% of its maximum value and gradually adjusting it through a cosine scheduler throughout the training process. For S-DPO and all of its ablation studies, we conduct preference training for a further 3 epochs with a batch size of 128 and a learning rate of 1e-5. Setting the value of β to 1, we search the number of negative samples in [3, 5] for the main results. |
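
The Dataset Splits row above describes a chronological 8:1:1 train/validation/test split intended to prevent information leakage. The sketch below is a minimal illustration of one plausible reading of that setup; the DataFrame layout, the `timestamp` column name, and the choice to split globally by time order are assumptions for illustration, not details confirmed by the paper or its released code.

```python
import pandas as pd


def chronological_split(interactions: pd.DataFrame,
                        time_col: str = "timestamp",
                        ratios=(0.8, 0.1, 0.1)):
    """Sort records by time, then cut the ordered data into 8:1:1
    train/validation/test slices so that later interactions never
    leak into an earlier split (a sketch, not the authors' code)."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    ordered = interactions.sort_values(time_col).reset_index(drop=True)
    n = len(ordered)
    n_train = int(n * ratios[0])
    n_valid = int(n * ratios[1])
    train = ordered.iloc[:n_train]
    valid = ordered.iloc[n_train:n_train + n_valid]
    test = ordered.iloc[n_train + n_valid:]
    return train, valid, test
```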
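
For context on the preference-training step in the Experiment Setup row (β = 1, 3 or 5 negative samples per preferred item), the sketch below shows a softmax-DPO-style loss that contrasts one preferred item against several sampled negatives via the implicit reward log(π_θ/π_ref). Tensor shapes, argument names, and the mean reduction are assumptions made for illustration; the authors' released code at https://github.com/chenyuxin1999/S-DPO is the authoritative implementation.

```python
import torch
import torch.nn.functional as F


def s_dpo_style_loss(pos_logp, pos_logp_ref, neg_logp, neg_logp_ref, beta=1.0):
    """Sketch of a softmax-DPO-style preference loss with multiple negatives.

    pos_logp, pos_logp_ref: (batch,) log-likelihoods of the preferred item
        under the policy and the frozen reference model.
    neg_logp, neg_logp_ref: (batch, num_neg) log-likelihoods of the sampled
        dispreferred items under the same two models.
    """
    # Implicit rewards: beta * (log pi_theta - log pi_ref)
    pos_reward = beta * (pos_logp - pos_logp_ref)          # (batch,)
    neg_reward = beta * (neg_logp - neg_logp_ref)          # (batch, num_neg)

    # Softmax-style ranking term: log-sum-exp of reward gaps over negatives,
    # pushed through log-sigmoid as in DPO.
    diff = neg_reward - pos_reward.unsqueeze(-1)           # (batch, num_neg)
    loss = -F.logsigmoid(-torch.logsumexp(diff, dim=-1))   # (batch,)
    return loss.mean()
```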