Perplexity-aware Correction for Robust Alignment with Noisy Preferences
Authors: Keyi Kong, Xilie Xu, Di Wang, Jingfeng Zhang, Mohan S. Kankanhalli
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments validate that our proposed PerpCorrect can achieve state-of-the-art alignment performance under NPs. |
| Researcher Affiliation | Academia | Keyi Kong (1), Xilie Xu (2), Di Wang (3), Jingfeng Zhang (4,5), Mohan Kankanhalli (2); (1) Shandong University, (2) National University of Singapore, (3) King Abdullah University of Science and Technology, (4) The University of Auckland, (5) RIKEN Center for Advanced Intelligence Project (AIP) |
| Pseudocode | Yes | The algorithm of PerpCorrect is described in Algorithm 2. ... Algorithm 1 Robust Alignment via Perplexity-aware Correction (PerpCorrect) ... Algorithm 2 Perplexity-aware Correction (PerpCorrect) *(a hedged sketch of the correction idea follows the table)* |
| Open Source Code | Yes | Our code is available at PerpCorrect. ... Our training code is open-sourced on GitHub. |
| Open Datasets | Yes | We utilize two preference datasets, namely Open Assistant Conversations (OASST1) [17] and Golden HH [7]. |
| Dataset Splits | Yes | The processed OASST1 dataset comprises 17,939 training samples and 951 testing samples, and the processed Golden HH dataset consists of 12,066 training samples and 654 testing samples. ... Table 5 illustrates the impact of the number of clean validation data points. |
| Hardware Specification | Yes | We utilized the QLoRA method [11] for fine-tuning the LLMs, executed on RTX 4090 GPUs with 24 GB of memory. ... Each experiment, involving a specific method and proportion of NPs, could be completed using a single RTX 4090 GPU within 24 hours on the Golden HH dataset and within 72 hours on the OASST1 dataset. |
| Software Dependencies | No | The paper mentions using the 'transformers and TRL libraries' and the 'AdamW optimizer' but does not specify their version numbers, which are required for a reproducible description of software dependencies. |
| Experiment Setup | Yes | Hyperparameters were set as follows: lora_rank = 32, lora_dropout = 0.1, and lora_alpha = 16. For SFT, we use the alpaca dataset [30] and set learning_rate = 2e-4 and batch_size = 20. For our PerpCorrect stage II, we set β = 0.1, learning_rate = 1e-3, batch_size = 4, T = 5, and α = 0.02. For our PerpCorrect stage III and all other alignment methods, we set β = 0.1, learning_rate = 3e-4, and batch_size = 20. *(a hedged configuration sketch follows the table)* |
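
The Pseudocode row references the paper's Algorithms 1 and 2, which are not reproduced here. The sketch below only illustrates the general perplexity-aware correction idea: score each chosen and rejected response by its perplexity under a reference policy and flip a preference label when the gap points the other way beyond a threshold. The function names `sequence_perplexity` and `correct_labels`, the threshold `tau`, and the flipping rule are illustrative assumptions and not the authors' Algorithm 2, which calibrates the decision on the small clean validation split mentioned in the Dataset Splits row.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def sequence_perplexity(model, tokenizer, prompt, response, device="cuda"):
    """Perplexity of `response` conditioned on `prompt` under `model`."""
    full = tokenizer(prompt + response, return_tensors="pt").to(device)
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(**full).logits
    # Shift so that position t predicts token t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = full.input_ids[:, 1:]
    log_probs = torch.log_softmax(shift_logits, dim=-1)
    token_logp = log_probs.gather(-1, shift_labels.unsqueeze(-1)).squeeze(-1)
    # Score only the response tokens, not the prompt.
    response_logp = token_logp[:, prompt_len - 1:]
    return torch.exp(-response_logp.mean()).item()


def correct_labels(model, tokenizer, dataset, tau):
    """Flip a preference pair when the perplexity gap points the other way.

    `tau` is a hypothetical threshold; in the paper it would be calibrated
    on the clean validation data rather than fixed by hand.
    """
    corrected = []
    for ex in dataset:
        ppl_chosen = sequence_perplexity(model, tokenizer, ex["prompt"], ex["chosen"])
        ppl_rejected = sequence_perplexity(model, tokenizer, ex["prompt"], ex["rejected"])
        if ppl_chosen - ppl_rejected > tau:  # "chosen" looks less likely than "rejected"
            ex = {**ex, "chosen": ex["rejected"], "rejected": ex["chosen"]}
        corrected.append(ex)
    return corrected


# Hypothetical usage:
# model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda").eval()
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
# cleaned = correct_labels(model, tokenizer, noisy_prefs, tau=0.5)
```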
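Similarly, the Hardware Specification and Experiment Setup rows describe QLoRA fine-tuning followed by a DPO-style alignment stage. The sketch below shows one way the reported hyperparameters (lora_rank = 32, lora_alpha = 16, lora_dropout = 0.1, β = 0.1, learning_rate = 3e-4, batch_size = 20) could be wired together with the transformers, peft, and trl libraries. Since the paper does not state library versions, the base model name, the toy dataset, and the exact trainer argument names (e.g. `tokenizer` vs the newer `processing_class`) are assumptions, not the authors' released configuration.

```python
import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

# 4-bit quantization for QLoRA-style fine-tuning on a single 24 GB GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = "meta-llama/Llama-2-7b-hf"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# LoRA hyperparameters reported in the Experiment Setup row.
peft_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)

# Hypothetical corrected preference data; in practice this would be the
# processed OASST1 or Golden HH training split after label correction.
corrected_train_set = Dataset.from_dict({
    "prompt": ["Explain LoRA in one sentence."],
    "chosen": ["LoRA adds small low-rank adapters to frozen weights."],
    "rejected": ["LoRA retrains every parameter from scratch."],
})

# Alignment-stage settings: beta = 0.1, lr = 3e-4, batch size 20.
training_args = DPOConfig(
    output_dir="perpcorrect-dpo",
    beta=0.1,
    learning_rate=3e-4,
    per_device_train_batch_size=20,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=corrected_train_set,
    tokenizer=tokenizer,   # newer TRL releases use processing_class= instead
    peft_config=peft_config,
)
trainer.train()
```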