Noise Contrastive Alignment of Language Models with Explicit Rewards
Authors: Huayu Chen, Guande He, Lifan Yuan, Ganqu Cui, Hang Su, Jun Zhu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our methods in both reward and preference settings with Mistral-8×7B and 7B models. Experiments suggest that InfoNCA/NCA surpasses various preference baselines when reward datasets are available. We also find NCA significantly outperforms DPO in complex reasoning tasks like math and coding. |
| Researcher Affiliation | Academia | (1) Department of Computer Science and Technology, Tsinghua University; (2) Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University; (3) Zhongguancun Laboratory, Beijing, China |
| Pseudocode | Yes | PyTorch code for the InfoNCA/NCA loss for reward datasets is provided in the paper: `def reward_loss(pi_logps, ref_logps, rewards, alpha, beta, loss_type): ...` (a completed sketch of this function appears below the table). |
| Open Source Code | Yes | Code: https://github.com/thu-ml/Noise-Contrastive-Alignment |
| Open Datasets | Yes | We consider UltraFeedback [9], an instruction-following dataset annotated by GPT-4. This dataset comprises 64k instructions. |
| Dataset Splits | No | The paper trains models on specific datasets (UltraFeedback, UltraInteract) and evaluates them on separate benchmarks (MT-Bench, AlpacaEval). It specifies training parameters like epochs and batch size, but it does not explicitly provide a dedicated validation split from its primary training datasets for hyperparameter tuning or early stopping during training. |
| Hardware Specification | Yes | Experiments are run on Nvidia A40 or RTX 4090 GPUs using bfloat16 precision. |
| Software Dependencies | No | The paper mentions "PyTorch code" and refers to using the "Transformer Reinforcement Learning (TRL) library [41] and Zephyr's official codebase [40]". However, it does not provide specific version numbers for PyTorch, TRL, or the Zephyr codebase, which are required for reproducible software dependencies. |
| Experiment Setup | Yes | We ablate β ∈ {3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1, 1.0} and α ∈ {0.01, 0.1, 0.33, 1.0, 3.33}. The default reward temperature α is 0.01. The default parameterization coefficient β is also 0.01. We adopt the QLoRA [10] fine-tuning technique with rank 16, LoRA α = 16, and a dropout rate of 0.05. We train all models for 1 epoch. The batch size is 32. We use an AdamW optimizer with a learning rate of 5e-6 (a configuration sketch using these values is given below the table). |
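
The Pseudocode row quotes only the signature of `reward_loss`. The snippet below is a minimal sketch of how that function could be completed from the InfoNCA/NCA loss descriptions in the paper, assuming batched tensors of shape `(B, K)` with K scored responses per prompt; the soft-label construction and final mean reduction are assumptions, and the authors' released repository should be treated as authoritative.

```python
import torch
import torch.nn.functional as F

def reward_loss(pi_logps, ref_logps, rewards, alpha, beta, loss_type):
    """
    pi_logps:  (B, K) policy log-probabilities for K responses per prompt
    ref_logps: (B, K) reference-model log-probabilities
    rewards:   (B, K) explicit reward labels
    alpha:     reward temperature (paper default 0.01)
    beta:      parameterization coefficient (paper default 0.01)
    loss_type: "InfoNCA" or "NCA"
    """
    # Implicit reward: scaled log-ratio between policy and reference model.
    logits = beta * (pi_logps - ref_logps)
    # Soft target distribution over the K responses, derived from the rewards.
    soft_labels = F.softmax(rewards / alpha, dim=-1)

    if loss_type == "InfoNCA":
        # Cross-entropy between the reward softmax and the model's
        # relative likelihoods across the K candidate responses.
        losses = -(soft_labels * F.log_softmax(logits, dim=-1)).sum(dim=-1)
    elif loss_type == "NCA":
        # Noise-contrastive variant: per-response sigmoid terms replace the
        # softmax normalization, plus a uniform negative term over responses.
        losses = -(soft_labels * F.logsigmoid(logits)).sum(dim=-1) \
                 - F.logsigmoid(-logits).mean(dim=-1)
    else:
        raise ValueError(f"unknown loss_type: {loss_type}")
    return losses.mean()
```

For example, `reward_loss(pi_logps, ref_logps, rewards, alpha=0.01, beta=0.01, loss_type="NCA")` would reproduce the default setting reported in the Experiment Setup row, given log-probabilities and rewards for K responses per prompt.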
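The Experiment Setup row lists the fine-tuning hyperparameters in prose. The sketch below shows one hedged way to express those values with the Hugging Face `peft` and `transformers` packages (the paper itself cites the TRL library and Zephyr's codebase, not this exact code); the `output_dir` value and the single-device interpretation of the batch size of 32 are assumptions, not details from the paper.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# QLoRA adapter settings reported in the paper: rank 16, LoRA alpha 16, dropout 0.05.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Optimization settings reported in the paper: 1 epoch, batch size 32,
# AdamW with learning rate 5e-6, bfloat16 precision.
training_args = TrainingArguments(
    output_dir="nca-output",         # hypothetical path, not from the paper
    num_train_epochs=1,
    per_device_train_batch_size=32,  # assumes the batch of 32 fits on one device
    learning_rate=5e-6,
    optim="adamw_torch",
    bf16=True,
)
```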