Trustworthy Alignment of Retrieval-Augmented Large Language Models via Reinforcement Learning

Authors: Zongmeng Zhang, Yufeng Shi, Jinhua Zhu, Wengang Zhou, Xiang Qi, Peng Zhang, Houqiang Li

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we conduct experiments and analyze TRUSTWORTHY-ALIGNMENT to answer (RQ3): how does the proposed algorithm perform, and what are the side effects of aligning retrieval-augmented LLMs to a trustworthy status. We first detail the datasets, metrics, and compared baselines. Then we analyze the performance, generalization capability, alignment tax, and human evaluation of TRUSTWORTHY-ALIGNMENT. Finally, we investigate the impact of individual components of TRUSTWORTHY-ALIGNMENT by carrying out ablation studies on annealing exploration, scaling law, and the critic model.
Researcher Affiliation | Collaboration | 1 University of Science and Technology of China; 2 Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; 3 Ant Group.
Pseudocode | Yes | Algorithm 1: TRUSTWORTHY-ALIGNMENT; Algorithm 2: Collapse Detection.
Open Source Code | Yes | Our code is available at https://github.com/zmzhang2000/trustworthy-alignment.
Open Datasets | Yes | We train TRUSTWORTHY-ALIGNMENT on the MRQA Workshop Shared Task's version (Fisch et al., 2019) of Natural Questions (Kwiatkowski et al., 2019). They construct a more coherent and convincing dataset, ConflictQA, on top of an entity-centric QA dataset, PopQA (Mallen et al., 2023), and a multi-step reasoning dataset, StrategyQA (Geva et al., 2021), by prompting LLMs to directly generate supporting evidence that contradicts the parametric memory.
Dataset Splits | No | The original and substituted versions of the training set are used for training, while only the substituted version of the dev set is used for evaluation. No specific percentages or counts for the splits are provided. (See the split-assembly sketch after this table.)
Hardware Specification | Yes | Experiments involving Llama-2-13b-chat or larger critic models are conducted on 8 A100-80G GPUs. Other experiments are conducted on 8 3090 GPUs.
Software Dependencies | No | We implement TRUSTWORTHY-ALIGNMENT based on the DeepSpeed-Chat framework (Yao et al., 2023) with LoRA (Hu et al., 2021). No specific version numbers for these software components are provided. (A generic LoRA-attachment sketch follows the table.)
Experiment Setup | Yes | We list hyperparameters in Table 4. All hyperparameters remain the same in the alignment of all pre-trained models. (Table 4 includes: α0, α1, α2, α3, γ, λ, optimizer, training batch size, max prompt sequence length, actor learning rate, critic learning rate, actor weight decay, critic weight decay, training epochs, learning rate scheduler, warmup steps, actor LoRA dim, actor LoRA learning rate, critic LoRA dim, critic LoRA learning rate, max new token number, do sample, top-p, number of beams, repetition penalty, initial temperature τ_initial, final temperature τ_final, temperature decay type.) A sketch of the temperature-annealing schedule appears below.
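
The Dataset Splits row only states that training combines the original and answer-substituted training sets while evaluation uses the substituted dev set alone. A minimal sketch of that assembly is below; `load_original_nq` and `load_substituted_nq` are hypothetical helpers, since the report gives no loader API or file layout.

```python
# Sketch of the reported split protocol. The loader helpers are hypothetical;
# the paper does not specify file names or loading code.

def build_splits(load_original_nq, load_substituted_nq):
    """Return (train_examples, eval_examples) following the reported protocol."""
    # Training: original NQ examples plus their answer-substituted counterparts.
    train_examples = load_original_nq("train") + load_substituted_nq("train")
    # Evaluation: only the substituted dev set (knowledge-conflict setting).
    eval_examples = load_substituted_nq("dev")
    return train_examples, eval_examples
```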
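The Software Dependencies row names DeepSpeed-Chat with LoRA but gives no versions or configuration. As a rough, generic illustration of attaching LoRA adapters to a causal LM (using Hugging Face peft, not the authors' DeepSpeed-Chat setup), something like the following could be used; the model name, rank, and target modules are placeholders, not values from the paper.

```python
# Generic LoRA-attachment illustration with Hugging Face peft. This is NOT the
# authors' DeepSpeed-Chat configuration; all values below are placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
lora_cfg = LoraConfig(
    r=8,                                   # adapter rank (placeholder)
    lora_alpha=16,                         # scaling factor (placeholder)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections (placeholder)
    task_type="CAUSAL_LM",
)
actor = get_peft_model(base, lora_cfg)
actor.print_trainable_parameters()
```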
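Table 4 lists an initial temperature τ_initial, a final temperature τ_final, and a temperature decay type for the annealed exploration, but the exact schedule is not quoted here. A minimal sketch of one plausible schedule follows; the decay forms and numeric values are assumptions, not taken from the paper.

```python
# Sketch of an annealed sampling-temperature schedule suggested by the
# tau_initial / tau_final / decay-type entries in Table 4. Decay forms and
# default values are assumptions.

def annealed_temperature(step, total_steps, tau_initial=1.0, tau_final=0.1, decay="linear"):
    """Interpolate the sampling temperature from tau_initial to tau_final."""
    frac = min(step / max(total_steps, 1), 1.0)
    if decay == "linear":
        return tau_initial + (tau_final - tau_initial) * frac
    if decay == "exponential":
        return tau_initial * (tau_final / tau_initial) ** frac
    raise ValueError(f"unknown decay type: {decay}")

# Example: temperature used for rollout sampling at step 500 of 1000.
tau = annealed_temperature(step=500, total_steps=1000)
```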