Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation

Authors: Mingjie Li, Wai Man Si, Michael Backes, Yang Zhang, Yisen Wang

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we conduct a variety of experiments to demonstrate the efficiency of our SaLoRA in improving domain-specific abilities while preserving LLMs' safety alignment. Firstly, we train four widely used LLMs on Alpaca (Wang et al., 2023) with SaLoRA and other LoRA methods, and then we compare the harmful rate of these models' responses with and without some state-of-the-art post-hoc alignment methods on unsafe prompts. Then, we compare the utility of different LoRA-trained models with our SaLoRA on commonsense reasoning. We also conduct ablation studies on the rank's influence in Appendix C and examples of different methods' outputs in Appendix D.
Researcher Affiliation | Academia | 1 CISPA Helmholtz Center for Information Security; 2 State Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; 3 Institute for Artificial Intelligence, Peking University
Pseudocode | No | The paper includes mathematical formulations and diagrams of model architectures (Figure 2, Figure 4) but no explicitly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | Code is available at https://github.com/homles11/SaLoRA.
Open Datasets | Yes | Firstly, we train four widely used LLMs on Alpaca (Wang et al., 2023) with SaLoRA and other LoRA methods... We use 70% harmful prompts in the AdvBench dataset (Zou et al., 2023b) along with their safe responses... We test eight commonsense reasoning tasks on Llama-2-chat-7B fine-tuned on commonsense-15k (Talmor et al., 2019) with different methods...
Dataset Splits | Yes | We use 70% harmful prompts in the AdvBench dataset (Zou et al., 2023b) for InferAligner, Vaccine, and our SaLoRA. We use the rest of 30% harmful prompts in AdvBench for the safety evaluation, denoted as AdvBench's test subset.
Hardware Specification | Yes | The experiments finished on a single NVIDIA A100-80GB GPU.
Software Dependencies | No | The paper mentions using the PEFT (Mangrulkar et al., 2022) training pipeline and the AdamW (Loshchilov, 2019) optimizer but does not specify version numbers for these or for other software libraries such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | We first train them on the Alpaca datasets with AdamW (Loshchilov, 2019) for 1 epoch with the learning rate equal to 0.0002. Then evaluate the harmfulness score on its responses with the Llama-Guard-3-8B (Llama Team, 2024). We use 70% harmful prompts in the AdvBench dataset (Zou et al., 2023b) for InferAligner, Vaccine, and our SaLoRA. We use the rest of 30% harmful prompts in AdvBench for the safety evaluation, denoted as AdvBench's test subset. The experiments finished on the PEFT (Mangrulkar et al., 2022) training pipeline with a batch size equal to 16 for all our experiments.
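The 70/30 AdvBench split and the one-epoch Alpaca schedule quoted above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the prompt list is a stand-in (AdvBench contains 520 harmful prompts), the fixed seed and shuffling strategy are assumptions, and the Alpaca size of 52,002 instruction pairs is taken from the public dataset release.

```python
import math
import random

def split_prompts(prompts, train_frac=0.7, seed=0):
    """Shuffle and split harmful prompts into a 70% alignment-training
    subset and a held-out 30% safety-evaluation subset."""
    rng = random.Random(seed)
    shuffled = list(prompts)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the 520 harmful prompts in AdvBench (Zou et al., 2023b).
prompts = [f"prompt_{i}" for i in range(520)]
train_split, test_split = split_prompts(prompts)
print(len(train_split), len(test_split))  # 364 156

# One epoch over Alpaca's 52,002 instruction pairs at batch size 16:
steps_per_epoch = math.ceil(52_002 / 16)
print(steps_per_epoch)  # 3251
```

Seeding the shuffle keeps the training and evaluation subsets disjoint and reproducible across runs, which is the property the "Dataset Splits" row relies on.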