Latent Paraphrasing: Perturbation on Layers Improves Knowledge Injection in Language Models

Authors: Minki Kang, Sung Ju Hwang, Gibbeum Lee, Jaewoong Cho

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experiments on question-answering benchmarks demonstrate that LaPael improves knowledge injection over standard fine-tuning and existing noise-based approaches.
Researcher Affiliation | Collaboration | Minki Kang (KRAFTON, KAIST), Sung Ju Hwang (KAIST), Gibbeum Lee (KRAFTON), Jaewoong Cho (KRAFTON)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | We do not open-source the code yet. However, we will open-source it if the paper is accepted.
Open Datasets | Yes | We mainly use the test split of three QA datasets: SQuAD [38], StreamingQA [27], and ArchivalQA [51] as the source of D_K and D_QA in our main experiments. (See the dataset-loading sketch below.)
Dataset Splits | Yes | Table 11: Dataset statistics. We report the size of D_train, D_K, and D_QA used in our experiments. For SQuAD, D_train is 1,000, D_K is 1,000, and D_QA is 1,000.
Hardware Specification | Yes | We use 4 A100 GPUs for fine-tuning LLMs.
Software Dependencies | Yes | We mainly use Vicuna-7b-v1.5 [56] for fine-tuning, which is the instruction-tuned version of Llama-2-7b [48], for our experiments. We also verify with Mistral-7B-Instruct-v0.2 [18] and Phi-3-mini-4k-instruct [1]. (See the model-loading sketch below.)
Experiment Setup | Yes | We fine-tune LLMs for 12 epochs with a learning rate of 0.00005 and a step learning rate scheduler where we decay the learning rate by 0.85 every 4 epochs. For the optimizer, we use AdamW [28]. ... We use 5 latent paraphrasers on the 5 sequential early layers of LLMs. For Equation (13), we use N = 4. For Equation (14), we use K = 10. For Equation (15), we set r = 0.5. (See the training-setup sketch below.)
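Dataset-loading sketch. The Open Datasets row states that D_K and D_QA are built from the test splits of SQuAD, StreamingQA, and ArchivalQA. Of these, only SQuAD has a widely used copy on the Hugging Face Hub, and the Hub exposes a public "validation" split rather than a hidden test split, so the dataset identifier and split below are assumptions, not specifications from the paper.

```python
from datasets import load_dataset

# SQuAD v1.1 from the Hugging Face Hub; the public "validation" split is
# used here as a stand-in for the test split referenced in the paper.
squad = load_dataset("squad", split="validation")

example = squad[0]
print(example["context"][:200])    # source passage (candidate knowledge for D_K)
print(example["question"])         # question (candidate for D_QA)
print(example["answers"]["text"])  # gold answer span(s)
```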
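Model-loading sketch. The Software Dependencies row names the three instruction-tuned checkpoints used in the experiments. Below is a minimal loading sketch with Hugging Face transformers; the Hub identifiers are assumptions based on the public releases of these models and are not given in the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed public Hub identifiers for the checkpoints named in the paper.
MODEL_IDS = {
    "vicuna": "lmsys/vicuna-7b-v1.5",
    "mistral": "mistralai/Mistral-7B-Instruct-v0.2",
    "phi3": "microsoft/Phi-3-mini-4k-instruct",
}

def load_base_model(name: str):
    """Load a tokenizer/model pair for one of the base LLMs before fine-tuning."""
    model_id = MODEL_IDS[name]
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    return tokenizer, model

tokenizer, model = load_base_model("vicuna")
```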
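Training-setup sketch. The Experiment Setup row fully specifies the optimizer and learning-rate schedule, so it can be transcribed into a PyTorch sketch. Only the numeric hyperparameters (12 epochs, learning rate 5e-5, decay 0.85 every 4 epochs, AdamW, 5 latent paraphrasers, N = 4, K = 10, r = 0.5) come from the paper; the helper name build_optimizer_and_scheduler is illustrative, and the latent-paraphraser modules themselves are not reproduced here since the code is not released.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR

# Hyperparameters reported in the paper's experiment setup.
NUM_EPOCHS = 12
LEARNING_RATE = 5e-5      # 0.00005
LR_DECAY_GAMMA = 0.85     # multiplicative decay factor
LR_DECAY_EVERY = 4        # decay applied every 4 epochs
NUM_PARAPHRASERS = 5      # latent paraphrasers on the 5 sequential early layers
N_EQ13 = 4                # N in Equation (13)
K_EQ14 = 10               # K in Equation (14)
R_EQ15 = 0.5              # r in Equation (15)

def build_optimizer_and_scheduler(model: torch.nn.Module):
    """AdamW with a step schedule that decays the LR by 0.85 every 4 epochs."""
    optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
    scheduler = StepLR(optimizer, step_size=LR_DECAY_EVERY, gamma=LR_DECAY_GAMMA)
    return optimizer, scheduler
```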