Latent Paraphrasing: Perturbation on Layers Improves Knowledge Injection in Language Models
Authors: Minki Kang, Sung Ju Hwang, Gibbeum Lee, Jaewoong Cho
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments on question-answering benchmarks demonstrate that LaPael improves knowledge injection over standard fine-tuning and existing noise-based approaches. |
| Researcher Affiliation | Collaboration | Minki Kang (KRAFTON, KAIST), Sung Ju Hwang (KAIST), Gibbeum Lee (KRAFTON), Jaewoong Cho (KRAFTON) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | We do not open-source the code yet. However, we will open-source it if the paper is accepted. |
| Open Datasets | Yes | We mainly use the test split of three QA datasets: SQuAD [38], StreamingQA [27], and ArchivalQA [51] for the source of D_K and D_QA in our main experiments. |
| Dataset Splits | Yes | Table 11: Dataset statistics. We report the size of D_train, D_K, and D_QA used in our experiments. For SQuAD, D_train is 1,000, D_K is 1,000, and D_QA is 1,000. |
| Hardware Specification | Yes | We use 4 A100 GPUs for fine-tuning LLMs. |
| Software Dependencies | Yes | We mainly use Vicuna-7b-v1.5 [56] for fine-tuning, which is the instruction-tuned version of Llama-2-7b [48], for our experiments. We also verify with Mistral-7B-Instruct-v0.2 [18] and Phi-3-mini-4k-instruct [1]. |
| Experiment Setup | Yes | We fine-tune LLMs for 12 epochs with a learning rate of 0.00005 and a step learning rate scheduler that decays the learning rate by a factor of 0.85 every 4 epochs. For the optimizer, we use AdamW [28]. ... We use 5 latent paraphrasers on the 5 sequential early layers of LLMs. For Equation (13), we use N = 4. For Equation (14), we use K = 10. For Equation (15), we set r = 0.5. |
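
For context, the Software Dependencies row names the base models used for fine-tuning. The snippet below is a minimal sketch of loading them with Hugging Face `transformers`; the checkpoint identifiers and the helper `load_base_model` are assumptions mapped from the model names quoted above, not code released by the authors.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hugging Face checkpoint identifiers assumed to correspond to the models
# named in the paper (the authors do not release loading code).
MODEL_IDS = [
    "lmsys/vicuna-7b-v1.5",               # instruction-tuned Llama-2-7B
    "mistralai/Mistral-7B-Instruct-v0.2",
    "microsoft/Phi-3-mini-4k-instruct",
]

def load_base_model(model_id: str):
    """Load a base LLM and its tokenizer for fine-tuning."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    return model, tokenizer
```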
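
The Experiment Setup row translates directly into a standard PyTorch optimizer and scheduler configuration. The sketch below assumes plain `torch.optim` APIs; the helper `build_optimizer_and_scheduler` and the commented training loop are illustrative placeholders, not the authors' implementation of LaPael.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR

# Hyperparameters as reported in the Experiment Setup row above.
NUM_EPOCHS = 12
LEARNING_RATE = 5e-5
LR_DECAY_FACTOR = 0.85   # multiply the learning rate by 0.85 ...
LR_DECAY_EVERY = 4       # ... every 4 epochs

def build_optimizer_and_scheduler(model: torch.nn.Module):
    """AdamW with a step learning-rate schedule matching the reported setup."""
    optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
    scheduler = StepLR(optimizer, step_size=LR_DECAY_EVERY, gamma=LR_DECAY_FACTOR)
    return optimizer, scheduler

# Skeleton of the outer loop; `train_one_epoch` is a hypothetical placeholder
# for the fine-tuning step with latent paraphrasers on the early layers.
# for epoch in range(NUM_EPOCHS):
#     train_one_epoch(model, optimizer)
#     scheduler.step()   # step once per epoch so the decay lands every 4 epochs
```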