Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Self-Updatable Large Language Models by Integrating Context into Model Parameters
Authors: Yu Wang, Xinshuang Liu, Xiusi Chen, Sean O'Brien, Junda Wu, Julian McAuley
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations on question-answering and conversational recommendation tasks demonstrate that SELF-PARAM significantly outperforms existing methods, even when accounting for non-zero storage requirements. This advancement paves the way for more efficient and scalable integration of experiences in large language models by embedding knowledge directly into model parameters. Code is open-sourced at https://github.com/XinshuangL/SELF-PARAM |
| Researcher Affiliation | Academia | Yu Wang1, Xinshuang Liu1, Xiusi Chen2, Sean O'Brien1, Junda Wu1, Julian McAuley1. 1University of California San Diego, 2University of Illinois Urbana-Champaign |
| Pseudocode | No | The paper describes the methodology in Section 3, titled 'METHODOLOGY', using textual descriptions and mathematical equations (e.g., Eq. 1, 2, 3), but it does not include a clearly labeled pseudocode block or algorithm. |
| Open Source Code | Yes | Code is open-sourced at https://github.com/XinshuangL/SELF-PARAM |
| Open Datasets | Yes | We use the PwC dataset (Ge et al., 2023), consisting of triples in the form (context, question, answer). We utilize two datasets: (1) INSPIRED (Hayati et al., 2020): Contains 731 conversational interactions. (2) REDIAL (Li et al., 2018): Comprises 7,415 conversational interactions. In this research, we use publicly available datasets (PwC, INSPIRED, and REDIAL) that do not contain personally identifiable information, ensuring compliance with data privacy standards and licensing agreements. |
| Dataset Splits | Yes | From these, we select the first 100 contexts paired with 225 questions for the single context injection task. ... Then we extract 100 and 500 contexts, with 225 questions and 1044 questions, respectively, from the obtained subset to perform batch injection. ... We construct a list of 20 unique contexts from the PwC dataset (Ge et al., 2023) and inject them into the model one after another in a sequential manner. After each injection step, we assess the model's performance by calculating the QA-F1 score on all questions related to the injected contexts. |
| Hardware Specification | Yes | For all the experiments, we conduct experiments with eight NVIDIA-RTX-A6000 GPUs. |
| Software Dependencies | No | The KL divergence is computed using the torch.nn.functional.kl_div function from the PyTorch library. For the backbone model Openllama-3B-v2, we train the MLP layers. For Mistral-7B, Mistral-7B-instruct-v0.2, and Llama3-8B, we use LoRA (Hu et al., 2021) from the package peft (Mangrulkar et al., 2022). While PyTorch and peft are mentioned, specific version numbers are not provided for these software components. |
| Experiment Setup | Yes | The learning rate is set to 2e-5 for training. We train for 50 epochs in both Single Context Injection and Batch Context Injection, 20 epochs in Sequential Injection, and 1 epoch in Conversational Recommendation. ... The LoRA configurations are: {inference_mode: false, r: 8, lora_alpha: 32, lora_dropout: 0.1, target_modules: ["q_proj", "v_proj", "k_proj", "up_proj", "down_proj", "gate_proj"] }. |
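The Software Dependencies row notes that the method's KL term is computed with torch.nn.functional.kl_div. As a minimal sketch of that call's argument convention only (the logits below are random placeholders, not the paper's models or data):

```python
import torch
import torch.nn.functional as F

# Placeholder logits for two models over a toy vocabulary (batch 2, vocab 5).
teacher_logits = torch.randn(2, 5)
student_logits = torch.randn(2, 5)

# F.kl_div expects log-probabilities as its first argument and
# probabilities as its second (target) argument.
log_probs = F.log_softmax(student_logits, dim=-1)
target_probs = F.softmax(teacher_logits, dim=-1)

# "batchmean" divides the summed KL by the batch size, matching the
# mathematical definition of KL divergence averaged over examples.
loss = F.kl_div(log_probs, target_probs, reduction="batchmean")
```

Note the argument order: passing raw probabilities (rather than log-probabilities) as the first argument is a common source of silently wrong losses with this function.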
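The LoRA settings quoted in the Experiment Setup row map directly onto peft's LoraConfig. A hedged reconstruction is below; the field names follow the standard peft API, and this is an illustrative sketch, not code taken from the paper's repository:

```python
from peft import LoraConfig

# LoRA hyperparameters as reported in the Experiment Setup row.
lora_config = LoraConfig(
    inference_mode=False,   # adapters are trainable, not frozen for inference
    r=8,                    # rank of the low-rank update matrices
    lora_alpha=32,          # scaling factor applied to the LoRA update
    lora_dropout=0.1,       # dropout on the LoRA input activations
    target_modules=[        # attention and MLP projections to adapt
        "q_proj", "v_proj", "k_proj",
        "up_proj", "down_proj", "gate_proj",
    ],
)
```

Such a config would typically be applied to a loaded backbone with peft's get_peft_model before training.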