Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Self-Updatable Large Language Models by Integrating Context into Model Parameters
Authors: Yu Wang, Xinshuang Liu, Xiusi Chen, Sean O'Brien, Junda Wu, Julian McAuley
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations on question-answering and conversational recommendation tasks demonstrate that SELF-PARAM significantly outperforms existing methods, even when accounting for non-zero storage requirements. This advancement paves the way for more efficient and scalable integration of experiences in large language models by embedding knowledge directly into model parameters. Code is open-sourced at https://github.com/XinshuangL/SELF-PARAM |
| Researcher Affiliation | Academia | Yu Wang1, Xinshuang Liu1, Xiusi Chen2, Sean O'Brien1, Junda Wu1, Julian McAuley1. 1University of California San Diego, 2University of Illinois Urbana-Champaign |
| Pseudocode | No | The paper describes the methodology in Section 3, titled 'METHODOLOGY', using textual descriptions and mathematical equations (e.g., Eq. 1, 2, 3), but it does not include a clearly labeled pseudocode block or algorithm. |
| Open Source Code | Yes | Code is open-sourced at https://github.com/XinshuangL/SELF-PARAM |
| Open Datasets | Yes | We use the PwC dataset (Ge et al., 2023), consisting of triples in the form (context, question, answer). We utilize two datasets: (1) INSPIRED (Hayati et al., 2020): Contains 731 conversational interactions. (2) REDIAL (Li et al., 2018): Comprises 7,415 conversational interactions. In this research, we use publicly available datasets (PwC, INSPIRED, and REDIAL) that do not contain personally identifiable information, ensuring compliance with data privacy standards and licensing agreements. |
| Dataset Splits | Yes | From these, we select the first 100 contexts paired with 225 questions for the single context injection task. ... Then we extract 100 and 500 contexts, with 225 questions and 1044 questions, respectively, from the obtained subset to perform batch injection. ... We construct a list of 20 unique contexts from the PwC dataset (Ge et al., 2023) and inject them into the model one after another in a sequential manner. After each injection step, we assess the model's performance by calculating the QA-F1 score on all questions related to the injected contexts. |
| Hardware Specification | Yes | For all the experiments, we conduct experiments with eight NVIDIA-RTX-A6000 GPUs. |
| Software Dependencies | No | The KL divergence is computed using the torch.nn.functional.kl_div function from the PyTorch library. For the backbone model Openllama-3B-v2, we train the MLP layers. For Mistral-7B, Mistral-7B-instruct-v0.2, and Llama3-8B, we use LoRA (Hu et al., 2021) from the package peft (Mangrulkar et al., 2022). While PyTorch and peft are mentioned, specific version numbers are not provided for these software components. |
| Experiment Setup | Yes | The learning rate is set to 2e-5 for training. We train for 50 epochs in both Single Context Injection and Batch Context Injection, 20 epochs in Sequential Injection, and 1 epoch in Conversational Recommendation. ... The LoRA configurations are: {inference_mode: false, r: 8, lora_alpha: 32, lora_dropout: 0.1, target_modules: ["q_proj", "v_proj", "k_proj", "up_proj", "down_proj", "gate_proj"] }. |
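The Software Dependencies row notes that the method's KL term is computed with torch.nn.functional.kl_div. As a minimal sketch of that call's argument convention only (the logits below are random placeholders, not the paper's models or data):

```python
import torch
import torch.nn.functional as F

# Placeholder logits for two models over a toy vocabulary (batch 2, vocab 5).
teacher_logits = torch.randn(2, 5)
student_logits = torch.randn(2, 5)

# F.kl_div expects log-probabilities as its first argument and
# probabilities as its second (target) argument.
log_probs = F.log_softmax(student_logits, dim=-1)
target_probs = F.softmax(teacher_logits, dim=-1)

# "batchmean" divides the summed KL by the batch size, matching the
# mathematical definition of KL divergence averaged over examples.
loss = F.kl_div(log_probs, target_probs, reduction="batchmean")
```

Note the argument order: passing raw probabilities (rather than log-probabilities) as the first argument is a common source of silently wrong losses with this function.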
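The LoRA settings quoted in the Experiment Setup row map directly onto peft's LoraConfig. A hedged reconstruction is below; the field names follow the standard peft API, and this is an illustrative sketch, not code taken from the paper's repository:

```python
from peft import LoraConfig

# LoRA hyperparameters as reported in the Experiment Setup row.
lora_config = LoraConfig(
    inference_mode=False,   # adapters are trainable, not frozen for inference
    r=8,                    # rank of the low-rank update matrices
    lora_alpha=32,          # scaling factor applied to the LoRA update
    lora_dropout=0.1,       # dropout on the LoRA input activations
    target_modules=[        # attention and MLP projections to adapt
        "q_proj", "v_proj", "k_proj",
        "up_proj", "down_proj", "gate_proj",
    ],
)
```

Such a config would typically be applied to a loaded backbone with peft's get_peft_model before training.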