Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Interweaving Memories of a Siamese Large Language Model

Authors: Xin Song, Zhikai Xue, Guoxiu He, Jiawei Liu, Wei Lu

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct extensive experiments across various benchmark datasets, evaluating the performance of popular open-source LLMs using the proposed IMSM, in comparison to both classical and leading PEFT methods. Our findings indicate that IMSM maintains comparable time and space efficiency to backbone PEFT methods while significantly improving performance and effectively mitigating catastrophic forgetting.
Researcher Affiliation Academia 1School of Economics and Management, East China Normal University; 2Department of Computer Science, Worcester Polytechnic Institute; 3School of Information Management, Wuhan University
Pseudocode No The paper describes the methodology using text and mathematical formulas (e.g., equations 1-8) and a diagram (Figure 2), but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes Code https://github.com/ECNU-Text-Computing/IMSM
Open Datasets Yes We conduct experiments on four datasets: MRPC (El-Said et al. 2015), CoLA (El-Said et al. 2015), ROPES (Lin et al. 2019), and GSM8K (Cobbe et al. 2021), to evaluate the alignment capability of our IMSM. Following previous studies (Schick et al. 2024; Asai et al. 2023), we also employ MRPC (El-Said et al. 2015), WebQ (Bordes, Chopra, and Weston 2014), FreebaseQA (Jiang, Wu, and Jiang 2019), and MultiRC (Khashabi et al. 2018), to assess the abilities to retain general knowledge of LLM.
Dataset Splits No The paper lists several standard benchmark datasets for evaluation (e.g., MRPC, CoLA, ROPES, GSM8K, WebQ, FreebaseQA, MultiRC) and mentions evaluating performance on these. However, it does not explicitly state the specific training, validation, and test splits (e.g., percentages or sample counts) used for these datasets within the paper's text. It refers to using standard practices or following previous studies, but without explicit details.
Hardware Specification Yes The fine-tuning procedure is executed on 8 NVIDIA A800 GPUs under a Linux system.
Software Dependencies No We utilize Hugging Face Transformers (Wolf et al. 2019) and PEFT (Mangrulkar et al. 2022) to perform our experiments. The paper names these software libraries but does not specify version numbers.
Experiment Setup Yes For LoRA, AdaLoRA, and DoRA, we employ AdamW as the optimizer with learning rates of 3×10⁻⁴, 2×10⁻³, and 1×10⁻⁴, respectively, and a batch size of 16. The rank and alpha for LoRA are set to 16. For DoRA, we follow the authors' recommendations, setting the rank and alpha to 16 and 32. For (IA)³, we use Adafactor with a learning rate of 3×10⁻³ and a batch size of 8. All methods are trained for 3 epochs. For LoRAMoE, we use the original paper's configuration. For a fair comparison, we set the configurations of the tuned target modules of IMSM to be exactly the same as vanilla PEFT. The gate rank r of IMSM is set to 8.
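The reported hyperparameters can be summarized as a small sketch. This is not the authors' actual configuration code: the dict layout and field names (`r`, `alpha`, `lr`) are illustrative in the style of Hugging Face PEFT configs, and target modules are omitted because the excerpt does not enumerate them.

```python
# Illustrative summary of the per-method settings quoted above.
# Field names are assumptions, not the paper's config files.
peft_settings = {
    "LoRA":    {"optimizer": "AdamW",     "lr": 3e-4, "batch_size": 16, "r": 16, "alpha": 16},
    "AdaLoRA": {"optimizer": "AdamW",     "lr": 2e-3, "batch_size": 16},
    "DoRA":    {"optimizer": "AdamW",     "lr": 1e-4, "batch_size": 16, "r": 16, "alpha": 32},
    "(IA)^3":  {"optimizer": "Adafactor", "lr": 3e-3, "batch_size": 8},
    # LoRAMoE follows its original paper's configuration (not detailed here).
}

epochs = 3          # all methods trained for 3 epochs
imsm_gate_rank = 8  # gate rank r of IMSM
```

In practice these values would map onto, e.g., a `peft.LoraConfig(r=16, lora_alpha=16, ...)` plus an AdamW optimizer, but the excerpt does not show that code.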