Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Joint Localization and Activation Editing for Low-Resource Fine-Tuning
Authors: Wen Lai, Alexander Fraser, Ivan Titov
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate JOLA across three benchmark categories: commonsense reasoning, natural language understanding, and natural language generation. Experimental results on 26 tasks from the benchmarks (Hu et al., 2023; Wang et al., 2024b; Gehrmann et al., 2022) demonstrate that JOLA consistently outperforms existing methods in low-resource settings (as shown in Figure 4), delivering robust performance across various data scales and model sizes. |
| Researcher Affiliation | Academia | ¹Technical University of Munich ²Munich Center for Machine Learning ³ILLC, University of Edinburgh ⁴ILLC, University of Amsterdam. Correspondence to: Wen Lai <EMAIL>, Ivan Titov <EMAIL>, Alexander Fraser <EMAIL>. |
| Pseudocode | No | The paper describes the methodology using mathematical equations and descriptive text, but it does not contain a dedicated pseudocode block or algorithm section. |
| Open Source Code | Yes | The code for the method is released at https://github.com/wenlai-lavine/jola. |
| Open Datasets | Yes | For commonsense reasoning, we utilize a widely adopted benchmark (Hu et al., 2023; Wu et al., 2024b) containing 8 datasets: ARC-c and ARC-e (Clark et al., 2018), BoolQ (Clark et al., 2019), HellaSwag (Zellers et al., 2019), OBQA (Mihaylov et al., 2018), PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), and WinoGrande (Sakaguchi et al., 2021). We evaluate on the MMLU-Pro benchmark (Wang et al., 2024b), covering 14 domains... For generation tasks, we select 4 datasets from the GEM benchmark (Gehrmann et al., 2022), including CommonGen (Lin et al., 2020) for concept-to-sentence generation, E2E (Novikova et al., 2017) and WebNLG (Gardent et al., 2017) for data-to-text generation, and XSum (Narayan et al., 2018) for abstractive summarization of long documents. |
| Dataset Splits | Yes | For all datasets, we sample 200 examples to simulate low-resource scenarios, with further analysis of data size effects provided in Section 6. ... To ensure consistency across experiments, we used the same random seed (seed = 42) for data sampling, ensuring identical training samples in all runs. |
| Hardware Specification | Yes | We conduct all experiments using the Hugging Face Transformers library and fine-tuned the models with the TRL toolkit. The AdamW optimizer (Loshchilov, 2017) was used for fine-tuning, with ϵ = 1e-6 and one epoch of warm-up. Given the small dataset (e.g., 200 samples in our setting), overfitting was a concern. To mitigate overfitting's impact on the baseline, we introduced early stopping, which was not applied in the original implementation of the baseline systems. We also found that learning rate adjustment significantly affected the results. To optimize the learning rate strategy, we evaluated four strategies: (1) linear schedule (Mnih et al., 2015), (2) Cyclic Learning Rate Schedule (Smith, 2017), (3) Adaptive Heuristic Schedule (Smith, 2018), and (4) Exponential Decay Schedule (Li & Arora, 2019). As shown in Table 8, the exponential decay strategy proved to be the most stable, so we used it for both the baseline and our method. The exponentially decaying learning rate schedule is defined by lr(t) = lr₀ · λ^t · e^(−decay·t) (Eq. 9), where lr₀ is the initial learning rate, set to 5×10⁻⁴, λ is 0.1, and the decay rate is 0.01. For the gating units, we used a temperature of 0.33 in the Gumbel-Softmax (Jang et al., 2017). Fine-tuning was performed in full precision for the 7B, 8B, 1B, and 3B models, while for the 70B model, we applied 4-bit quantization to enable single-precision fine-tuning. |
| Software Dependencies | No | We conduct all experiments using the Hugging Face Transformers library and fine-tuned the models with the TRL toolkit. The AdamW optimizer (Loshchilov, 2017) was used for fine-tuning, with ϵ = 1e-6 and one epoch of warm-up. |
| Experiment Setup | Yes | The AdamW optimizer (Loshchilov, 2017) was used for fine-tuning, with ϵ = 1e-6 and one epoch of warm-up. ... The exponentially decaying learning rate schedule is defined by lr(t) = lr₀ · λ^t · e^(−decay·t) (Eq. 9), where lr₀ is the initial learning rate, set to 5×10⁻⁴, λ is 0.1, and the decay rate is 0.01. For the gating units, we used a temperature of 0.33 in the Gumbel-Softmax (Jang et al., 2017). Fine-tuning was performed in full precision for the 7B, 8B, 1B, and 3B models, while for the 70B model, we applied 4-bit quantization to enable single-precision fine-tuning. ... For each task, we run a grid search with five different hyperparameter configurations, which are chosen to explore a diverse range of parameter settings that could provide the best model performance. We performed this search over key hyperparameters, as presented in Table 7. |
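The learning-rate schedule quoted in the Hardware Specification and Experiment Setup rows can be sketched in a few lines. Note that the factorization lr(t) = lr₀ · λ^t · e^(−decay·t) is a reconstruction from the extracted text, and the function name `exp_decay_lr` is our own, not from the paper:

```python
import math

def exp_decay_lr(t: int, lr0: float = 5e-4, lam: float = 0.1, decay: float = 0.01) -> float:
    """Learning rate at step t under lr(t) = lr0 * lam**t * exp(-decay * t).

    Defaults follow the values quoted from the paper:
    lr0 = 5e-4, lambda = 0.1, decay rate = 0.01.
    """
    return lr0 * (lam ** t) * math.exp(-decay * t)

# The schedule starts at lr0 and decays monotonically as t grows.
```

With these constants the λ^t term dominates, so the rate drops roughly an order of magnitude per step; how t is counted (steps vs. epochs) is not recoverable from the quoted text.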
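The gating units mentioned above use a Gumbel-Softmax with temperature 0.33. A minimal stdlib-only sketch of that sampling step follows; the function name and the use of plain Python lists (rather than the paper's tensor framework) are illustrative assumptions:

```python
import math
import random

def gumbel_softmax(logits: list[float], tau: float = 0.33) -> list[float]:
    """Sample a relaxed (differentiable-style) one-hot vector from logits at temperature tau."""
    # Gumbel(0, 1) noise via the inverse-CDF trick: -log(-log(U)), U ~ Uniform(0, 1)
    noise = [-math.log(-math.log(random.random())) for _ in logits]
    scaled = [(l + g) / tau for l, g in zip(logits, noise)]
    # Numerically stable softmax over the noisy, temperature-scaled logits
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

The lower the temperature, the closer each sample is to a discrete one-hot gate, which is why a small value like 0.33 suits hard-gating of attention heads while keeping the relaxation trainable.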