Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Disentangling Latent Shifts of In-Context Learning with Weak Supervision
Authors: Josip Jukić, Jan Šnajder
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, our method improves generalization, stability, and efficiency across both in-domain and out-of-domain tasks, surpassing standard ICL and prior disentanglement methods. |
| Researcher Affiliation | Academia | Josip Jukić, Jan Šnajder; TakeLab, Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia |
| Pseudocode | No | The paper describes methods and processes in paragraph form, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format. |
| Open Source Code | Yes | Our code is available at https://github.com/josipjukic/wilda. |
| Open Datasets | Yes | We assess model performance on seven tasks from the GLUE benchmark [37], covering single-sequence binary classification (COLA, SST, RTE), sequence-pair binary classification (MRPC, QQP, QNLI), and sequence-pair multi-class classification (MNLI). ... Additionally, we measure accuracy on selected datasets from the MMLU benchmark [14], specifically elementary math (MATH) and miscellaneous (MISC). We further extend our analysis to the ARC-Challenge benchmark [7] to assess reasoning and multi-hop generalization, which we show in Appendix D. |
| Dataset Splits | Yes | Evaluations for GLUE are conducted on the development sets, whereas for the MMLU datasets, we randomly sample 200 instances for evaluation. Additionally, for GLUE datasets, we experiment with 200 and 500 instances to assess the impact of the amount of unlabeled data on generalization and stability. We experiment only with 100 unlabeled instances for MMLU datasets due to their limited size. |
| Hardware Specification | Yes | We conducted our experiments on AMD Ryzen Threadripper 3970X 32-Core Processors and 4 NVIDIA GeForce RTX 3090 GPUs with 24GB of RAM. |
| Software Dependencies | No | The paper mentions using a specific optimizer (AdamW), specific model architectures (Llama 3, Llama 2, Phi 3), and a specific data format (bfloat16), but does not provide version numbers for software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We employ the AdamW optimizer [29] for both PBFT and WILDA variants, with a learning rate of 10⁻⁴. In all of the experiments, we fine-tune the adapter for 10 epochs. LoRA adapter configuration: rank (r = 8), scaling factor (α = 32), dropout (0.1), target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj. |
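
For readers who want to carry the Experiment Setup row into a script, the quoted hyperparameters can be collected in one place. The following is a minimal sketch, not the authors' code: the class and field names (`AdapterSetup`, `lora_scale`, and so on) are hypothetical, and the learning rate is read as 10⁻⁴ on the assumption that the exponent's sign was lost in extraction. The `lora_scale` property uses the standard LoRA convention of scaling the adapter update by α / r.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AdapterSetup:
    """Hyperparameters quoted in the Experiment Setup row.

    All names here are hypothetical and chosen for illustration only.
    """

    optimizer: str = "AdamW"
    learning_rate: float = 1e-4  # assumption: the quoted "10 4" means 10^-4
    epochs: int = 10
    lora_rank: int = 8           # r
    lora_alpha: int = 32         # scaling factor (alpha)
    lora_dropout: float = 0.1
    target_modules: tuple = (
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    )

    @property
    def lora_scale(self) -> float:
        # Standard LoRA convention: the low-rank update is multiplied by alpha / r.
        return self.lora_alpha / self.lora_rank


setup = AdapterSetup()
config = asdict(setup)  # plain dict, convenient for logging or serialization
```

With these values the effective LoRA scale is 32 / 8 = 4. The fields would map onto whatever adapter library is used; the paper itself does not state library versions, so no particular API is assumed here.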