Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Disentangling Latent Shifts of In-Context Learning with Weak Supervision

Authors: Josip Jukić, Jan Šnajder

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirically, our method improves generalization, stability, and efficiency across both in-domain and out-of-domain tasks, surpassing standard ICL and prior disentanglement methods.
Researcher Affiliation	Academia	Josip Jukić, Jan Šnajder TakeLab Faculty of Electrical Engineering and Computing University of Zagreb, Croatia EMAIL
Pseudocode	No	The paper describes methods and processes in paragraph form, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format.
Open Source Code	Yes	Our code is available at https://github.com/josipjukic/wilda.
Open Datasets	Yes	We assess model performance on seven tasks from the GLUE benchmark [37], covering single-sequence binary classification (COLA, SST, RTE), sequence-pair binary classification (MRPC, QQP, QNLI), and sequence-pair multi-class classification (MNLI). ... Additionally, we measure accuracy on selected datasets from the MMLU benchmark [14], specifically elementary math (MATH) and miscellaneous (MISC). We further extend our analysis to the ARC-Challenge benchmark [7] to assess reasoning and multi-hop generalization, which we show in Appendix D.
Dataset Splits	Yes	Evaluations for GLUE are conducted on the development sets, whereas for the MMLU datasets, we randomly sample 200 instances for evaluation. Additionally, for GLUE datasets, we experiment with 200 and 500 instances to assess the impact of the amount of unlabeled data on generalization and stability. We experiment only with 100 unlabeled instances for MMLU datasets due to their limited size.
Hardware Specification	Yes	We conducted our experiments on AMD Ryzen Threadripper 3970X 32-Core Processors and 4 NVIDIA GeForce RTX 3090 GPUs with 24GB of RAM.
Software Dependencies	No	The paper mentions using specific optimizers (AdamW) and model architectures (Llama 3, Llama 2, Phi 3), and a specific data format (bfloat16), but does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup	Yes	We employ the AdamW optimizer [29] for both PBFT and WILDA variants, with a learning rate of 10^-4. In all of the experiments, we fine-tune the adapter for 10 epochs. LoRA adapter configuration: Rank (r = 8), Scaling factor (α = 32), Dropout: (0.1), Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj.
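
The adapter and optimizer settings quoted in the Experiment Setup row could be written down roughly as follows. This is a minimal sketch, not the authors' code: the names `LORA_KWARGS`, `TRAINING`, and `build_lora_config` are hypothetical, the learning rate is read as 1e-4 (the exponent sign is lost in the extracted text), and the use of the Hugging Face `peft` library is an assumption, since the row names no software packages or versions.

```python
# Hyperparameters as quoted in the Experiment Setup row.
# Keys follow the keyword names of peft's LoraConfig (an assumed dependency).
LORA_KWARGS = {
    "r": 8,                # LoRA rank
    "lora_alpha": 32,      # scaling factor (alpha)
    "lora_dropout": 0.1,   # dropout applied to the adapter
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}

# Optimizer settings; learning_rate assumes the garbled "10 4" means 10^-4.
TRAINING = {
    "optimizer": "AdamW",
    "learning_rate": 1e-4,
    "num_epochs": 10,
}


def build_lora_config():
    """Build a peft LoraConfig from the values above.

    Hypothetical helper: requires `peft` to be installed, and the
    authors' repository may configure the adapter differently.
    """
    from peft import LoraConfig  # imported lazily so the dicts work without peft

    return LoraConfig(**LORA_KWARGS)
```

Keeping the values in plain dictionaries makes them easy to compare against the configuration in the released repository before running anything.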