Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
NeuronTune: Towards Self-Guided Spurious Bias Mitigation
Authors: Guangtao Zheng, Wenqian Ye, Aidong Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments across different architectures and data modalities demonstrate that our method significantly mitigates spurious bias in a self-guided way. Experiments on vision and text datasets with different model architectures confirm the effectiveness of our method. (Section 1, Introduction) Section 5 is titled "Experiments" and includes details on datasets (Waterbirds, CelebA, ImageNet-9, ImageNet-A, MultiNLI, CivilComments), experimental setup, and comparison tables (Tables 1, 2, 3, 4, 5). |
| Researcher Affiliation | Academia | 1Department of Computer Science, University of Virginia, Charlottesville, VA, USA. Correspondence to: Guangtao Zheng <EMAIL>. |
| Pseudocode | No | The paper describes the practical implementation of NeuronTune in Section 4.2 with a flowchart in Figure 1. However, it does not include a clearly labeled pseudocode or algorithm block with structured steps. |
| Open Source Code | Yes | Code is available at https://github.com/gtzheng/NeuronTune. |
| Open Datasets | Yes | We tested NeuronTune on four image datasets and two text datasets, each with different types of spurious attributes. (1) Waterbirds (Sagawa et al., 2019) is an image dataset... (2) CelebA (Liu et al., 2015) is a large-scale image dataset... (3) ImageNet-9 (Xiao et al., 2021) is a subset of ImageNet (Deng et al., 2009)... (4) ImageNet-A (Hendrycks et al., 2021) is a dataset of real-world images... (5) MultiNLI (Williams et al., 2018) is a text classification dataset... (6) CivilComments (Borkan et al., 2019) is a binary text classification dataset... |
| Dataset Splits | Yes | The dataset uses standard splits provided by the WILDS benchmark (Koh et al., 2021). (Section 5.1) Table 8. Numbers of samples in different groups and different splits of the four datasets. (Appendix A.7) We divided D_val into two equal halves: one half (denoted as D_val/2) was used as D_Ide, while the other half served as D_Tune. (Section 5.5) |
| Hardware Specification | No | The paper mentions using ResNet-50, ResNet-18, and BERT models, which are deep neural networks, implying the use of powerful computational resources. However, it does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for conducting the experiments. |
| Software Dependencies | No | The paper mentions using 'ResNet-50', 'ResNet-18', and 'BERT' as backbones and 'SGD' and 'AdamW' as optimizers. It also refers to 'Cosine Annealing' and 'Linear' learning rate schedulers. However, it does not specify any software libraries (e.g., PyTorch, TensorFlow) or their version numbers, nor the version of the programming language used. |
| Experiment Setup | Yes | Table 9. Hyperparameters for ERM training. (Appendix A.8) Table 10. Hyperparameters for NeuronTune. (Appendix A.8) These tables provide specific details such as initial learning rate, number of epochs, learning rate scheduler, optimizer, backbone, weight decay, and batch size for different datasets. |
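The validation-split procedure quoted in the Dataset Splits row (D_val divided into two equal halves, one for identification and one for tuning) can be sketched as follows. This is a minimal illustration assuming a shuffled 50/50 index split; the function name `split_val` and all variable names are hypothetical and not taken from the authors' code.

```python
import random

def split_val(indices, seed=0):
    """Split validation indices into two equal halves: (D_Ide, D_Tune).

    Sketch of the split described in Section 5.5 of the paper; the
    shuffling and seeding details here are assumptions.
    """
    rng = random.Random(seed)
    shuffled = indices[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

# Example: split 100 validation samples into two disjoint halves of 50.
d_ide, d_tune = split_val(list(range(100)))
```

The two halves are disjoint and together cover the full validation set, matching the paper's description of using one half as D_Ide and the other as D_Tune.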