Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Mitigating Spurious Correlations via Disagreement Probability
Authors: Hyeonggeun Han, Sehwan Kim, Hyungjun Joo, Sangwoo Hong, Jungwoo Lee
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations on multiple benchmarks demonstrate that DPR achieves state-of-the-art performance over existing baselines that do not use bias labels. |
| Researcher Affiliation | Collaboration | ¹ECE & ²Next Quantum, Seoul National University; ³Hodoo AI Labs |
| Pseudocode | Yes | Algorithm 1: Disagreement Probability based Resampling for debiasing (DPR) |
| Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] |
| Open Datasets | Yes | Colored MNIST (C-MNIST) is a synthetic dataset designed for digit classification, comprising ten digits, each spuriously correlated with a specific color. Following the protocols in Ahn et al. [1], we set the ratios of bias-conflicting samples, denoted as ρ, at 0.5%, 1%, and 5% for the training set, and 90% for the unbiased test set. |
| Dataset Splits | Yes | Additionally, we use a 10% of training data as validation data, and an unbiased test set with a bias-conflicting ratio of 90% is employed for performance evaluation. |
| Hardware Specification | Yes | All classification models are trained using an NVIDIA RTX A6000. |
| Software Dependencies | No | The paper mentions optimizers (SGD, Adam, AdamW) and models (BERT) but does not provide specific version numbers for any software dependencies (e.g., library or framework versions like PyTorch 1.9). |
| Experiment Setup | Yes | We train the model for 100 epochs with SGD optimizer, a batch size of 128, a learning rate of 0.02, weight decay of 0.001, momentum of 0.9, and learning rate decay of 0.1 at every 40 steps. |
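
The Experiment Setup and Dataset Splits rows can be read as a small configuration. The following is a minimal sketch in plain Python, not the authors' code. It assumes that "learning rate decay of 0.1 at every 40 steps" means the rate is multiplied by 0.1 every 40 epochs, and that the 10% validation set is drawn uniformly at random. The names `CONFIG`, `learning_rate_at`, and `split_train_val`, as well as the seed, are hypothetical and do not appear in the paper.

```python
import random

# Hyperparameters quoted in the Experiment Setup and Dataset Splits rows.
CONFIG = {
    "epochs": 100,
    "optimizer": "SGD",
    "batch_size": 128,
    "learning_rate": 0.02,
    "weight_decay": 0.001,
    "momentum": 0.9,
    "lr_decay_factor": 0.1,
    "lr_decay_interval": 40,  # ASSUMPTION: counted in epochs, not iterations
    "val_fraction": 0.1,
}


def learning_rate_at(epoch, config=CONFIG):
    """Step decay: multiply the base rate by the decay factor once per interval."""
    num_decays = epoch // config["lr_decay_interval"]
    return config["learning_rate"] * config["lr_decay_factor"] ** num_decays


def split_train_val(num_samples, val_fraction=CONFIG["val_fraction"], seed=0):
    """Hold out a random fraction of the training indices for validation.

    ASSUMPTION: uniform random sampling with a hypothetical seed; the
    source row does not say how the validation samples are chosen.
    """
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    num_val = int(num_samples * val_fraction)
    return indices[num_val:], indices[:num_val]


if __name__ == "__main__":
    # Print the learning rate just before and after each assumed decay point.
    for epoch in (0, 39, 40, 79, 80, 99):
        print(epoch, learning_rate_at(epoch))
    train_idx, val_idx = split_train_val(1000)
    print(len(train_idx), len(val_idx))
```

Under these assumptions the rate stays at 0.02 for epochs 0 to 39, drops to roughly 0.002 for epochs 40 to 79, and to roughly 0.0002 for the remaining epochs. If the paper's "steps" refers to optimizer iterations instead, only `lr_decay_interval` and the argument passed to `learning_rate_at` would change.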