Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

HiBug: On Human-Interpretable Model Debug

Authors: Muxi Chen, Yu Li, Qiang Xu

NeurIPS 2023 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experiments on dcbench [Eyuboglu et al., 2022] show that HiBug can identify up to 85% correlation errors and 81% rare cases. We also use three different tasks to show that HiBug benefits data selection and data generation for model improvement.
Researcher Affiliation	Academia	The Chinese University of Hong Kong; Harbin Institute of Technology, Shenzhen; EMAIL; EMAIL
Pseudocode	No	The paper describes algorithms in text (e.g., in section 3.3 for data selection), but does not provide structured pseudocode or algorithm blocks.
Open Source Code	Yes	Code is available at: https://github.com/cure-lab/HiBug.
Open Datasets	Yes	The correlation discovery task in dcbench consists of 880 problems. Each problem contains common materials used in image classification, such as a model checkpoint, the model's predictions on a validation set, and labels for the validation set. Notably, each problem also includes a description of the erroneous correlation, indicating the name of the feature that the model's prediction is correlated with. For example, the prediction of a human-related classification model can be correlated with gender. The rare case discovery task in dcbench contains 118 problems. Apart from the common materials, each problem also has the name of a rare case.
Dataset Splits	Yes	Table 2: Basic information of three experiment settings. Split denotes the number of data in train set : validation set : test set : unlabeled pool. Lipstick: 2 classes, split 80k:10k:10k:80k. RPC: 200 classes, split 54k:25k:74k:49k. ImageNet10: 10 classes, split 10k:28k:2k:10k.
Hardware Specification	Yes	In our experiments, we utilized an Intel(R) Xeon(R) Silver 4310 CPU @ 2.10GHz and an NVIDIA GeForce RTX 3090 GPU.
Software Dependencies	No	The paper mentions using specific models like ChatGPT and BLIP, and frameworks like Stable Diffusion, but it does not specify version numbers for these or other software dependencies.
Experiment Setup	Yes	Experiment setup. In continuation of the previous experiment, we extend our evaluation to ascertain whether the attribute values of bug slices can be further employed for data selection and data generation, ultimately contributing to model enhancement.