HiBug: On Human-Interpretable Model Debug

Authors: Muxi Chen, Yu Li, Qiang Xu

NeurIPS 2023

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments on dcbench [Eyuboglu et al., 2022] show that HiBug can identify up to 85% correlation errors and 81% rare cases. We also use three different tasks to show that HiBug benefits data selection and data generation for model improvement.
Researcher Affiliation Academia The Chinese University of Hong Kong; Harbin Institute of Technology, Shenzhen. {mxchen21,qxu}@cse.cuhk.edu.hk; li.yu@hit.edu.cn
Pseudocode No The paper describes algorithms in text (e.g., in section 3.3 for data selection), but does not provide structured pseudocode or algorithm blocks.
Open Source Code Yes Code is available at: https://github.com/cure-lab/HiBug.
Open Datasets Yes The correlation discovery task in dcbench consists of 880 problems. Each problem contains common materials used in image classification, such as a model checkpoint, the model's predictions on a validation set, and labels for the validation set. Notably, each problem also includes a description of the erroneous correlation, indicating the name of the feature that the model's prediction is correlated with. For example, the prediction of a human-related classification model can be correlated with gender. The rare case discovery task in dcbench contains 118 problems. Apart from the common materials, each problem also has the name of a rare case.
Dataset Splits Yes Table 2: Basic information of three experiment settings. Split denotes the number of data in train set : validation set : test set : unlabeled pool. Lipstick: 2 classes, split 80k:10k:10k:80k. RPC: 200 classes, split 54k:25k:74k:49k. ImageNet10: 10 classes, split 10k:28k:2k:10k.
Hardware Specification Yes In our experiments, we utilized an Intel(R) Xeon(R) Silver 4310 CPU @ 2.10GHz and an NVIDIA GeForce RTX 3090 GPU.
Software Dependencies No The paper mentions using specific models like ChatGPT and BLIP, and frameworks like Stable Diffusion, but it does not specify version numbers for these or other software dependencies.
Experiment Setup Yes Experiment setup. In continuation of the previous experiment, we extend our evaluation to ascertain whether the attribute values of bug slices can be further employed for data selection and data generation, ultimately contributing to model enhancement.
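
The split notation quoted from Table 2 (train : validation : test : unlabeled pool) is compact but easy to misread. The sketch below shows one way to expand it into per-part counts and fractions. It is not taken from the HiBug code base: the names SPLITS, PARTS, parse_split and split_fractions are hypothetical, and it assumes that "k" means 1,000 and that the reported figures are rounded.

```python
from typing import Dict

# Split strings copied from Table 2 as quoted above.
# Assumed order: train : validation : test : unlabeled pool.
SPLITS = {
    "Lipstick": "80k:10k:10k:80k",
    "RPC": "54k:25k:74k:49k",
    "ImageNet10": "10k:28k:2k:10k",
}
PARTS = ("train", "validation", "test", "unlabeled_pool")


def parse_split(split: str) -> Dict[str, int]:
    """Turn a string such as '80k:10k:10k:80k' into per-part sample counts."""
    # Assumption: every field is an integer followed by 'k' (thousands).
    counts = [int(field.rstrip("k")) * 1000 for field in split.split(":")]
    if len(counts) != len(PARTS):
        raise ValueError(f"expected {len(PARTS)} fields, got {len(counts)}")
    return dict(zip(PARTS, counts))


def split_fractions(split: str) -> Dict[str, float]:
    """Return each part's share of the total number of samples."""
    counts = parse_split(split)
    total = sum(counts.values())
    return {part: n / total for part, n in counts.items()}


if __name__ == "__main__":
    for name, split in SPLITS.items():
        counts = parse_split(split)
        print(name, counts, "total:", sum(counts.values()))
```

Since the table reports rounded counts, the fractions are approximate and useful mainly for comparing how the three settings divide their data.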