Diagnosing and Rectifying Vision Models using Language

Authors: Yuhui Zhang, Jeff Z. HaoChen, Shih-Cheng Huang, Kuan-Chieh Wang, James Zou, Serena Yeung

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "On a range of image datasets with known error slices, we demonstrate that our method can effectively identify the error slices and influential attributes, and can further use language to rectify failure modes of the classifier." and "In this section, we first demonstrate that text embeddings are good proxies for image embeddings in multi-modal contrastive representation space (Section 3.2). Based on that, we demonstrate how DrML successfully discovers error slices (Section 3.3), identifies influential attributes (Section 3.4), and further rectifies model misbehaviors on three datasets (Section 3.5)." (A hedged sketch of this text-probing setup appears after the table.)
Researcher Affiliation | Academia | "Yuhui Zhang, Jeff Z. HaoChen, Shih-Cheng Huang, Kuan-Chieh Wang, James Zou, Serena Yeung. Stanford University, Stanford, CA 94305, USA. {yuhuiz, jhaochen, mschuang, wangkua1, jamesz, syyeung}@stanford.edu"
Pseudocode | No | The paper describes its methods through text and mathematical formulas but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code | Yes | "We provide open-source implementation of our work at https://github.com/yuhui-zh15/drml. The implementations will enable researchers to reproduce all the experiments described here as well as run their own analyses on additional multi-modal models and datasets."
Open Datasets | Yes | "For cross-modality transferability (Section 3.2), we use the MS-COCO dataset (Lin et al., 2014)... For model diagnosis and rectification, we simulate the three common types of model failures. For spurious correlation, we use the Waterbirds dataset (Sagawa et al., 2020)... For underrepresented data, we use FairFace (Karkkainen & Joo, 2021)... For unseen data, we use dSpritesV (Matthey et al., 2017)..."
Dataset Splits | Yes | "MS-COCO. We follow the standard MS-COCO dataset split, which includes 118K / 5K images for training / validation." and "Waterbirds. We follow the standard Waterbirds dataset split, which includes 4.8K / 1.2K images for training / validation." and "FairFace. The final dataset contains 17K / 11K images for training / validation." and "dSpritesV. Finally, it has 1.3K / 8.7K images for training / validation." (A hedged loading sketch appears after the table.)
Hardware Specification | No | The paper does not provide specific details regarding the hardware used for running experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions using "CLIP (Radford et al., 2021)" but does not specify a version number for CLIP or any other software dependencies.
Experiment Setup | Yes | "We train the linear model or multi-layer perceptron for 25 epochs using the Adam optimizer with a fixed learning rate of 0.001." (Appendix B.5, Cross-modal Transferability Training Details) and "We continue training the pre-trained linear model or multi-layer perceptron for 10 epochs using the Adam optimizer with a fixed learning rate of 0.001." (Appendix B.5, Model Rectification Training Details) and "We reproduce GDRO on our datasets by adapting the official GDRO loss implementation to our code base. We use all the same hyperparameters they use in the paper, where important hyperparameters include l2 penalty strength α = 0.2 and group adjustment γ = 0.1." (Appendix C.3) and "We also perform a hyperparameter search on the upsampling weight λup ∈ {5, 20, 50}, which is a very important hyperparameter based on the paper. The best λup is 20 for Waterbirds and 5 for FairFace." (Appendix C.3) (A hedged training sketch appears after the table.)
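
The Research Type row quotes the paper's central claim: because CLIP embeds images and text in a shared contrastive space, a classifier head trained on image embeddings can be probed with text embeddings. Below is a minimal sketch of that idea, assuming OpenAI's `clip` package; the `head` classifier and the probe prompts are hypothetical placeholders, not the authors' exact setup.

```python
# Sketch: probe an image classifier with CLIP text embeddings.
# Assumes: pip install torch git+https://github.com/openai/CLIP.git
# `head` stands in for a linear classifier trained on CLIP *image*
# embeddings; feeding it *text* embeddings relies on the cross-modal
# transferability the paper demonstrates in Section 3.2.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical prompts describing a candidate error slice.
prompts = ["a photo of a waterbird on land", "a photo of a landbird on water"]

with torch.no_grad():
    text_emb = model.encode_text(clip.tokenize(prompts).to(device))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)  # unit-normalize

# Placeholder head for the sketch; in practice it would be trained on
# normalized image embeddings from the same CLIP model.
head = torch.nn.Linear(512, 2).to(device)
logits = head(text_emb.float())
print(logits.softmax(dim=-1))  # how the classifier treats each described slice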
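```

For the Dataset Splits row, the standard Waterbirds split quoted above (4.8K / 1.2K train / validation) is distributed by the WILDS benchmark package. The loading sketch below is an assumption of convenience; the authors' own data pipeline may differ.

```python
# Sketch: load the standard Waterbirds split via the WILDS package.
# Assumes: pip install wilds
from wilds import get_dataset

dataset = get_dataset(dataset="waterbirds", download=True)
train_data = dataset.get_subset("train")  # ~4.8K images per the paper
val_data = dataset.get_subset("val")      # ~1.2K images per the paper
print(len(train_data), len(val_data))
```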
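
Finally, the Experiment Setup row reports a simple recipe: a linear probe (or MLP) over frozen embeddings, trained with Adam at a fixed learning rate of 0.001 for 25 epochs, then continued for 10 epochs during rectification. A minimal sketch follows; `train_loader` and `rectify_loader` are assumed placeholders yielding (embedding, label) batches, and the authors' released repository has the exact pipeline.

```python
# Sketch of the reported training recipe (Appendix B.5 of the paper).
import torch
import torch.nn as nn

def fit(head, loader, epochs, lr=1e-3, device="cpu"):
    """Train a classifier head on frozen embeddings with Adam at a fixed lr."""
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    head.train()
    for _ in range(epochs):
        for emb, label in loader:
            opt.zero_grad()
            loss = loss_fn(head(emb.to(device)), label.to(device))
            loss.backward()
            opt.step()
    return head

head = nn.Linear(512, 2)  # linear probe; the paper's alternative is an MLP
# head = fit(head, train_loader, epochs=25)    # initial training
# head = fit(head, rectify_loader, epochs=10)  # continued training to rectify
```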