Domino: Discovering Systematic Errors with Cross-Modal Embeddings
Authors: Sabri Eyuboglu, Maya Varma, Khaled Kamal Saab, Jean-Benoit Delbrouck, Christopher Lee-Messer, Jared Dunnmon, James Zou, Christopher Ré
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we address these challenges by first designing a principled evaluation framework that enables a quantitative comparison of SDMs across 1,235 slice discovery settings in three input domains (natural images, medical images, and time-series data). Then, motivated by the recent development of powerful cross-modal representation learning approaches, we present Domino, an SDM that leverages cross-modal embeddings and a novel error-aware mixture model to discover and describe coherent slices. We find that Domino accurately identifies 36% of the 1,235 slices in our framework, a 12-percentage-point improvement over prior methods. |
| Researcher Affiliation | Academia | Stanford University, USA; {eyuboglu,mvarma2,ksaab}@stanford.edu |
| Pseudocode | Yes | Algorithm 1 SDM Evaluation Process (a sketch of this evaluation loop appears after the table) |
| Open Source Code | Yes | We provide an open-source implementation of our evaluation framework at https://github.com/HazyResearch/domino. Users can run Domino on their own models and datasets by installing our Python package via: pip install domino. (A hedged usage sketch follows the table.) |
| Open Datasets | Yes | We obtain a dataset of short 12-second electroencephalography (EEG) signals, which have been used in prior work for predicting the onset of seizures (Saab et al., 2020). ... The CelebFaces Attributes Dataset (CelebA) includes over 200k images with 40 labeled attributes (Liu et al., 2015). ImageNet includes 1.2 million images across 1000 labeled classes organized in a hierarchical structure (Deng et al., 2009; Fellbaum, 1998). ... The MIMIC Chest X-Ray (MIMIC-CXR) dataset includes 377,110 chest x-rays collected from the Beth Israel Deaconess Medical Center. Annotations indicate the presence or absence of fourteen conditions (Johnson et al., 2019; 2020). |
| Dataset Splits | No | The paper states that 'validation' data is used (e.g., 'We assume that training, validation and test data are drawn independently and identically from this distribution.' in Section 3, and 'Fit the SDM on the validation set' in Algorithm 1, Section 4). It also mentions 'early stopping using the validation dataset' in Section A.3.3. However, it does not provide specific percentages or counts for the validation split. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running the experiments. It mentions the types of models (e.g., ResNet-18, densely connected inception convolution neural network) but not the underlying hardware (e.g., specific GPUs, CPUs, or memory). |
| Software Dependencies | No | The paper mentions several software components and tools such as 'Adam optimizer', 'BERT-based transformer', 'CheXbert', 'ViT', 'CLIP', and 'ViLMedic'. However, it does not provide specific version numbers for these software dependencies, which would be necessary for reproducibility. |
| Experiment Setup | Yes | For our natural image settings and medical image settings, we used a ResNet-18 randomly initialized with He initialization (He et al., 2015; 2016). We applied an Adam optimizer with learning rate 1 × 10⁻⁴ for 10 epochs and used early stopping on the validation dataset (Kingma & Ba, 2017). During training, we randomly crop each image, resize to 224 × 224, apply a random horizontal flip, and normalize using ImageNet mean and standard deviation (µ = [0.485, 0.456, 0.406], σ = [0.229, 0.224, 0.225]). ... For our medical time series settings, we use a densely connected inception convolutional neural network (Roy et al., 2019) randomly initialized with He initialization (He et al., 2015; 2016). Since the EEG signals are sampled at 200 Hz, and the EEG clip length is 12 seconds, with 19 EEG electrodes, the input EEG has shape 19 × 2400. The models are trained with a learning rate of 10⁻⁶ and a batch size of 16 for 15 epochs. ... We train our implementation for 30 epochs with a learning rate of 10⁻⁴, a batch size of 64, and an embedding dimension of 256. The training process comes to an early stop if the loss fails to decrease for ten epochs. ... The cross-modal model is trained with a learning rate of 10⁻⁶, an embedding dimension of 128, and a batch size of 32 for 200 epochs. (A PyTorch sketch of the image-training recipe follows the table.) |
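The "Pseudocode" row points to Algorithm 1, which fits an SDM on the validation set of each slice-discovery setting and checks whether any discovered slice recovers the induced ground-truth slice. A minimal Python sketch of that loop, assuming an `sdm.fit_predict` interface and precision-at-k scoring; the function names, dictionary keys, and `k=10` are our assumptions, not the paper's exact interface:

```python
import numpy as np

def precision_at_k(slice_scores, true_slice_mask, k=10):
    """Fraction of the k highest-scored examples that lie in the true slice."""
    top_k = np.argsort(-slice_scores)[:k]
    return true_slice_mask[top_k].mean()

def evaluate_sdm(sdm, settings, k=10):
    """Average precision-at-k of an SDM over many slice-discovery settings.

    Each setting supplies validation-set embeddings, labels, the audited
    model's predicted probabilities, and a boolean mask marking the
    ground-truth (induced) slice.
    """
    scores = []
    for s in settings:
        # Fit the SDM on the validation set, as in Algorithm 1.
        membership = sdm.fit_predict(
            embeddings=s["embeddings"],   # (n, d) cross-modal embeddings
            targets=s["targets"],         # (n,) ground-truth labels
            pred_probs=s["pred_probs"],   # (n, c) model outputs
        )                                 # assumed to return (n, n_slices) scores
        # Credit the discovered slice that best matches the ground truth.
        best = max(
            precision_at_k(membership[:, j], s["true_slice"], k)
            for j in range(membership.shape[1])
        )
        scores.append(best)
    return float(np.mean(scores))
```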
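The package cited in the "Open Source Code" row can be pointed at arbitrary embeddings, labels, and predictions. A hedged usage sketch: the `DominoSlicer` class name, its keyword arguments, and the `n_slices` parameter follow our reading of the repository and should be checked against https://github.com/HazyResearch/domino before use.

```python
# pip install domino
import numpy as np
from domino import DominoSlicer  # class name per our reading of the README

# Placeholder inputs standing in for a real validation set: cross-modal
# (e.g., CLIP) embeddings, ground-truth labels, and the audited model's
# predicted probabilities.
rng = np.random.default_rng(0)
val_embeddings = rng.normal(size=(1000, 512))
val_targets = rng.integers(0, 2, size=1000)
val_probs = rng.uniform(size=(1000, 2))

# Fit the error-aware mixture model and score slice membership.
slicer = DominoSlicer(n_slices=10)
slicer.fit(embeddings=val_embeddings, targets=val_targets, pred_probs=val_probs)
slice_probs = slicer.predict_proba(
    embeddings=val_embeddings, targets=val_targets, pred_probs=val_probs
)  # (n_examples, n_slices) membership probabilities
```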
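The natural/medical image recipe in the "Experiment Setup" row maps onto standard PyTorch pieces. A minimal sketch, assuming torchvision and a binary task; the dataloaders, the two-class head, and the early-stopping patience are placeholders rather than details from the paper:

```python
import torch
from torch import nn, optim
from torchvision import models, transforms

# Augmentations quoted above: random crop + resize to 224x224, random
# horizontal flip, ImageNet normalization.
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),  # approximates "randomly crop, resize to 224x224"
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# ResNet-18 from scratch; torchvision applies He (Kaiming) init to conv layers.
model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, 2)  # placeholder two-class head

optimizer = optim.Adam(model.parameters(), lr=1e-4)  # lr 1e-4, as quoted
criterion = nn.CrossEntropyLoss()

def train(train_loader, val_loader, epochs=10, patience=3):
    """10-epoch training with early stopping on validation loss.

    The patience of 3 is our placeholder; the paper only says early
    stopping uses the validation dataset.
    """
    best_val, stale = float("inf"), 0
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x), y).item() for x, y in val_loader)
        if val_loss < best_val:
            best_val, stale = val_loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
```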