Domino: Discovering Systematic Errors with Cross-Modal Embeddings
Authors: Sabri Eyuboglu, Maya Varma, Khaled Kamal Saab, Jean-Benoit Delbrouck, Christopher Lee-Messer, Jared Dunnmon, James Zou, Christopher Ré
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we address these challenges by first designing a principled evaluation framework that enables a quantitative comparison of SDMs across 1,235 slice discovery settings in three input domains (natural images, medical images, and time-series data). Then, motivated by the recent development of powerful cross-modal representation learning approaches, we present Domino, an SDM that leverages cross-modal embeddings and a novel error-aware mixture model to discover and describe coherent slices. We find that Domino accurately identifies 36% of the 1,235 slices in our framework, a 12-percentage-point improvement over prior methods. |
| Researcher Affiliation | Academia | Stanford University, USA; {eyuboglu,mvarma2,ksaab}@stanford.edu |
| Pseudocode | Yes | Algorithm 1 SDM Evaluation Process (a sketch of this evaluation loop appears after the table) |
| Open Source Code | Yes | We provide an open-source implementation of our evaluation framework at https://github.com/HazyResearch/domino. Users can run Domino on their own models and datasets by installing our Python package via: pip install domino. (A hedged usage sketch follows the table.) |
| Open Datasets | Yes | We obtain a dataset of short 12-second electroencephalography (EEG) signals, which have been used in prior work for predicting the onset of seizures (Saab et al., 2020). ... The CelebFaces Attributes Dataset (CelebA) includes over 200k images with 40 labeled attributes (Liu et al., 2015). ImageNet includes 1.2 million images across 1000 labeled classes organized in a hierarchical structure (Deng et al., 2009; Fellbaum, 1998). ... The MIMIC Chest X-Ray (MIMIC-CXR) dataset includes 377,110 chest x-rays collected from the Beth Israel Deaconess Medical Center. Annotations indicate the presence or absence of fourteen conditions (Johnson et al., 2019; 2020). |
| Dataset Splits | No | The paper states that 'validation' data is used (e.g., 'We assume that training, validation and test data are drawn independently and identically from this distribution.' in Section 3, and 'Fit the SDM on the validation set' in Algorithm 1, Section 4). It also mentions 'early stopping using the validation dataset' in Section A.3.3. However, it does not provide specific percentages or counts for the validation split. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running the experiments. It mentions the types of models (e.g., ResNet-18, densely connected inception convolution neural network) but not the underlying hardware (e.g., specific GPUs, CPUs, or memory). |
| Software Dependencies | No | The paper mentions several software components and tools such as 'Adam optimizer', 'BERT-based transformer', 'CheXbert', 'ViT', 'CLIP', and 'ViLMedic'. However, it does not provide specific version numbers for these software dependencies, which would be necessary for reproducibility. |
| Experiment Setup | Yes | For our natural image settings and medical image settings, we used a ResNet-18 randomly initialized with He initialization (He et al., 2015; 2016). We applied an Adam optimizer with learning rate 1 × 10⁻⁴ for 10 epochs and used early stopping on the validation dataset (Kingma & Ba, 2017). During training, we randomly crop each image, resize to 224 × 224, apply a random horizontal flip, and normalize using ImageNet mean and standard deviation (µ = [0.485, 0.456, 0.406], σ = [0.229, 0.224, 0.225]). ... For our medical time series settings, we use a densely connected inception convolutional neural network (Roy et al., 2019) randomly initialized with He initialization (He et al., 2015; 2016). Since the EEG signals are sampled at 200 Hz, and the EEG clip length is 12 seconds, with 19 EEG electrodes, the input EEG has shape 19 × 2400. The models are trained with a learning rate of 10⁻⁶ and a batch size of 16 for 15 epochs. ... We train our implementation for 30 epochs with a learning rate of 10⁻⁴, a batch size of 64, and an embedding dimension of 256. The training process comes to an early stop if the loss fails to decrease for ten epochs. ... The cross-modal model is trained with a learning rate of 10⁻⁶, an embedding dimension of 128, and a batch size of 32 for 200 epochs. (A PyTorch sketch of the image-training recipe follows the table.) |
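The "Pseudocode" row points to Algorithm 1, which fits an SDM on the validation set of each slice-discovery setting and checks whether any discovered slice recovers the induced ground-truth slice. A minimal Python sketch of that loop, assuming an `sdm.fit_predict` interface and precision-at-k scoring; the function names, dictionary keys, and `k=10` are our assumptions, not the paper's exact interface:

```python
import numpy as np

def precision_at_k(slice_scores, true_slice_mask, k=10):
    """Fraction of the k highest-scored examples that lie in the true slice."""
    top_k = np.argsort(-slice_scores)[:k]
    return true_slice_mask[top_k].mean()

def evaluate_sdm(sdm, settings, k=10):
    """Average precision-at-k of an SDM over many slice-discovery settings.

    Each setting supplies validation-set embeddings, labels, the audited
    model's predicted probabilities, and a boolean mask marking the
    ground-truth (induced) slice.
    """
    scores = []
    for s in settings:
        # Fit the SDM on the validation set, as in Algorithm 1.
        membership = sdm.fit_predict(
            embeddings=s["embeddings"],   # (n, d) cross-modal embeddings
            targets=s["targets"],         # (n,) ground-truth labels
            pred_probs=s["pred_probs"],   # (n, c) model outputs
        )                                 # assumed to return (n, n_slices) scores
        # Credit the discovered slice that best matches the ground truth.
        best = max(
            precision_at_k(membership[:, j], s["true_slice"], k)
            for j in range(membership.shape[1])
        )
        scores.append(best)
    return float(np.mean(scores))
```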
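The package cited in the "Open Source Code" row can be pointed at arbitrary embeddings, labels, and predictions. A hedged usage sketch: the `DominoSlicer` class name, its keyword arguments, and the `n_slices` parameter follow our reading of the repository and should be checked against https://github.com/HazyResearch/domino before use.

```python
# pip install domino
import numpy as np
from domino import DominoSlicer  # class name per our reading of the README

# Placeholder inputs standing in for a real validation set: cross-modal
# (e.g., CLIP) embeddings, ground-truth labels, and the audited model's
# predicted probabilities.
rng = np.random.default_rng(0)
val_embeddings = rng.normal(size=(1000, 512))
val_targets = rng.integers(0, 2, size=1000)
val_probs = rng.uniform(size=(1000, 2))

# Fit the error-aware mixture model and score slice membership.
slicer = DominoSlicer(n_slices=10)
slicer.fit(embeddings=val_embeddings, targets=val_targets, pred_probs=val_probs)
slice_probs = slicer.predict_proba(
    embeddings=val_embeddings, targets=val_targets, pred_probs=val_probs
)  # (n_examples, n_slices) membership probabilities
```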
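The natural/medical image recipe in the "Experiment Setup" row maps onto standard PyTorch pieces. A minimal sketch, assuming torchvision and a binary task; the dataloaders, the two-class head, and the early-stopping patience are placeholders rather than details from the paper:

```python
import torch
from torch import nn, optim
from torchvision import models, transforms

# Augmentations quoted above: random crop + resize to 224x224, random
# horizontal flip, ImageNet normalization.
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),  # approximates "randomly crop, resize to 224x224"
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# ResNet-18 from scratch; torchvision applies He (Kaiming) init to conv layers.
model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, 2)  # placeholder two-class head

optimizer = optim.Adam(model.parameters(), lr=1e-4)  # lr 1e-4, as quoted
criterion = nn.CrossEntropyLoss()

def train(train_loader, val_loader, epochs=10, patience=3):
    """10-epoch training with early stopping on validation loss.

    The patience of 3 is our placeholder; the paper only says early
    stopping uses the validation dataset.
    """
    best_val, stale = float("inf"), 0
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x), y).item() for x, y in val_loader)
        if val_loss < best_val:
            best_val, stale = val_loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
```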