Adding seemingly uninformative labels helps in low data regimes

Authors: Christos Matsoukas, Albert Bou Hernandez, Yue Liu, Karin Dembrower, Gisele Miranda, Emir Konuk, Johan Fredin Haslum, Athanasios Zouzos, Peter Lindholm, Fredrik Strand, Kevin Smith

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that, in low-data settings, performance can be improved by complementing the expert annotations with seemingly uninformative labels, turning the task into a multi-class problem. We demonstrate our findings on CSAW-S, a new dataset that we introduce here, and confirm them on two public datasets.
Researcher Affiliation | Collaboration | 1 School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, Stockholm, Sweden; 2 Science for Life Laboratory, Stockholm, Sweden; 3 AstraZeneca, Gothenburg, Sweden; 4 Karolinska Institutet, Stockholm, Sweden; 5 Capio Sankt Göran Hospital, Stockholm, Sweden; 6 Karolinska University Hospital, Stockholm, Sweden.
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | Finally, to promote transparency and reproducibility, we share our open-source code, available at github.com/ChrisMats/seemingly_uninformative_labels and CSAW-S at https://github.com/ChrisMats/CSAW-S.
Open Datasets | Yes | We release the CSAW-S dataset used in this study to the public, which contains valuable mammography images with labels from multiple experts and non-experts that can be used to replicate our study and for other segmentation tasks. [...] Finally, to promote transparency and reproducibility, we share our open-source code, available at github.com/ChrisMats/seemingly_uninformative_labels and CSAW-S at https://github.com/ChrisMats/CSAW-S. [...] We validate our findings by demonstrating that the observed effect holds in other domains, using public datasets including CITYSCAPES and PASCAL VOC in Section 5.
Dataset Splits | Yes | The patients are split into a test set of 26 images from 23 patients and training/validation set containing 312 images from 150 patients. [...] We split the train/validation sets by patient, 130/20. This resulted in 263/49 images per set. [...] We randomly select 500 images from the official training set to use as the validation set, and we use the rest for training. [...] We used the official train-2012 as training set and we sampled 500 images from the test-2012 as our validation set.
Hardware Specification | No | No specific hardware (e.g., GPU/CPU models, memory, or cloud instance types) used to run the experiments is mentioned.
Software Dependencies | No | The paper mentions software components such as DeepLabv3, ResNet50, and the Adam optimizer, but does not provide version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow, or CUDA).
Experiment Setup | Yes | We use DeepLabv3 (Chen et al., 2017) with ResNet50 (He et al., 2016) as the backbone for all experiments. Following He et al. and Raghu et al., we initialize all models with IMAGENET pretrained weights and we replace BatchNorm layers with GroupNorm layers (Wu & He, 2018). We use an Adam (Kingma & Ba, 2014) optimizer throughout our experiments. Due to memory limitations and the high resolution of mammograms, we train using 512×512 patches. To ensure good representation in the training data, for every full image we sample a center-cropped patch from 10 random locations belonging to each of the 12 classes (the same for training with and without complementary labels). To alleviate overfitting issues associated with extreme low data regimes, we employ an extensive set of augmentations including rotations and elastic transformations in addition to standard random flips, random crops of 448×448, random brightness and random contrast augmentations. We report results for each run using the best checkpoint model. Since the cross entropy loss does not precisely represent the IoU metric, we consider both the validation IoU and loss when selecting the best model. For all of our experiments we fine-tuned the learning rate for each setting and the results are averaged over 5 runs, unless otherwise specified.
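
The Research Type row above summarizes the paper's core idea: complementing the expert annotation with "seemingly uninformative" labels so that segmentation becomes a multi-class problem. As a rough illustration of that label construction, here is a minimal sketch assuming one binary expert mask plus a set of complementary masks per image; the function name, class ordering, and overlap rule are illustrative assumptions, not details taken from the released code.

```python
import numpy as np

def build_multiclass_target(expert_mask, complementary_masks):
    """Combine a binary expert mask with "seemingly uninformative"
    complementary masks into one multi-class segmentation target.

    expert_mask:          (H, W) bool array for the expert-labelled class.
    complementary_masks:  list of (H, W) bool arrays for surrounding
                          structures annotated by non-experts.

    Returns an (H, W) int64 label map: 0 = unlabeled background,
    1 = expert class, 2..K+1 = complementary classes.
    """
    target = np.zeros(expert_mask.shape, dtype=np.int64)
    for class_id, mask in enumerate(complementary_masks, start=2):
        target[mask] = class_id
    target[expert_mask] = 1  # the expert label takes priority on overlaps
    return target
```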
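
The Dataset Splits row reports a patient-level split of CSAW-S (130 training vs. 20 validation patients), so that all images from a patient end up in a single set. A hedged sketch of such a split is given below; the helper name, arguments, and shuffling scheme are assumptions and may differ from the released code.

```python
import numpy as np

def split_by_patient(image_paths, patient_ids, n_val_patients=20, seed=0):
    """Patient-level train/validation split: every image from a given
    patient lands in exactly one of the two sets."""
    rng = np.random.default_rng(seed)
    patients = np.unique(patient_ids)
    rng.shuffle(patients)
    val_patients = set(patients[:n_val_patients])

    train, val = [], []
    for path, pid in zip(image_paths, patient_ids):
        (val if pid in val_patients else train).append(path)
    return train, val
```

With 150 patients and n_val_patients=20 this reproduces the 130/20 patient split quoted above, although the resulting image counts per set (263/49 in the paper) depend on how many images each sampled patient contributes.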
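
The Experiment Setup row describes DeepLabv3 with a ResNet50 backbone, IMAGENET-pretrained weights, BatchNorm layers replaced by GroupNorm, and an Adam optimizer. The sketch below shows one way to assemble that configuration with a recent torchvision (≥ 0.13); the group count of 32, the learning rate, and the use of torchvision's model builder are assumptions rather than details confirmed by the paper or its repository.

```python
import torch
import torch.nn as nn
from torchvision.models import ResNet50_Weights
from torchvision.models.segmentation import deeplabv3_resnet50

def bn_to_gn(module, num_groups=32):
    """Recursively swap every BatchNorm2d layer for GroupNorm (Wu & He, 2018).
    num_groups=32 follows the GroupNorm paper's default; the value used by
    the authors is not stated."""
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name, nn.GroupNorm(num_groups, child.num_features))
        else:
            bn_to_gn(child, num_groups)
    return module

# DeepLabv3 with a ResNet50 backbone initialized from ImageNet weights,
# predicting the 12 classes described above.
model = deeplabv3_resnet50(
    weights_backbone=ResNet50_Weights.IMAGENET1K_V1,
    num_classes=12,
)
model = bn_to_gn(model)  # replace BatchNorm with GroupNorm

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # placeholder; the paper tunes the lr per setting
criterion = nn.CrossEntropyLoss()  # validation IoU and loss are both used for checkpoint selection
```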