Scarf: Self-Supervised Contrastive Learning using Random Feature Corruption

Authors: Dara Bahri, Heinrich Jiang, Yi Tay, Donald Metzler

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose SCARF, a simple, widely-applicable technique for contrastive learning, where views are formed by corrupting a random subset of features. When applied to pre-train deep neural networks on the 69 real-world, tabular classification datasets from the OpenML-CC18 benchmark, SCARF not only improves classification accuracy in the fully-supervised setting but does so also in the presence of label noise and in the semi-supervised setting where only a fraction of the available training data is labeled. We show that SCARF complements existing strategies and outperforms alternatives like autoencoders. We conduct comprehensive ablations, detailing the importance of a range of factors.
Researcher Affiliation | Industry | Dara Bahri, Heinrich Jiang, Yi Tay, Donald Metzler; Google Research; {dbahri,heinrichj,yitay,metzler}@google.com
Pseudocode | Yes | Algorithm 1: SCARF pre-training algorithm (see the sketch after the table).
Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the described methodology.
Open Datasets | Yes | We use 69 datasets from the public OpenML-CC18 benchmark under the CC-BY licence. It consists of 72 real-world classification datasets that have been manually curated for effective benchmarking.
Dataset Splits | Yes | For each OpenML dataset, we form 70%/10%/20% train/validation/test splits, where a different split is generated for every trial and all methods use the same splits (sketched after the table).
Hardware Specification | No | Experiments were run on a cloud cluster of CPUs, and we used about one million CPU core hours in total for the experiments. (The description "cloud cluster of CPUs" is too general and does not provide specific model numbers or detailed specifications.)
Software Dependencies | Yes | We use the Python API, version 0.6, and choose the default settings for XGBClassifier (max depth of 3, 100 estimators, learning rate of 0.1).
Experiment Setup | Yes | We choose all three component models to be ReLU networks with hidden dimension 256. f consists of 4 layers, whereas both g and h have 2 layers. Both SCARF and the autoencoder baselines use g (for both pre-training and co-training, described later), but for autoencoders, the output dimensionality is the input feature dimensionality, and the mean-squared error reconstruction loss is applied. We train all models and their components with the Adam optimizer using the default learning rate of 0.001. For both pre-training and fine-tuning we use a batch size of 128. Unsupervised pre-training methods all use early stopping with patience 3 on the validation loss, unless otherwise noted. Supervised fine-tuning uses this same criterion (and validation split), but classification error is used as the validation metric for early stopping, as it performs slightly better. We set a max number of fine-tune epochs of 200 and pre-train epochs of 1000. We use 10 epochs to build the static validation set. Unless otherwise noted, we use a corruption rate c of 0.6 and a temperature τ of 1 for SCARF-based methods. All runs are repeated 30 times using different train/validation/test splits.
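
The Pseudocode and Experiment Setup rows above pin down most of the pre-training recipe (4-layer encoder f, 2-layer head g, hidden dimension 256, Adam at 0.001, batch size 128, corruption rate 0.6, temperature 1). The sketch below is our hedged illustration in PyTorch, not the authors' implementation: it assumes corrupted entries are resampled from each feature's empirical marginal, uses a Bernoulli(c) mask to approximate "corrupting a random subset of features", and uses InfoNCE over in-batch negatives; the class and function names are ours.

```python
# Minimal sketch of SCARF pre-training assembled from the rows above; not the
# authors' code. Assumptions (ours): corrupted entries are resampled from each
# feature's empirical marginal, a Bernoulli(c) mask approximates "corrupting a
# random subset of features", and the loss is InfoNCE over in-batch negatives.
import torch
import torch.nn as nn
import torch.nn.functional as F


def mlp(in_dim: int, hidden: int, n_layers: int) -> nn.Sequential:
    layers, d = [], in_dim
    for _ in range(n_layers):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    return nn.Sequential(*layers)


class SCARFPretrainer(nn.Module):
    """Encoder f (4 layers) + pre-training head g (2 layers), hidden dim 256."""

    def __init__(self, n_features: int, hidden: int = 256,
                 corruption_rate: float = 0.6, tau: float = 1.0):
        super().__init__()
        self.f = mlp(n_features, hidden, n_layers=4)
        self.g = mlp(hidden, hidden, n_layers=2)
        self.c, self.tau = corruption_rate, tau

    def corrupt(self, x: torch.Tensor, x_train: torch.Tensor) -> torch.Tensor:
        # Replace roughly a fraction c of each row's features with values drawn
        # from the same feature's column of the training matrix (its marginal).
        b, d = x.shape
        mask = torch.rand(b, d, device=x.device) < self.c
        rows = torch.randint(0, x_train.size(0), (b, d), device=x.device)
        marginal_samples = x_train[rows, torch.arange(d, device=x.device)]
        return torch.where(mask, marginal_samples, x)

    def forward(self, x: torch.Tensor, x_train: torch.Tensor) -> torch.Tensor:
        z_anchor = F.normalize(self.g(self.f(x)), dim=1)
        z_positive = F.normalize(self.g(self.f(self.corrupt(x, x_train))), dim=1)
        logits = z_anchor @ z_positive.t() / self.tau   # in-batch negatives
        labels = torch.arange(x.size(0), device=x.device)
        return F.cross_entropy(logits, labels)          # InfoNCE loss


# One pre-training step with the quoted optimizer and batch size.
if __name__ == "__main__":
    x_train = torch.rand(1000, 20)                # stand-in tabular data
    model = SCARFPretrainer(n_features=20)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    batch = x_train[torch.randperm(1000)[:128]]   # batch size 128
    loss = model(batch, x_train)
    opt.zero_grad(); loss.backward(); opt.step()
```

In a full run, the encoder f would then be fine-tuned with a supervised head under the early-stopping criteria quoted in the Experiment Setup row.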
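
Similarly, the Dataset Splits and Software Dependencies rows translate into a short evaluation-setup snippet. This is only an illustration under our assumptions (scikit-learn's `train_test_split` for the partition, random stand-in data, a per-trial `seed`); the paper does not state which splitting utility was used.

```python
# Sketch of the 70%/10%/20% train/validation/test split and the quoted
# XGBClassifier settings; the splitting utility, variable names, and data
# here are our assumptions, not the authors' code.
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))           # stand-in feature matrix
y = rng.integers(0, 2, size=1000)         # stand-in labels
seed = 0                                  # a different seed is drawn per trial

# 20% test first, then 10% of the full data (12.5% of the remainder) for validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=seed)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.125, random_state=seed)

xgb = XGBClassifier(max_depth=3, n_estimators=100, learning_rate=0.1)
xgb.fit(X_train, y_train)
print("validation accuracy:", xgb.score(X_val, y_val))
```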