Provable Benefit of Cutout and CutMix for Feature Learning

Authors: Junsoo Oh, Chulhee Yun

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | Our theorems demonstrate that Cutout training can learn low-frequency features that vanilla training cannot, while CutMix training can learn even rarer features that Cutout cannot capture. From this, we establish that CutMix yields the highest test accuracy among the three methods. Our novel analysis reveals that CutMix training makes the network learn all features and noise vectors evenly, regardless of their rarity and strength, which provides an interesting insight into understanding patch-level augmentation. (A minimal sketch of both augmentations appears after this table.)
Researcher Affiliation | Academia | Junsoo Oh (KAIST AI, junsoo.oh@kaist.ac.kr); Chulhee Yun (KAIST AI, chulhee.yun@kaist.ac.kr)
Pseudocode | No | The paper describes mathematical derivations and algorithms but does not present them in a clearly labeled pseudocode or algorithm block.
Open Source Code | No | The paper does not provide an explicit statement about, or a link to, open-source code for the described methodology.
Open Datasets | Yes | 'We conduct experiments both in our setting and real-world data CIFAR-10 to support our theoretical findings and intuition.'
Dataset Splits | No | The paper mentions 'training set' and 'test data' but does not explicitly detail validation splits. It states: 'Using a training set sampled from the distribution D, we would like to train our network f_W to learn to correctly classify unseen data points from D.'
Hardware Specification | Yes | 'For all experiments described in this section and in Section 5, we use NVIDIA RTX A6000 GPUs.'
Software Dependencies | No | The paper mentions using 'SGD' for optimization but does not provide specific version numbers for software libraries or dependencies used in the experiments.
Experiment Setup | Yes | 'For the numerical experiments on our setting, we set the number of patches P = 3, dimension d = 2000, number of data points n = 300, dominant noise strength σ_d = 0.25, background noise strength σ_b = 0.15, and feature noise strength α = 0.005. ... For the learner network, we set the slope of the negative regime β = 0.1 and the length of the smoothed interval r = 1. We train models using three methods: ERM, Cutout, and CutMix with a learning rate η = 1.' (A configuration sketch also follows the table.)
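For context on the two augmentations analyzed in the Research Type row, below is a minimal PyTorch-style sketch of Cutout (DeVries & Taylor, 2017) and CutMix (Yun et al., 2019). The patch size, placement rule, and Beta mixing parameter are illustrative assumptions and do not reproduce the paper's exact training setup.

```python
import torch

def cutout(x, patch_size=8):
    # Cutout: zero out one randomly centered square patch per batch.
    # patch_size is an illustrative choice, not the paper's setting.
    _, _, h, w = x.shape
    cy = torch.randint(h, (1,)).item()
    cx = torch.randint(w, (1,)).item()
    y1, y2 = max(cy - patch_size // 2, 0), min(cy + patch_size // 2, h)
    x1, x2 = max(cx - patch_size // 2, 0), min(cx + patch_size // 2, w)
    out = x.clone()
    out[:, :, y1:y2, x1:x2] = 0.0
    return out

def cutmix(x, y, alpha=1.0):
    # CutMix: paste a random patch from a shuffled copy of the batch and
    # mix the labels in proportion to the pasted area.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randperm(x.size(0))
    _, _, h, w = x.shape
    rh, rw = int(h * (1 - lam) ** 0.5), int(w * (1 - lam) ** 0.5)
    cy = torch.randint(h, (1,)).item()
    cx = torch.randint(w, (1,)).item()
    y1, y2 = max(cy - rh // 2, 0), min(cy + rh // 2, h)
    x1, x2 = max(cx - rw // 2, 0), min(cx + rw // 2, w)
    out = x.clone()
    out[:, :, y1:y2, x1:x2] = x[idx, :, y1:y2, x1:x2]
    # Recompute the mixing weight from the actual (clipped) patch area.
    lam = 1.0 - (y2 - y1) * (x2 - x1) / (h * w)
    return out, y, y[idx], lam
```

A typical training step would then compute `loss = lam * criterion(f(out), y) + (1 - lam) * criterion(f(out), y[idx])`, so the label mixing mirrors the pixel mixing.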
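The Experiment Setup row fixes all reported hyperparameters. The sketch below collects them in one place and adds one common C^1 smoothing of leaky ReLU consistent with the reported negative-regime slope β and smoothed-interval length r; the variable names and the exact smoothing formula are our assumptions, as the paper's activation may be defined differently.

```python
import numpy as np

# Values copied from the reported setup; names are ours.
P, d, n = 3, 2000, 300            # patches, dimension, training points
sigma_d, sigma_b = 0.25, 0.15     # dominant / background noise strength
alpha = 0.005                     # feature noise strength
beta, r = 0.1, 1.0                # activation slope / smoothing length
eta = 1.0                         # SGD learning rate
methods = ("ERM", "Cutout", "CutMix")

def smoothed_leaky_relu(z, beta=0.1, r=1.0):
    # One standard C^1 smoothing (an assumption, not the paper's formula):
    # slope beta for z <= 0, a quadratic on [0, r], slope 1 beyond r,
    # with values and derivatives matching at both joints.
    return np.where(
        z <= 0.0,
        beta * z,
        np.where(
            z >= r,
            z - (1.0 - beta) * r / 2.0,
            beta * z + (1.0 - beta) * z**2 / (2.0 * r),
        ),
    )
```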