Provable Benefit of Cutout and CutMix for Feature Learning
Authors: Junsoo Oh, Chulhee Yun
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | Our theorems demonstrate that Cutout training can learn low-frequency features that vanilla training cannot, while CutMix training can learn even rarer features that Cutout cannot capture. From this, we establish that CutMix yields the highest test accuracy among the three. Our novel analysis reveals that CutMix training makes the network learn all features and noise vectors evenly regardless of their rarity and strength, which provides an interesting insight into understanding patch-level augmentation. (A hedged code sketch of the Cutout and CutMix augmentations appears after this table.) |
| Researcher Affiliation | Academia | Junsoo Oh, KAIST AI, junsoo.oh@kaist.ac.kr; Chulhee Yun, KAIST AI, chulhee.yun@kaist.ac.kr |
| Pseudocode | No | The paper describes mathematical derivations and algorithms but does not present them in a clearly labeled pseudocode or algorithm block. |
| Open Source Code | No | The paper does not provide an explicit statement about, or link to, open-source code for the described methodology. |
| Open Datasets | Yes | We conduct experiments both in our setting and on real-world data (CIFAR-10) to support our theoretical findings and intuition. |
| Dataset Splits | No | The paper mentions 'training set' and 'test data' but does not explicitly detail validation splits. It states: 'Using a training set sampled from the distribution D, we would like to train our network f W to learn to correctly classify unseen data points from D.' |
| Hardware Specification | Yes | For all experiments described in this section and in Section 5, we use NVIDIA RTX A6000 GPUs. |
| Software Dependencies | No | The paper mentions using 'SGD' for optimization but does not provide specific version numbers for software libraries or dependencies used in the experiments. |
| Experiment Setup | Yes | For the numerical experiments on our setting, we set the number of patches P = 3, dimension d = 2000, number of data points n = 300, dominant noise strength σd = 0.25, background noise strength σb = 0.15, and feature noise strength α = 0.005. ... For the learner network, we set the slope of negative regime β = 0.1 and the length of the smoothed interval r = 1. We train models using three methods: ERM, Cutout, and CutMix with a learning rate η = 1. (See the configuration sketch below the table.) |
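
As a reference for the two augmentations analyzed in the paper, below is a minimal sketch of patch-level Cutout and CutMix in PyTorch-style Python. The patch size, tensor shapes, and per-batch box placement are illustrative assumptions, not the paper's exact training pipeline.

```python
import torch

def cutout(x: torch.Tensor, patch: int = 8) -> torch.Tensor:
    """Zero out a random square patch of each image in a batch.

    x: images of shape (B, C, H, W). The patch size is an illustrative
    choice, not the paper's exact setting.
    """
    b, _, h, w = x.shape
    out = x.clone()
    for i in range(b):
        top = torch.randint(0, h - patch + 1, (1,)).item()
        left = torch.randint(0, w - patch + 1, (1,)).item()
        out[i, :, top:top + patch, left:left + patch] = 0.0
    return out

def cutmix(x: torch.Tensor, y: torch.Tensor, patch: int = 8):
    """Paste a random patch from a shuffled copy of the batch into each
    image, mixing the labels in proportion to the pasted area.

    y: one-hot (float) labels of shape (B, num_classes), so that the
    mixed target is a convex combination of the two originals.
    """
    b, _, h, w = x.shape
    perm = torch.randperm(b)
    top = torch.randint(0, h - patch + 1, (1,)).item()
    left = torch.randint(0, w - patch + 1, (1,)).item()
    out = x.clone()
    out[:, :, top:top + patch, left:left + patch] = \
        x[perm, :, top:top + patch, left:left + patch]
    lam = 1.0 - (patch * patch) / (h * w)  # fraction of the original kept
    return out, lam * y + (1.0 - lam) * y[perm]
```

For simplicity this sketch samples one box location per batch (the common CutMix convention) rather than per example; either choice preserves the key property that the label mix matches the pixel mix.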
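
For quick reference, the numerical-experiment hyperparameters quoted in the Experiment Setup row can be collected into one configuration. The key names below are hypothetical, chosen for readability; only the values come from the paper's reported setup.

```python
# Hyperparameters quoted from the paper's numerical experiments; the key
# names are illustrative, only the values are taken from the text.
config = {
    "num_patches": 3,          # P
    "dimension": 2000,         # d
    "num_data": 300,           # n
    "dominant_noise": 0.25,    # sigma_d
    "background_noise": 0.15,  # sigma_b
    "feature_noise": 0.005,    # alpha
    "negative_slope": 0.1,     # beta, slope of the activation's negative regime
    "smoothing_interval": 1,   # r, length of the smoothed interval
    "learning_rate": 1.0,      # eta, SGD step size
    "methods": ["ERM", "Cutout", "CutMix"],
}
```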