Simplicity Bias of Two-Layer Networks beyond Linearly Separable Data
Authors: Nikita Tsoy, Nikola Konstantinov
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we characterize simplicity bias for general datasets in the context of two-layer neural networks initialized with small weights and trained with gradient flow. Specifically, we prove that in the early training phases, network features cluster around a few directions that do not depend on the size of the hidden layer. Furthermore, for datasets with an XOR-like pattern, we precisely identify the learned features and demonstrate that simplicity bias intensifies during later training stages. These results indicate that features learned in the middle stages of training may be more useful for OOD transfer. We support this hypothesis with experiments on image data. (A minimal sketch of this two-layer setup appears after the table.) |
| Researcher Affiliation | Academia | INSAIT, Sofia University, Bulgaria. |
| Pseudocode | No | The paper contains mathematical proofs and descriptions of dynamics but does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Replication files are available at https://github.com/nikita-tsoy98/simplicity-bias-beyond-linear-replication |
| Open Datasets | Yes | We test this hypothesis on the MNIST-CIFAR10 domino dataset proposed by Shah et al. (2020). (A construction sketch for a domino example appears after the table.) |
| Dataset Splits | Yes | We further devoted 25% of the train and test data to validation, giving us four datasets: train-train, train-validation, test-train, and test-validation. (A split sketch appears after the table.) |
| Hardware Specification | No | The paper mentions training a ResNet-18 model, but it does not specify any particular hardware components, such as CPU models, GPU models (e.g., NVIDIA A100), or memory capacity. |
| Software Dependencies | No | The paper mentions using "PyTorch" and the "Transformers library" for the learning-rate scheduler, but it does not specify version numbers for these software components, which are needed for reproducibility. |
| Experiment Setup | Yes | Batch size: 128; learning rate: 0.125; momentum: 0.9; Nesterov: True; weight decay: 0.0005; share of warm-up steps: 12.5%. (A configuration sketch appears after the table.) |
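The following is a minimal, illustrative sketch of the setting described under Research Type, not the authors' code: a two-layer ReLU network with small random initialization, trained by full-batch gradient descent (a discretization of gradient flow) on XOR-like data. The width, initialization scale, learning rate, and step count are all assumed values chosen for illustration.

```python
import torch

torch.manual_seed(0)

# XOR-like data in 2D: the label is the product of the signs of the coordinates
X = torch.randn(512, 2)
y = torch.sign(X[:, 0] * X[:, 1])

# Small initialization, as in the paper's setting; the scale is an assumed value
width, init_scale, lr = 1024, 1e-3, 0.1
W = (init_scale * torch.randn(width, 2)).requires_grad_()  # input-layer weights
a = (init_scale * torch.randn(width)).requires_grad_()     # output-layer weights

for step in range(2000):
    out = torch.relu(X @ W.T) @ a
    # logistic loss on +/-1 labels
    loss = torch.nn.functional.soft_margin_loss(out, y)
    loss.backward()
    with torch.no_grad():
        W -= lr * W.grad
        a -= lr * a.grad
    W.grad.zero_()
    a.grad.zero_()

# Under simplicity bias, the neurons' input-weight directions should
# concentrate around a few directions that do not depend on the width.
dirs = torch.nn.functional.normalize(W.detach(), dim=1)
print(dirs[:10])
```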
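For the Open Datasets row, here is a hedged sketch of how an MNIST-CIFAR10 domino example can be assembled in the spirit of Shah et al. (2020): an MNIST digit is stacked on top of a CIFAR-10 image so that a simple feature and a complex feature co-occur. Resizing MNIST to 32x32, the channel replication, and the lack of any class pairing are simplifications; the original pairing scheme is not reproduced here.

```python
import torch
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()
mnist = datasets.MNIST("data", train=True, download=True, transform=to_tensor)
cifar = datasets.CIFAR10("data", train=True, download=True, transform=to_tensor)

def make_domino(mnist_img: torch.Tensor, cifar_img: torch.Tensor) -> torch.Tensor:
    """Stack an MNIST digit (1x28x28) on top of a CIFAR-10 image (3x32x32)."""
    # Replicate the grayscale channel and resize to CIFAR-10's 32x32 resolution
    resized = torch.nn.functional.interpolate(
        mnist_img.expand(3, -1, -1).unsqueeze(0), size=(32, 32)
    ).squeeze(0)
    return torch.cat([resized, cifar_img], dim=1)  # vertical stack: 3x64x32

m_img, m_label = mnist[0]
c_img, c_label = cifar[0]
print(make_domino(m_img, c_img).shape)  # torch.Size([3, 64, 32])
```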
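For Dataset Splits, a sketch of the 75/25 carve-out the quoted passage describes: both the train set and the test set give up 25% of their examples for validation, producing the four datasets named in the quote. The placeholder tensors and the fixed seed are assumptions made so the example runs on its own.

```python
import torch
from torch.utils.data import TensorDataset, random_split

def split_off_validation(dataset, val_share=0.25, seed=0):
    """Split a dataset into a (1 - val_share) part and a val_share part."""
    n_val = int(len(dataset) * val_share)
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, [len(dataset) - n_val, n_val], generator=generator)

# Placeholder datasets standing in for the domino train and test sets
train_set = TensorDataset(torch.randn(1000, 3, 64, 32), torch.randint(0, 10, (1000,)))
test_set = TensorDataset(torch.randn(200, 3, 64, 32), torch.randint(0, 10, (200,)))

train_train, train_validation = split_off_validation(train_set)
test_train, test_validation = split_off_validation(test_set)
print(len(train_train), len(train_validation), len(test_train), len(test_validation))
```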
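Finally, a sketch of the reported Experiment Setup in PyTorch, using the Transformers scheduler the Software Dependencies row mentions. The model, the total step count, and the choice of a linear warm-up schedule are assumptions; only the hyperparameter values come from the paper.

```python
import torch
from torchvision.models import resnet18
from transformers import get_linear_schedule_with_warmup

model = resnet18(num_classes=10)  # placeholder; the paper trains a ResNet-18

# Hyperparameters as reported in the Experiment Setup row
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.125,
    momentum=0.9,
    nesterov=True,
    weight_decay=0.0005,
)

total_steps = 10_000  # assumed; the paper's total step count is not quoted here
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.125 * total_steps),  # 12.5% share of warm-up steps
    num_training_steps=total_steps,
)

# Batch size 128 would be applied on the data loader, e.g.
# torch.utils.data.DataLoader(train_train, batch_size=128, shuffle=True)
```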