Tilting the playing field: Dynamical loss functions for machine learning

Authors: Miguel Ruiz-Garcia, Ge Zhang, Samuel S. Schoenholz, Andrea J. Liu

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that learning can be improved by using loss functions that evolve cyclically during training to emphasize one class at a time. In underparameterized networks, such dynamical loss functions can lead to successful training for networks that fail to find deep minima of the standard cross-entropy loss. In overparameterized networks, dynamical loss functions can lead to better generalization. Improvement arises from the interplay of the changing loss landscape with the dynamics of the system as it evolves to minimize the loss. In particular, as the loss function oscillates, instabilities develop in the form of bifurcation cascades, which we study using the Hessian and the Neural Tangent Kernel. Valleys in the landscape widen and deepen, and then narrow and rise as the loss landscape changes during a cycle. As the landscape narrows, the learning rate becomes too large and the network becomes unstable and bounces around the valley. This process ultimately pushes the system into deeper and wider regions of the loss landscape and is characterized by decreasing eigenvalues of the Hessian. This results in better-regularized models with improved generalization performance. (A hedged sketch of an oscillating, class-weighted loss follows the table.)
Researcher Affiliation | Collaboration | (1) Department of Physics and Astronomy, University of Pennsylvania, Philadelphia, PA, USA; (2) Department of Applied Mathematics, ETSII, Universidad Politécnica de Madrid, Madrid, Spain; (3) Google Research, Brain Team. Correspondence to: Miguel Ruiz-Garcia <miguel.ruiz.garcia@uc3m.es>.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code reproducing our main results can be found at https://github.com/miguel-rg/dynamical-loss-functions.
Open Datasets | Yes | To test the effect of oscillations on the outcome of training, we use CIFAR10 as a benchmark, without data augmentation. (A data-loading sketch follows the table.)
Dataset Splits | No | The paper reports validation accuracy and mentions training and validation sets for the spiral dataset ('the validation dataset is analogous to it but with a different distribution of the points along the arms'). However, it does not provide split percentages, sample counts, or references to predefined splits for either dataset, so the data partitioning cannot be reproduced.
Hardware Specification | No | The paper states: 'MRG and GZ acknowledge support from the Extreme Science and Engineering Discovery Environment (XSEDE) (Towns et al., 2014) to use Bridges-2 GPU-AI at the Pittsburgh Supercomputing Center (PSC) through allocation TG-PHY190040.' While it mentions a computing center and 'GPU-AI', it does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts.
Software Dependencies | No | The paper mentions key software components: 'In all of the experiments we use JAX (Bradbury et al., 2018) for training, Neural Tangents for computation of the NTK (Novak et al., 2020), and an open source implementation of the Lanczos algorithm for estimating the spectrum of the Hessian (Ghorbani et al., 2019a).' However, it does not provide specific version numbers for JAX or Neural Tangents, which are necessary for reproducible software dependencies. (An NTK example follows the table.)
Experiment Setup | Yes | We used 64 channels, the Nesterov optimizer with momentum = 0.9, minibatch size 512, and a linear learning rate schedule starting at 0, reaching 0.02 at epoch 300 and decreasing to 0.002 at the final epoch (700). For all A and T the oscillations stopped at epoch 600 (see the Supplementary Materials for more details). (An optimizer/schedule sketch follows the table.)
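
To make the "dynamical loss function" idea concrete, below is a minimal JAX sketch of a softmax cross-entropy whose per-class weights oscillate in time so that one class is emphasized at a time. The weighting schedule, the names `class_weights` and `dynamical_cross_entropy`, and the parameters `A` (amplitude) and `T` (period in epochs) are illustrative assumptions; the exact functional form used in the paper may differ.

```python
import jax
import jax.numpy as jnp

def class_weights(epoch, num_classes, A, T):
    # Illustrative oscillation: each class is emphasized in turn, with
    # amplitude A and period T (in epochs). The paper's exact schedule
    # may differ.
    phase = 2.0 * jnp.pi * (epoch / T - jnp.arange(num_classes) / num_classes)
    return 1.0 + 0.5 * A * (1.0 + jnp.cos(phase))

def dynamical_cross_entropy(logits, labels, epoch, A=2.0, T=20.0):
    # Weighted softmax cross-entropy whose per-class weights change as
    # training progresses, tilting the loss landscape toward one class
    # at a time.
    num_classes = logits.shape[-1]
    w = class_weights(epoch, num_classes, A, T)          # (num_classes,)
    log_p = jax.nn.log_softmax(logits)                   # (batch, num_classes)
    per_example = -log_p[jnp.arange(labels.shape[0]), labels]
    return jnp.mean(w[labels] * per_example)
```

In this sketch, setting A = 0 recovers the standard (uniform-weight) cross-entropy, which is one way to implement the reported choice of stopping the oscillations at epoch 600.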
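
The CIFAR-10 benchmark "without data augmentation" can be loaded as plain arrays. The use of tensorflow_datasets and the simple rescaling below are assumptions for illustration, not the paper's stated pipeline.

```python
import numpy as np
import tensorflow_datasets as tfds

# CIFAR-10 as full NumPy arrays; no augmentation, only a cast to float32
# and rescaling to [0, 1] (the exact preprocessing is an assumption).
(x_train, y_train), (x_test, y_test) = tfds.as_numpy(
    tfds.load("cifar10", split=["train", "test"],
              batch_size=-1, as_supervised=True)
)
x_train = x_train.astype(np.float32) / 255.0
x_test = x_test.astype(np.float32) / 255.0
```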
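
The paper computes the Neural Tangent Kernel with the Neural Tangents library. The sketch below shows one way to obtain an empirical NTK for a small fully connected model; the architecture here is illustrative, not the 64-channel network used in the experiments, and the Lanczos-based Hessian spectrum estimate is not reproduced.

```python
import jax
import neural_tangents as nt
from neural_tangents import stax

# Small fully connected network purely for illustration.
init_fn, apply_fn, _ = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(10),
)

key = jax.random.PRNGKey(0)
_, params = init_fn(key, (-1, 32 * 32 * 3))   # flattened CIFAR-10 inputs

# Empirical (finite-width) NTK on a batch of inputs.
ntk_fn = nt.empirical_ntk_fn(apply_fn, trace_axes=(-1,))
x = jax.random.normal(key, (32, 32 * 32 * 3))
kernel = ntk_fn(x, None, params)              # (32, 32) kernel matrix
```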
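
The reported optimizer and learning-rate schedule could be expressed as follows. The use of optax and the steps-per-epoch conversion are assumptions (the paper only states the epoch-level schedule and that training was done in JAX).

```python
import optax

steps_per_epoch = 50_000 // 512 + 1   # CIFAR-10 training set, minibatch size 512

# Linear ramp from 0 to 0.02 over the first 300 epochs, then a linear
# decay to 0.002 by the final epoch (700).
learning_rate = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=0.02,
                              transition_steps=300 * steps_per_epoch),
        optax.linear_schedule(init_value=0.02, end_value=0.002,
                              transition_steps=400 * steps_per_epoch),
    ],
    boundaries=[300 * steps_per_epoch],
)

optimizer = optax.sgd(learning_rate=learning_rate, momentum=0.9, nesterov=True)
```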