A Recipe for Global Convergence Guarantee in Deep Neural Networks

Authors: Kenji Kawaguchi, Qingyun Sun (pp. 8074-8082)

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On the other hand, the expressivity condition is numerically shown to hold data-dependently for deep (convolutional) ResNet with batch normalization with various standard image datasets. We also show that the proposed algorithm has generalization performances comparable with those of the heuristic algorithm, with the same hyper-parameters and total number of iterations. Therefore, the proposed algorithm can be viewed as a step towards providing theoretical guarantees for deep learning in the practical regime. (Section 7, Experiments) In this section, we study the empirical aspect of our method.
Researcher Affiliation | Academia | 1 Harvard University, 2 Stanford University
Pseudocode | Yes | Algorithm 1: Two-phase modification A of a base algorithm with global convergence guarantees. (A minimal sketch of this two-phase scheme is given after the table.)
Open Source Code | No | The paper does not contain any explicit statements about providing open-source code for the described methodology, nor does it include a link to a code repository.
Open Datasets | Yes | On the other hand, the expressivity condition is numerically shown to hold data-dependently for deep (convolutional) ResNet with batch normalization with various standard image datasets. [...] Table 1: Test errors (%) of base and A(base) with guarantee, where the operator A maps any given first-order training algorithm to the two-phase version of the given algorithm with theoretical guarantees. The numbers indicate the mean test errors (and standard deviations in parentheses) over five random trials. The column of Augmentation shows No for no data augmentation and Yes for data augmentation. The expressivity condition (Assumption 1) was numerically verified for all datasets. Dataset | # of training data | Expressivity Condition | Augmentation | Base | A(base) with guarantee: MNIST 60000 [...] CIFAR-10 50000 [...] CIFAR-100 50000 [...] SVHN 73257 [...] Table 2: Test errors (%) of A(base) with guarantee for Kuzushiji-MNIST with different hyperparameters τ = τ0T and δ = δ0ϵ. The numbers indicate the mean test errors (and standard deviations in parentheses) over three random trials. The expressivity condition (Assumption 1) was numerically verified to hold for Kuzushiji-MNIST as well.
Dataset Splits | No | The paper mentions 'test and validation datasets' in the context of avoiding overfitting during hyperparameter tuning, stating that the hyperparameters were fixed a priori. However, it does not explicitly provide training, validation, and test dataset splits (e.g., percentages or sample counts) for the main experiments reported in Table 1 and Table 2.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory, or cloud instances) used for running the experiments.
Software Dependencies | Yes | Table 1, column 3 summarizes the results of the verification of Assumption 1 for various datasets. Here, we used a randomly sampled w^(1:H) returned from the default initialization of the ResNet with version 1.4.0 of PyTorch (Paszke et al. 2019), setting the random seed to 1. This initialization is based on the implementation of (He et al. 2015). The condition rank([h_X^(H)(w^(1:H)), 1_n]) = n was checked using numpy.linalg.matrix_rank in NumPy version 1.18.1 with the default options (i.e., without any arguments except the matrix [h_X^(H)(w^(1:H)), 1_n]), which uses the standard method from (Press et al. 2007). (A minimal sketch of this rank check follows the table.)
Experiment Setup | Yes | Concretely, we fixed the mini-batch size to be 64, the weight decay rate to be 10^-5, the momentum coefficient to be 0.9, the first-phase learning rate to be η_t = 0.01, and the second-phase learning rate to be η_t = 0.01 · [0_{d_{1:H}}, 1_{d_{H+1}}] to only train the last layer. The last epoch T was fixed a priori as T = 100 without data augmentation and T = 400 with data augmentation. [...] Based on the results from Kuzushiji-MNIST in Table 2, we fixed τ0 = 0.6 and δ0 = 0.001 for all datasets.
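
A minimal sketch of the two-phase scheme referenced in the Pseudocode and Experiment Setup rows: run the base first-order optimizer on all weights for the first τ = τ0·T epochs, then zero the learning rate of every layer except the last one for the remaining epochs, mimicking the quoted second-phase rate η_t = 0.01 · [0_{d_{1:H}}, 1_{d_{H+1}}]. The toy model, synthetic data, and the simple epoch-count switch are assumptions (the paper's Algorithm 1 also involves the tolerance δ = δ0·ϵ); only the quoted hyper-parameters (batch size 64, weight decay 10^-5, momentum 0.9, learning rate 0.01, τ0 = 0.6, T = 100) are taken from the paper.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for an image dataset (MNIST-shaped inputs), batch size 64 as in the paper.
inputs = torch.randn(512, 1, 28, 28)
labels = torch.randint(0, 10, (512,))
train_loader = DataLoader(TensorDataset(inputs, labels), batch_size=64, shuffle=True)

# Toy stand-in for the deep (convolutional) ResNet with batch normalization used in the paper.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),                 # last layer: the only one updated in phase 2
)
last_layer = model[-1]
earlier_params = [p for m in list(model)[:-1] for p in m.parameters()]

# Two parameter groups so phase 2 can zero the learning rate of all layers
# except the last one, i.e. eta_t = 0.01 * [0_{d_{1:H}}, 1_{d_{H+1}}].
optimizer = torch.optim.SGD(
    [{"params": earlier_params, "lr": 0.01},
     {"params": last_layer.parameters(), "lr": 0.01}],
    momentum=0.9, weight_decay=1e-5,
)
criterion = nn.CrossEntropyLoss()

T = 100                    # last epoch, fixed a priori (no data augmentation)
tau = int(0.6 * T)         # phase switch at tau = tau0 * T with tau0 = 0.6

for epoch in range(T):
    if epoch == tau:
        optimizer.param_groups[0]["lr"] = 0.0   # phase 2: only the last layer moves
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

Splitting the parameters into two groups rather than freezing modules keeps the sketch close to the paper's description, which expresses the second phase as a coordinate-wise learning-rate vector rather than as a change to the model itself.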
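
The expressivity check described in the Software Dependencies row reduces to appending a column of ones to the last-hidden-layer feature matrix and verifying that the result has rank n. Below is a minimal sketch of that check; the random Gaussian feature matrix and its dimensions are placeholders standing in for h_X^(H)(w^(1:H)) extracted from a ResNet at its default PyTorch initialization, so they are assumptions rather than the authors' exact setup.

import numpy as np

# Placeholder for the last-hidden-layer features h_X^(H)(w^(1:H)) of the n
# training points; in the paper these come from a ResNet at the default
# PyTorch initialization with random seed 1, not from random Gaussians.
n, d_H = 1000, 2048                      # assumed sizes for illustration
rng = np.random.default_rng(1)
features = rng.standard_normal((n, d_H))

# Append the all-ones column 1_n and check rank([h_X^(H)(w^(1:H)), 1_n]) = n
# with numpy.linalg.matrix_rank and its default options, as reported in the paper.
augmented = np.concatenate([features, np.ones((n, 1))], axis=1)
expressivity_holds = np.linalg.matrix_rank(augmented) == n
print("Expressivity condition (Assumption 1) holds:", expressivity_holds)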