A Recipe for Global Convergence Guarantee in Deep Neural Networks
Authors: Kenji Kawaguchi, Qingyun Sun
AAAI 2021, pp. 8074-8082 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the other hand, the expressivity condition is numerically shown to hold data-dependently for deep (convolutional) ResNet with batch normalization with various standard image datasets. We also show that the proposed algorithm has generalization performances comparable with those of the heuristic algorithm, with the same hyper-parameters and total number of iterations. Therefore, the proposed algorithm can be viewed as a step towards providing theoretical guarantees for deep learning in the practical regime. (Section 7, Experiments) In this section, we study the empirical aspect of our method. |
| Researcher Affiliation | Academia | 1 Harvard University, 2 Stanford University |
| Pseudocode | Yes | Algorithm 1 Two-phase modification A of a base algorithm with global convergence guarantees |
| Open Source Code | No | The paper does not contain any explicit statements about providing open-source code for the described methodology, nor does it include a link to a code repository. |
| Open Datasets | Yes | On the other hand, the expressivity condition is numerically shown to hold data-dependently for deep (convolutional) ResNet with batch normalization with various standard image datasets. [...] Table 1: Test errors (%) of base and A(base) with guarantee where the operator A maps any given first-order training algorithm to the two-phase version of the given algorithm with theoretical guarantees. The numbers indicate the mean test errors (and standard deviations in parentheses) over five random trials. The column of Augmentation shows No for no data augmentation, and Yes for data augmentation. The expressivity condition (Assumption 1) was numerically verified for all datasets. Dataset # of training data Expressivity Condition Augmentation Base A(base) with guarantee MNIST 60000 [...] CIFAR-10 50000 [...] CIFAR-100 50000 [...] SVHN 73257 [...] Table 2: Test errors (%) of A(base) with guarantee for Kuzushiji-MNIST with different hyperparameters τ = τ0T and δ = δ0ϵ. The numbers indicate the mean test errors (and standard deviations in parentheses) over three random trials. The expressivity condition (Assumption 1) was numerically verified to hold for Kuzushiji-MNIST as well. |
| Dataset Splits | No | The paper mentions 'test and validation datasets' in the context of avoiding overfitting during hyperparameter tuning, stating they fixed hyperparameters a priori. However, it does not explicitly provide training, validation, and test dataset splits (e.g., percentages or sample counts) for the main experiments reported in Table 1 and Table 2. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory, or cloud instances) used for running the experiments. |
| Software Dependencies | Yes | Table 1 column 3 summarizes the results of the verification of Assumption 1 for various datasets. Here, we used a randomly sampled w^(1:H) returned from the default initialization of the ResNet with version 1.4.0 of PyTorch (Paszke et al. 2019) by setting the random seed to be 1. This initialization is based on the implementation of (He et al. 2015). The condition rank([h_X^(H)(w^(1:H)), 1_n]) = n was checked by using numpy.linalg.matrix_rank in NumPy version 1.18.1 with the default option (i.e., without any arguments except the matrix [h_X^(H)(w^(1:H)), 1_n]), which uses the standard method from (Press et al. 2007). (A minimal NumPy sketch of this rank check is given after the table.) |
| Experiment Setup | Yes | Concretely, we fixed the mini-batch size to be 64, the weight decay rate to be 10^-5, the momentum coefficient to be 0.9, the first phase learning rate to be η_t = 0.01, and the second phase learning rate to be η_t = 0.01 · [0_{d_{1:H}}, 1_{d_{H+1}}] to only train the last layer. The last epoch T was fixed a priori as T = 100 without data augmentation and T = 400 with data augmentation. [...] Based on the results from Kuzushiji-MNIST in Table 2, we fixed τ0 = 0.6 and δ0 = 0.001 for all datasets. (A sketch of this two-phase schedule is given after the table.) |
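
For reference, below is a minimal sketch of the rank check quoted in the Software Dependencies row. It assumes the caller has already computed the feature matrix H, whose rows are the last-hidden-layer outputs of the randomly initialized ResNet on the training inputs; the function name `expressivity_condition_holds` and the stand-in random feature matrix are illustrative additions, not from the paper.

```python
import numpy as np

def expressivity_condition_holds(H: np.ndarray) -> bool:
    """Check rank([h_X^(H)(w^(1:H)), 1_n]) = n, as quoted above.

    H is the (n, d_H) matrix whose i-th row is the last-hidden-layer output of
    the randomly initialized network on the i-th training input; computing H
    (the ResNet forward pass up to the last hidden layer) is left to the caller.
    """
    n = H.shape[0]
    M = np.concatenate([H, np.ones((n, 1))], axis=1)  # append the all-ones column 1_n
    # Default numpy.linalg.matrix_rank call, with no arguments besides the matrix,
    # matching the paper's description.
    return np.linalg.matrix_rank(M) == n

# Usage with a stand-in feature matrix (replace with real ResNet features):
rng = np.random.default_rng(1)
H = rng.standard_normal((100, 256))  # n = 100 samples, d_H = 256 features
print(expressivity_condition_holds(H))
```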
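
Similarly, here is a minimal sketch of the two-phase schedule described in the Pseudocode and Experiment Setup rows, with SGD plus momentum standing in for the base algorithm. The function name `two_phase_train`, the assumption that the model's final child module is its output layer, and the omission of Algorithm 1's δ = δ0·ϵ step are simplifications made here, not details from the paper.

```python
import torch

def two_phase_train(model, train_loader, loss_fn, T=100, tau0=0.6, lr=0.01):
    """Two-phase training sketch: base updates for the first tau = tau0 * T epochs,
    then last-layer-only updates (learning rate 0.01 * [0_{d_{1:H}}, 1_{d_{H+1}}])
    for the remaining epochs. Hyperparameters follow the quoted setup
    (momentum 0.9, weight decay 1e-5, eta_t = 0.01); the mini-batch size of 64
    is assumed to be set in train_loader."""
    tau = int(tau0 * T)  # epoch at which the second phase starts
    last_layer = list(model.children())[-1]  # assumed: the final module is the output layer

    # Phase 1 optimizer updates all parameters; phase 2 optimizer only the last layer,
    # which has the same effect as zeroing the learning rate on layers 1..H.
    opt_all = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=1e-5)
    opt_last = torch.optim.SGD(last_layer.parameters(), lr=lr, momentum=0.9, weight_decay=1e-5)

    for epoch in range(T):
        optimizer = opt_all if epoch < tau else opt_last
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
    return model
```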