A Recipe for Global Convergence Guarantee in Deep Neural Networks

Authors: Kenji Kawaguchi, Qingyun Sun (pp. 8074-8082)

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On the other hand, the expressivity condition is numerically shown to hold data-dependently for deep (convolutional) ResNet with batch normalization with various standard image datasets. We also show that the proposed algorithm has generalization performances comparable with those of the heuristic algorithm, with the same hyper-parameters and total number of iterations. Therefore, the proposed algorithm can be viewed as a step towards providing theoretical guarantees for deep learning in the practical regime. (Section 7, Experiments) In this section, we study the empirical aspect of our method.
Researcher Affiliation | Academia | 1 Harvard University, 2 Stanford University
Pseudocode | Yes | Algorithm 1: Two-phase modification A of a base algorithm with global convergence guarantees. (A minimal sketch of this two-phase scheme is given after the table.)
Open Source Code | No | The paper does not contain any explicit statements about providing open-source code for the described methodology, nor does it include a link to a code repository.
Open Datasets | Yes | On the other hand, the expressivity condition is numerically shown to hold data-dependently for deep (convolutional) ResNet with batch normalization with various standard image datasets. [...] Table 1: Test errors (%) of base and A(base) with guarantee, where the operator A maps any given first-order training algorithm to the two-phase version of the given algorithm with theoretical guarantees. The numbers indicate the mean test errors (and standard deviations in parentheses) over five random trials. The column of Augmentation shows No for no data augmentation and Yes for data augmentation. The expressivity condition (Assumption 1) was numerically verified for all datasets. Dataset | # of training data | Expressivity Condition | Augmentation | Base | A(base) with guarantee: MNIST 60000 [...] CIFAR-10 50000 [...] CIFAR-100 50000 [...] SVHN 73257 [...] Table 2: Test errors (%) of A(base) with guarantee for Kuzushiji-MNIST with different hyperparameters τ = τ0T and δ = δ0ϵ. The numbers indicate the mean test errors (and standard deviations in parentheses) over three random trials. The expressivity condition (Assumption 1) was numerically verified to hold for Kuzushiji-MNIST as well.
Dataset Splits | No | The paper mentions 'test and validation datasets' in the context of avoiding overfitting during hyperparameter tuning, stating that the hyperparameters were fixed a priori. However, it does not explicitly provide training, validation, and test dataset splits (e.g., percentages or sample counts) for the main experiments reported in Table 1 and Table 2.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory, or cloud instances) used for running the experiments.
Software Dependencies | Yes | Table 1, column 3 summarizes the results of the verification of Assumption 1 for various datasets. Here, we used a randomly sampled w^(1:H) returned from the default initialization of the ResNet with version 1.4.0 of PyTorch (Paszke et al. 2019), setting the random seed to 1. This initialization is based on the implementation of (He et al. 2015). The condition rank([h_X^(H)(w^(1:H)), 1_n]) = n was checked using numpy.linalg.matrix_rank in NumPy version 1.18.1 with the default options (i.e., without any arguments except the matrix [h_X^(H)(w^(1:H)), 1_n]), which uses the standard method from (Press et al. 2007). (A minimal sketch of this rank check follows the table.)
Experiment Setup | Yes | Concretely, we fixed the mini-batch size to be 64, the weight decay rate to be 10^-5, the momentum coefficient to be 0.9, the first-phase learning rate to be η_t = 0.01, and the second-phase learning rate to be η_t = 0.01 · [0_{d_{1:H}}, 1_{d_{H+1}}] to only train the last layer. The last epoch T was fixed a priori as T = 100 without data augmentation and T = 400 with data augmentation. [...] Based on the results from Kuzushiji-MNIST in Table 2, we fixed τ0 = 0.6 and δ0 = 0.001 for all datasets.
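
A minimal sketch of the two-phase scheme referenced in the Pseudocode and Experiment Setup rows: run the base first-order optimizer on all weights for the first τ = τ0·T epochs, then zero the learning rate of every layer except the last one for the remaining epochs, mimicking the quoted second-phase rate η_t = 0.01 · [0_{d_{1:H}}, 1_{d_{H+1}}]. The toy model, synthetic data, and the simple epoch-count switch are assumptions (the paper's Algorithm 1 also involves the tolerance δ = δ0·ϵ); only the quoted hyper-parameters (batch size 64, weight decay 10^-5, momentum 0.9, learning rate 0.01, τ0 = 0.6, T = 100) are taken from the paper.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for an image dataset (MNIST-shaped inputs), batch size 64 as in the paper.
inputs = torch.randn(512, 1, 28, 28)
labels = torch.randint(0, 10, (512,))
train_loader = DataLoader(TensorDataset(inputs, labels), batch_size=64, shuffle=True)

# Toy stand-in for the deep (convolutional) ResNet with batch normalization used in the paper.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),                 # last layer: the only one updated in phase 2
)
last_layer = model[-1]
earlier_params = [p for m in list(model)[:-1] for p in m.parameters()]

# Two parameter groups so phase 2 can zero the learning rate of all layers
# except the last one, i.e. eta_t = 0.01 * [0_{d_{1:H}}, 1_{d_{H+1}}].
optimizer = torch.optim.SGD(
    [{"params": earlier_params, "lr": 0.01},
     {"params": last_layer.parameters(), "lr": 0.01}],
    momentum=0.9, weight_decay=1e-5,
)
criterion = nn.CrossEntropyLoss()

T = 100                    # last epoch, fixed a priori (no data augmentation)
tau = int(0.6 * T)         # phase switch at tau = tau0 * T with tau0 = 0.6

for epoch in range(T):
    if epoch == tau:
        optimizer.param_groups[0]["lr"] = 0.0   # phase 2: only the last layer moves
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

Splitting the parameters into two groups rather than freezing modules keeps the sketch close to the paper's description, which expresses the second phase as a coordinate-wise learning-rate vector rather than as a change to the model itself.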
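
The expressivity check described in the Software Dependencies row reduces to appending a column of ones to the last-hidden-layer feature matrix and verifying that the result has rank n. Below is a minimal sketch of that check; the random Gaussian feature matrix and its dimensions are placeholders standing in for h_X^(H)(w^(1:H)) extracted from a ResNet at its default PyTorch initialization, so they are assumptions rather than the authors' exact setup.

import numpy as np

# Placeholder for the last-hidden-layer features h_X^(H)(w^(1:H)) of the n
# training points; in the paper these come from a ResNet at the default
# PyTorch initialization with random seed 1, not from random Gaussians.
n, d_H = 1000, 2048                      # assumed sizes for illustration
rng = np.random.default_rng(1)
features = rng.standard_normal((n, d_H))

# Append the all-ones column 1_n and check rank([h_X^(H)(w^(1:H)), 1_n]) = n
# with numpy.linalg.matrix_rank and its default options, as reported in the paper.
augmented = np.concatenate([features, np.ones((n, 1))], axis=1)
expressivity_holds = np.linalg.matrix_rank(augmented) == n
print("Expressivity condition (Assumption 1) holds:", expressivity_holds)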