Semi-flat minima and saddle points by embedding neural networks to overparameterization

Authors: Kenji Fukumizu, Shoichiro Yamaguchi, Yoh-ichi Mototake, Mirai Tanaka

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Numerical experiments. We conducted experiments on the generalization errors of networks with ReLU and tanh in overparameterization. The input and output dimension is 1. Training data of size 10 are given by N_1 (one hidden unit) for the respective models, with additive noise ε ~ N(0, 10^-2) in the output. We first trained three-layer networks with each activation to achieve zero training error (< 10^-29 in squared errors) with the minimum number of hidden units (H_0 = 5 in both models). See Figure 2(a) for an example of fitting by the ReLU network. We used the method of inactive units for embedding to N_H, and perturbed the whole parameters with N(0, ρ^2), where ρ is 0.01∥θ^(H_0)∥. The code is available in Supplements. Figure 2(b) shows the ratio of the generalization errors (average and standard error over 1000 trials) of N_H over N_{H_0} as H increases. We can see that, as more surplus units are added, the generalization errors increase for the tanh networks, while the ReLU networks show no such increase. (A code sketch of this setup follows the table.)
Researcher Affiliation | Collaboration | The Institute of Statistical Mathematics, Tachikawa, Tokyo 190-8562, Japan ({fukumizu, mototake, mirai}@ism.ac.jp); Preferred Networks, Inc., Chiyoda-ku, Tokyo 100-0004, Japan (guguchi@preferred.jp)
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available in Supplements.
Open Datasets | No | Training data of size 10 are given by N_1 (one hidden unit) for the respective models, with additive noise ε ~ N(0, 10^-2) in the output. The paper describes generating its own training data, but does not provide access to a public dataset or cite a specific public dataset with proper attribution.
Dataset Splits | No | The paper discusses training data and generalization errors but does not specify explicit training/validation/test splits, percentages, or sample counts for these partitions.
Hardware Specification | No | The paper does not specify any hardware details such as GPU models, CPU types, or cloud computing resources used for the experiments.
Software Dependencies | No | The paper does not specify any software versions or dependencies (e.g., Python version, or specific library versions such as PyTorch or TensorFlow).
Experiment Setup | Yes | We first trained three-layer networks with each activation to achieve zero training error (< 10^-29 in squared errors) with the minimum number of hidden units (H_0 = 5 in both models). ... We used the method of inactive units for embedding to N_H, and perturbed the whole parameters with N(0, ρ^2), where ρ is 0.01∥θ^(H_0)∥. (A sketch of the embedding and perturbation step follows the table.)
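
To make the quoted setup concrete, the following is a minimal sketch, in PyTorch, of the data-generation and training step: a one-hidden-unit teacher (N_1) produces 10 noisy points with output noise N(0, 10^-2), and a three-layer student with H_0 = 5 hidden units is fitted to near-zero squared training error. The teacher parameters, optimizer, step count, and helper names (make_teacher_data, ThreeLayerNet, train_to_zero_error) are assumptions for illustration, not the authors' supplementary code.

```python
# Hedged sketch of the training stage described in the table above.
import torch

torch.manual_seed(0)
torch.set_default_dtype(torch.float64)  # double precision to reach very small losses


def make_teacher_data(n=10, noise_std=0.1, activation=torch.tanh):
    """y = v * act(w * x + b) + eps; teacher weights here are assumed values."""
    x = torch.linspace(-2.0, 2.0, n).unsqueeze(1)
    w, b, v = 1.5, 0.3, 2.0
    y = v * activation(w * x + b) + noise_std * torch.randn(n, 1)
    return x, y


class ThreeLayerNet(torch.nn.Module):
    """One-input, one-output network with a single hidden layer."""

    def __init__(self, hidden, activation=torch.relu):
        super().__init__()
        self.fc1 = torch.nn.Linear(1, hidden)
        self.fc2 = torch.nn.Linear(hidden, 1)
        self.act = activation

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))


def train_to_zero_error(net, x, y, steps=200_000, lr=1e-2, tol=1e-20):
    """Full-batch Adam on the sum of squared errors until tol is reached."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    loss = ((net(x) - y) ** 2).sum()
    for _ in range(steps):
        if loss.item() < tol:
            break
        opt.zero_grad()
        loss.backward()
        opt.step()
        loss = ((net(x) - y) ** 2).sum()
    return loss.item()
```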
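Continuing the sketch above, the embedding-and-perturbation step can be illustrated as follows: the trained H_0-unit network is copied into an H-unit network whose surplus hidden units have zero weights (inactive units, so the realized function is unchanged), every parameter is perturbed with N(0, ρ^2) where ρ = 0.01∥θ^(H_0)∥, and the generalization error is estimated on freshly drawn test points. The helper names, test-set size, and choice of H values are assumptions; only the overall procedure follows the quoted description.

```python
# Hedged sketch of the "inactive units" embedding and perturbation experiment
# (reuses ThreeLayerNet, make_teacher_data, train_to_zero_error from the sketch above).
import torch


def embed_with_inactive_units(small, H, activation=torch.relu):
    """Copy a trained H0-unit ThreeLayerNet into an H-unit one (H >= H0).

    Surplus units get zero incoming and outgoing weights, so the embedded
    network realizes exactly the same function as the small one.
    """
    H0 = small.fc1.out_features
    big = ThreeLayerNet(H, activation)
    with torch.no_grad():
        big.fc1.weight.zero_()
        big.fc1.bias.zero_()
        big.fc2.weight.zero_()
        big.fc1.weight[:H0] = small.fc1.weight
        big.fc1.bias[:H0] = small.fc1.bias
        big.fc2.weight[:, :H0] = small.fc2.weight
        big.fc2.bias.copy_(small.fc2.bias)
    return big


def perturb_(net, rho):
    """Add N(0, rho^2) noise to every parameter in place."""
    with torch.no_grad():
        for p in net.parameters():
            p.add_(rho * torch.randn_like(p))


def test_error(net, x_test, y_test):
    """Mean squared generalization error on held-out points."""
    with torch.no_grad():
        return ((net(x_test) - y_test) ** 2).mean().item()


# Example run (single trial; the paper averages over 1000 trials).
x, y = make_teacher_data(activation=torch.relu)
small = ThreeLayerNet(5, torch.relu)
train_to_zero_error(small, x, y)

theta = torch.cat([p.detach().flatten() for p in small.parameters()])
rho = 0.01 * theta.norm().item()          # rho = 0.01 * ||theta^(H0)||
x_test, y_test = make_teacher_data(n=200, activation=torch.relu)

base = ThreeLayerNet(5, torch.relu)
base.load_state_dict(small.state_dict())
perturb_(base, rho)
err_small = test_error(base, x_test, y_test)

for H in (6, 8, 12, 20):                  # surplus units added
    big = embed_with_inactive_units(small, H, torch.relu)
    perturb_(big, rho)
    print(H, test_error(big, x_test, y_test) / err_small)
```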