Semi-flat minima and saddle points by embedding neural networks to overparameterization
Authors: Kenji Fukumizu, Shoichiro Yamaguchi, Yoh-ichi Mototake, Mirai Tanaka
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical experiments. We made experiments on the generalization errors of networks with ReLU and tanh in overparameterization. The input and output dimension is 1. Training data of size 10 are given by N_1 (one hidden unit) for the respective models with additive noise ε ~ N(0, 10⁻²) in the output. We first trained three-layer networks with each activation to achieve zero training error (< 10⁻²⁹ in squared errors) with the minimum number of hidden units (H_0 = 5 in both models). See Figure 2(a) for an example of fitting by the ReLU network. We used the method of inactive units for embedding to N_H, and perturbed the whole parameters with N(0, ρ²), where ρ = 0.01‖θ^(H0)‖. The code is available in Supplements. Figure 2(b) shows the ratio of the generalization errors (average and standard error over 1000 trials) of N_H over N_{H0} as H increases. We can see that, as more surplus units are added, the generalization errors increase for the tanh networks, while the ReLU networks do not show such an increase. (A hedged code sketch of this pipeline appears after the table.) |
| Researcher Affiliation | Collaboration | The Institute of Statistical Mathematics, Tachikawa, Tokyo 190-8562, Japan ({fukumizu, mototake, mirai}@ism.ac.jp); Preferred Networks, Inc., Chiyoda-ku, Tokyo 100-0004, Japan (guguchi@preferred.jp) |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available in Supplements. |
| Open Datasets | No | Training data of size 10 are given by N_1 (one hidden unit) for the respective models with additive noise ε ~ N(0, 10⁻²) in the output. The paper describes generating its own training data, but does not provide access to a public dataset or cite a specific public dataset with proper attribution. |
| Dataset Splits | No | The paper discusses 'Training data' and 'Generalization Errors' but does not specify explicit training/validation/test splits, percentages, or sample counts for these partitions. |
| Hardware Specification | No | The paper does not specify any hardware details such as GPU models, CPU types, or cloud computing resources used for the experiments. |
| Software Dependencies | No | The paper does not specify any software versions or dependencies (e.g., Python version, specific library versions like PyTorch or TensorFlow). |
| Experiment Setup | Yes | We first trained three-layer networks with each activation to achieve zero training error (< 10⁻²⁹ in squared errors) with the minimum number of hidden units (H_0 = 5 in both models). ... We used the method of inactive units for embedding to N_H, and perturbed the whole parameters with N(0, ρ²), where ρ = 0.01‖θ^(H0)‖. (See the embedding sketch after the table.) |
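
Since the experiment is described only in prose, the following is a minimal NumPy sketch of the data-generation and training stage. It keeps the stated quantities (training size 10, output noise N(0, 10⁻²), H_0 = 5 hidden units, one-dimensional input and output), but the teacher weights, input distribution, optimizer (full-batch gradient descent), learning rate, and step count are assumptions of the sketch, not the authors' settings; in particular, plain gradient descent will not literally reach the paper's < 10⁻²⁹ training error. Passing `act=relu` instead of `np.tanh` gives the ReLU variant.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def teacher(x):
    # One-hidden-unit tanh teacher N_1; these weights are assumed for illustration.
    return 2.0 * np.tanh(1.5 * x + 0.5) - 0.3

def make_data(n=10, noise_var=1e-2):
    # Training data of size 10 with additive output noise eps ~ N(0, 1e-2).
    x = rng.uniform(-2.0, 2.0, size=n)
    y = teacher(x) + rng.normal(0.0, np.sqrt(noise_var), size=n)
    return x, y

def forward(x, params, act=np.tanh):
    # Three-layer network with H hidden units: f(x) = sum_j v_j * act(w_j x + b_j) + c.
    w, b, v, c = params
    return act(np.outer(x, w) + b) @ v + c

def train(x, y, H=5, act=np.tanh, lr=0.05, steps=100_000):
    # Full-batch gradient descent on the loss 0.5 * mean((f(x) - y)^2).
    w = rng.normal(0.0, 1.0, H); b = rng.normal(0.0, 1.0, H)
    v = rng.normal(0.0, 1.0, H); c = 0.0
    for _ in range(steps):
        h = np.outer(x, w) + b                 # (n, H) pre-activations
        a = act(h)                             # hidden activations
        r = a @ v + c - y                      # residuals f(x_i) - y_i
        da = 1.0 - a**2 if act is np.tanh else (h > 0).astype(float)
        gv = a.T @ r / len(x)
        gc = r.mean()
        gw = ((r[:, None] * v * da) * x[:, None]).mean(axis=0)
        gb = (r[:, None] * v * da).mean(axis=0)
        w -= lr * gw; b -= lr * gb; v -= lr * gv; c -= lr * gc
    return w, b, v, c

def gen_error(params, act=np.tanh, n_test=10_000):
    # Generalization error, measured here against the noiseless teacher (an assumption).
    x = rng.uniform(-2.0, 2.0, size=n_test)
    return np.mean((forward(x, params, act) - teacher(x)) ** 2)

x_train, y_train = make_data()
params = train(x_train, y_train, H=5, act=np.tanh)
print("training MSE:", np.mean((forward(x_train, params) - y_train) ** 2))
print("generalization error:", gen_error(params))
```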
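
The embedding-and-perturbation stage could then look like the sketch below. The inactive-unit embedding is rendered in its simplest function-preserving form (surplus hidden units with zero output weights), and the perturbation uses ρ = 0.01‖θ^(H0)‖ as in the excerpt; the initialization of the new units' incoming weights and the exact per-trial ratio being averaged are assumptions, so this illustrates the idea rather than reproducing the authors' code (which they state is in the Supplements).

```python
import numpy as np

rng = np.random.default_rng(1)

def embed_inactive(params, H_new):
    # Embed a trained H_0-unit network into H_new >= H_0 units without changing its function:
    # the surplus units get zero output weights, so they are "inactive" in the output.
    w, b, v, c = params
    extra = H_new - len(w)
    w_new = np.concatenate([w, rng.normal(0.0, 1.0, extra)])  # incoming weights of new units (assumed init)
    b_new = np.concatenate([b, rng.normal(0.0, 1.0, extra)])
    v_new = np.concatenate([v, np.zeros(extra)])
    return w_new, b_new, v_new, c

def perturb(params, rel_scale=0.01):
    # Add N(0, rho^2) noise to every parameter, with rho = rel_scale * ||theta||.
    w, b, v, c = params
    theta = np.concatenate([w, b, v, [c]])
    rho = rel_scale * np.linalg.norm(theta)
    return (w + rng.normal(0.0, rho, w.shape),
            b + rng.normal(0.0, rho, b.shape),
            v + rng.normal(0.0, rho, v.shape),
            c + rng.normal(0.0, rho))

# Illustrative comparison, reusing forward/gen_error/params from the previous sketch.
# Whether the H_0 baseline is also perturbed per trial is an assumption of this sketch.
# for H in (5, 10, 20, 40):
#     ratios = [gen_error(perturb(embed_inactive(params, H))) / gen_error(perturb(params))
#               for _ in range(1000)]
#     print(H, np.mean(ratios))
```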