Gradient Descent Maximizes the Margin of Homogeneous Neural Networks
Authors: Kaifeng Lyu, Jian Li
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct several experiments to justify our theoretical finding on MNIST and CIFAR-10 datasets. (Abstract) The main practical implication of our theoretical result is that training longer can enlarge the normalized margin. To justify this claim empirically, we train CNNs on MNIST and CIFAR-10 with SGD (see Section K.1). (Experiments) |
| Researcher Affiliation | Academia | Kaifeng Lyu & Jian Li, Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China. vfleaking@gmail.com, lijian83@mail.tsinghua.edu.cn |
| Pseudocode | No | The paper describes procedures in paragraph form (e.g., in Appendix L.1 for loss-based learning rate scheduling) but does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code available: https://github.com/vfleaking/max-margin |
| Open Datasets | Yes | We conduct several experiments to justify our theoretical finding on MNIST and CIFAR-10 datasets. |
| Dataset Splits | No | The paper mentions training on MNIST and CIFAR-10 datasets and evaluating test accuracy, but it does not explicitly provide details on validation dataset splits (e.g., percentages, sample counts, or specific methodology). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | We trained two models with Tensorflow. (Section K). The paper mentions TensorFlow but does not provide a specific version number, nor does it list other software components with version numbers. |
| Experiment Setup | Yes | In training the models, we use SGD with batch size 100 without momentum. We initialize all layer weights by He normal initializer (He et al., 2015) and all bias terms by zero. (Section K). In all our experiments, we set α(0) := 0.1, r_u := 2^{1/5} ≈ 1.149, r_d := 2^{1/10} ≈ 1.072. (Appendix L.1). |
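
For concreteness, the setup quoted in the Experiment Setup row (SGD with batch size 100 and no momentum, He normal weight initialization, zero bias initialization, α(0) = 0.1, r_u = 2^{1/5}, r_d = 2^{1/10}) could be wired up along the lines of the sketch below. This is not the authors' released code (see the repository linked above): the CNN architecture is a placeholder, and the loss-based learning-rate rule shown here (multiply the rate by r_u when the epoch loss improves, divide it by r_d otherwise) is only an assumed simplification of the procedure described in Appendix L.1 of the paper.

```python
# Minimal sketch, not the authors' released code: it wires up the quoted
# hyperparameters (SGD, batch size 100, no momentum, He normal weights,
# zero biases, alpha(0) = 0.1, r_u = 2**(1/5), r_d = 2**(1/10)).
# The CNN below is a placeholder architecture, and the loss-based
# learning-rate rule is an assumed simplification of Appendix L.1.
import tensorflow as tf

ALPHA_0 = 0.1           # initial learning rate alpha(0)
R_UP = 2 ** (1 / 5)     # ~1.149, factor used to increase the learning rate
R_DOWN = 2 ** (1 / 10)  # ~1.072, factor used to decrease the learning rate


def build_cnn(input_shape=(28, 28, 1), num_classes=10):
    """Placeholder CNN; the architectures actually used are in Section K.1."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu',
                               kernel_initializer='he_normal',
                               bias_initializer='zeros',
                               input_shape=input_shape),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(num_classes,
                              kernel_initializer='he_normal',
                              bias_initializer='zeros'),
    ])


class LossBasedLR(tf.keras.callbacks.Callback):
    """Assumed loss-based schedule: multiply the learning rate by r_u when the
    epoch loss improves, otherwise divide it by r_d."""

    def __init__(self):
        super().__init__()
        self.best_loss = float('inf')

    def on_epoch_end(self, epoch, logs=None):
        loss = logs['loss']
        lr = float(self.model.optimizer.learning_rate)
        new_lr = lr * R_UP if loss < self.best_loss else lr / R_DOWN
        self.best_loss = min(self.best_loss, loss)
        self.model.optimizer.learning_rate.assign(new_lr)


if __name__ == '__main__':
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train[..., None].astype('float32') / 255.0

    model = build_cnn()
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=ALPHA_0, momentum=0.0),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy'],
    )
    model.fit(x_train, y_train, batch_size=100, epochs=10,
              callbacks=[LossBasedLR()])
```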