Gradient Descent Maximizes the Margin of Homogeneous Neural Networks

Authors: Kaifeng Lyu, Jian Li

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct several experiments to justify our theoretical finding on MNIST and CIFAR-10 datasets." (Abstract) and, from the Experiments section: "The main practical implication of our theoretical result is that training longer can enlarge the normalized margin. To justify this claim empirically, we train CNNs on MNIST and CIFAR-10 with SGD (see Section K.1)."
Researcher Affiliation | Academia | Kaifeng Lyu & Jian Li, Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China. Contact: vfleaking@gmail.com, lijian83@mail.tsinghua.edu.cn
Pseudocode | No | The paper describes procedures in paragraph form (e.g., in Appendix L.1 for loss-based learning rate scheduling) but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/vfleaking/max-margin
Open Datasets | Yes | "We conduct several experiments to justify our theoretical finding on MNIST and CIFAR-10 datasets."
Dataset Splits | No | The paper mentions training on MNIST and CIFAR-10 and evaluating test accuracy, but it does not explicitly describe validation splits (e.g., percentages, sample counts, or split methodology).
Hardware Specification | No | The paper does not specify the hardware used for its experiments (e.g., GPU/CPU models or processor speeds).
Software Dependencies | No | "We trained two models with Tensorflow." (Section K). The paper mentions TensorFlow but does not give a version number, nor does it list other software dependencies with versions.
Experiment Setup | Yes | "In training the models, we use SGD with batch size 100 without momentum. We initialize all layer weights by He normal initializer (He et al., 2015) and all bias terms by zero." (Section K). "In all our experiments, we set α(0) := 0.1, r_u := 2^(1/5) ≈ 1.149, r_d := 2^(1/10) ≈ 1.072." (Appendix L.1).
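
For illustration only, a minimal TensorFlow/Keras sketch of the quoted training setup (SGD with batch size 100, no momentum, initial learning rate 0.1, He normal weight initialization, zero biases). The CNN architecture below is a placeholder assumption, not the authors' model; the exact models and the loss-based learning rate schedule of Appendix L.1 are in the authors' repository (https://github.com/vfleaking/max-margin) and are not reproduced here.

    import tensorflow as tf

    def build_placeholder_cnn(num_classes=10):
        # Placeholder architecture; the paper's actual CNNs are in the authors' repository.
        init = tf.keras.initializers.HeNormal()  # "He normal initializer"
        return tf.keras.Sequential([
            tf.keras.layers.Conv2D(32, 3, activation="relu",
                                   kernel_initializer=init, bias_initializer="zeros"),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(num_classes,
                                  kernel_initializer=init, bias_initializer="zeros"),
        ])

    model = build_placeholder_cnn()
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.0),  # alpha(0) = 0.1, no momentum
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = (x_train / 255.0)[..., None].astype("float32")  # scale to [0, 1], add channel dim
    model.fit(x_train, y_train, batch_size=100, epochs=1)  # "batch size 100"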