Towards Understanding the Condensation of Neural Networks at Initial Training

Authors: Hanxu Zhou, Qixuan Zhou, Tao Luo, Yaoyu Zhang, Zhi-Qin John Xu

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we illustrate the formation of the condensation in multi-layer fully connected NNs and show that the maximal number of condensed orientations in the initial training stage is twice the multiplicity of the activation function, where multiplicity indicates the multiple roots of activation function at origin. Our theoretical analysis confirms experiments for two cases, one is for the activation function of multiplicity one with arbitrary dimension input, which contains many common activation functions, and the other is for the layer with one-dimensional input and arbitrary multiplicity. (A hedged note on the multiplicity bound follows the table below.)
Researcher Affiliation | Academia | Hanxu Zhou [1], Qixuan Zhou [1], Tao Luo [1,2], Yaoyu Zhang [1,3], Zhi-Qin John Xu [1]; [1] School of Mathematical Sciences, Institute of Natural Sciences, MOE-LSC and Qing Yuan Research Institute, Shanghai Jiao Tong University; [2] CMA-Shanghai, Shanghai Artificial Intelligence Laboratory; [3] Shanghai Center for Brain Science and Brain-Inspired Technology
Pseudocode | No | The paper does not include any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] In the supplemental material.
Open Datasets | Yes | For CIFAR10 dataset: We use Resnet18-like neural network... For synthetic dataset: Throughout this work, we use fully-connected neural network with size d-m-d_out. ... The training data is 80 points sampled from Σ_{k=1}^{5} 3.5 sin(5x_k + 1)... The training data is 40 points uniformly sampled from sin(3x) + sin(6x)/2. (A data-generation sketch follows the table below.)
Dataset Splits | No | The paper specifies total data size 'n' and uses it for training (e.g., 'n = 80 training data sampled', 'n = 40 points uniformly sampled'). It does not explicitly define training, validation, and test splits by percentage or absolute counts for reproducibility, nor does it mention a dedicated validation set.
Hardware Specification | No | The checklist for the paper states: 'Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No]'.
Software Dependencies | No | The paper mentions using 'Adam optimizer' but does not specify version numbers for any software components, libraries, or programming languages.
Experiment Setup | Yes | All parameters are initialized by a Gaussian distribution N(0, var). The training method is Adam with full batch, learning rate lr and MSE loss. ... The FC layers are initialized by N(0, 1/m_out^3), and Adam optimizer with cross-entropy loss and batch size 128 are used for all experiments. The learning rate is 3×10^-8, 1×10^-8, 1×10^-8 and 5×10^-6, separately. (A training-setup sketch follows the table below.)
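Note on the multiplicity bound (referenced in the Research Type row above). The quoted claim bounds the number of condensed orientations in the initial training stage by twice the activation's multiplicity. The sketch below spells out the standard reading of "multiple roots of the activation function at origin" as the order of the zero of the activation at 0; this formalization and the tanh / x·tanh(x) examples are our own gloss, not a verbatim statement from the paper.

```latex
% Hedged gloss: multiplicity p = order of the zero of the activation \sigma at the origin.
\[
  \sigma^{(k)}(0) = 0 \ \text{for } k = 0, 1, \dots, p-1,
  \qquad \sigma^{(p)}(0) \neq 0
  \quad\Longrightarrow\quad \operatorname{mult}(\sigma) = p .
\]
% Quoted claim: the maximal number of condensed orientations at the initial
% training stage is twice the multiplicity.
\[
  \#\{\text{condensed orientations}\} \;\le\; 2p .
\]
% Illustrative examples (our own): tanh has p = 1 (at most 2 orientations);
% x tanh(x) has p = 2 (at most 4 orientations).
```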
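Synthetic-data sketch (referenced in the Open Datasets row above). A minimal reconstruction of the two synthetic regression targets quoted from the paper. The input range [-1, 1], the uniform random sampling of the 5-D inputs, and the evenly spaced grid used for the 1-D "uniformly sampled" points are assumptions; the paper's supplemental material remains the authority on the exact sampling scheme.

```python
# Hedged sketch of the two synthetic regression targets quoted above.
import numpy as np

rng = np.random.default_rng(0)

# 5-D target: sum_{k=1}^{5} 3.5 * sin(5 * x_k + 1), 80 training points.
x5 = rng.uniform(-1.0, 1.0, size=(80, 5))            # assumed input range and sampling
y5 = (3.5 * np.sin(5.0 * x5 + 1.0)).sum(axis=1, keepdims=True)

# 1-D target: sin(3x) + sin(6x) / 2, 40 "uniformly sampled" training points
# (interpreted here as an even grid; could also be uniform random draws).
x1 = np.linspace(-1.0, 1.0, 40).reshape(-1, 1)
y1 = np.sin(3.0 * x1) + np.sin(6.0 * x1) / 2.0
```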
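Training-setup sketch (referenced in the Experiment Setup row above). A minimal PyTorch sketch of the synthetic-data setup quoted from the paper: every parameter drawn from N(0, var), full-batch Adam, MSE loss. The network size, the tanh activation, and the var and lr values are placeholders rather than the paper's swept hyperparameters, and the random tensors stand in for the actual training data.

```python
# Hedged sketch: Gaussian N(0, var) initialization of all parameters,
# full-batch Adam with MSE loss on a small fully-connected network.
import torch
import torch.nn as nn

def make_mlp(d_in=5, width=50, d_out=1, var=1e-4):
    net = nn.Sequential(nn.Linear(d_in, width), nn.Tanh(),
                        nn.Linear(width, d_out))
    for p in net.parameters():
        nn.init.normal_(p, mean=0.0, std=var ** 0.5)  # N(0, var) on every parameter
    return net

net = make_mlp()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)      # lr is a placeholder value
loss_fn = nn.MSELoss()

x = torch.randn(80, 5)   # stand-in for the 80 synthetic training inputs
y = torch.randn(80, 1)   # stand-in for the corresponding targets

for step in range(1000):                               # full-batch training loop
    opt.zero_grad()
    loss = loss_fn(net(x), y)
    loss.backward()
    opt.step()
```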