Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data

Authors: Spencer Frei, Gal Vardi, Peter Bartlett, Nathan Srebro, Wei Hu

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide experiments which suggest that a small initialization scale is important for finding low-rank neural networks with gradient descent.
Researcher Affiliation | Collaboration | Spencer Frei (UC Berkeley, frei@berkeley.edu); Gal Vardi (TTI Chicago and Hebrew University, galvardi@ttic.edu); Peter L. Bartlett (UC Berkeley and Google, peter@berkeley.edu); Nathan Srebro (TTI Chicago, nati@ttic.edu); Wei Hu (University of Michigan, vvh@umich.edu)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not include an explicit statement about releasing source code for the methodology or provide a link to a code repository.
Open Datasets | Yes | We then consider the stable rank of two-layer networks trained by SGD for the CIFAR10 dataset... We use the standard 10-class CIFAR10 dataset with pixel values normalized to be between 0 and 1 (dividing each pixel value by 255).
Dataset Splits | No | The paper mentions 'val_acc' in figures (e.g., Figures 2, 5, and 6), implying the use of a validation set, but it does not specify how that split was created (e.g., percentages, sample counts, or the splitting methodology) for reproducibility.
Hardware Specification | No | The paper does not specify the exact hardware used for running the experiments (e.g., specific GPU or CPU models, memory details, or cloud instance types).
Software Dependencies | No | The paper mentions the 'TensorFlow default initialization' but does not provide specific version numbers for TensorFlow or any other software libraries or dependencies used in the experiments.
Experiment Setup | Yes | We fix n = 100 samples with mean separation µ = d^0.26 with each entry of µ identical and positive... For the figure on the left, the initialization is a normal distribution with standard deviation that is 50 times smaller than the TensorFlow default initialization, that is, ω_init = (1/50) ω_TF-init where ω_TF-init = sqrt(2/(m + d)). For the figure on the right, we fix d = 10^4 and vary the initialization standard deviation over different multiples of ω_TF-init, so that the variance is between (10^-2 ω_TF-init)^2 and (10^2 ω_TF-init)^2. For the experiment on the effect of dimension, we use a fixed learning rate of α = 0.01, while for the experiment on the effect of the initialization scale we use a learning rate of α = 0.16... We train for T = 10^6 steps with SGD with batch size 128 and a learning rate of α = 0.01.
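
To make the Experiment Setup row concrete, the sketch below builds the scaled initialization it describes in TensorFlow/Keras (the framework the paper's "TensorFlow default initialization" points to). Only the standard deviation sqrt(2/(m + d)) / 50, the SGD optimizer, and the learning rates come from the quoted text; the hidden width m, the leaky ReLU slope, the scalar output head, and the logistic loss are illustrative assumptions, not details taken from the paper.

```python
import tensorflow as tf

# Placeholder sizes: the quoted text fixes d = 10^4 for one experiment but does not
# pin the hidden width m in this excerpt.
m, d = 512, 10_000

# TensorFlow's default (Glorot normal) standard deviation is sqrt(2 / (m + d));
# the paper shrinks it by a factor of 50 for the dimension experiment.
std_tf_default = (2.0 / (m + d)) ** 0.5
std_init = std_tf_default / 50.0
init = tf.keras.initializers.RandomNormal(mean=0.0, stddev=std_init)

# Leaky ReLU slope 0.1 is an assumed value, not taken from the paper.
leaky = lambda x: tf.nn.leaky_relu(x, alpha=0.1)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(m, activation=leaky, kernel_initializer=init,
                          input_shape=(d,)),
    tf.keras.layers.Dense(1, kernel_initializer=init),  # assumed scalar head for the binary-cluster data
])

# Learning rates quoted in the paper: 0.01 (effect of dimension), 0.16 (effect of init scale).
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))
```

For the CIFAR-10 run described in the same row, the quoted settings are T = 10^6 SGD steps with batch size 128 and learning rate 0.01.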
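
The Open Datasets row mentions both the CIFAR-10 preprocessing (each pixel value divided by 255) and the stable rank of the trained two-layer networks. Below is a minimal sketch of both, assuming the standard definition of stable rank, ||W||_F^2 / ||W||_2^2; the helper name and the commented usage lines are illustrative, not the authors' code.

```python
import numpy as np
import tensorflow as tf

def stable_rank(w):
    """Stable rank ||W||_F^2 / ||W||_2^2, computed from the singular values of W."""
    sv = np.linalg.svd(w, compute_uv=False)       # singular values, descending order
    return float(np.sum(sv ** 2) / sv[0] ** 2)

# CIFAR-10 with the normalization quoted in the paper: pixel values scaled to [0, 1].
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train = x_train.reshape(len(x_train), -1).astype("float32") / 255.0
x_test = x_test.reshape(len(x_test), -1).astype("float32") / 255.0

# Hypothetical usage on a trained two-layer Keras model `model`:
# w1 = model.layers[0].get_weights()[0]           # first-layer weight matrix
# print("stable rank of first layer:", stable_rank(w1))
```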