The Implicit Bias of Gradient Descent on Separable Data

Authors: Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Nathan Srebro

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Figure 1: Visualization of our main results on a synthetic dataset in which the L2 max margin vector ŵ is precisely known. (A) The dataset..." "Figure 3: Training of a convolutional neural network on CIFAR10 using stochastic gradient descent with constant learning rate and momentum, softmax output and a cross entropy loss, where we achieve 8.3% final validation error." "Table 1: Sample values from various epochs in the experiment depicted in Fig. 3."
Researcher Affiliation | Academia | "Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Department of Electrical Engineering, Technion, Haifa, 320003, Israel... Nathan Srebro, Toyota Technological Institute at Chicago, Chicago, Illinois 60637, USA"
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "Code available here: https://github.com/paper-submissions/Max_Margin"
Open Datasets | Yes | "Figure 3: Training of a convolutional neural network on CIFAR10 using stochastic gradient descent with constant learning rate and momentum, softmax output and a cross entropy loss, where we achieve 8.3% final validation error."
Dataset Splits | Yes | "The increase in the test loss is practically important because the loss on a validation set is frequently used to monitor progress and decide on stopping. Similar to the population loss, the validation loss Lval(w(t)) = Σ_{x∈V} ℓ(w(t)⊤x), calculated on an independent validation set V, will increase logarithmically with t (since we would not expect zero validation error)..." (This behavior is illustrated in the second sketch after the table.)
Hardware Specification | No | The paper does not specify any hardware details such as GPU models, CPU types, or memory used for running the experiments.
Software Dependencies | No | The paper mentions optimizers like ADAM and AdaGrad, and implicitly uses frameworks like PyTorch (from the code link), but it does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | "Implementation details: The dataset includes four support vectors... We used a learning rate η = 1/σmax(X), where σmax(X) is the maximal singular value of X, momentum γ = 0.9 for GDMO, and initialized at the origin." (A minimal sketch of this setup follows the table.)
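
To ground the Experiment Setup row, the following is a minimal sketch of the paper's synthetic gradient-descent experiment, not the authors' released code: the data-generating distribution, dimensions, and iteration counts are illustrative assumptions, while the learning rate η = 1/σmax(X), momentum γ = 0.9 for GDMO, and origin initialization come from the quoted setup.

```python
# Minimal sketch (not the authors' released code) of gradient descent on
# logistic loss over separable data; w(t)/||w(t)|| should approach the
# L2 max-margin direction while ||w(t)|| grows like log t.
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

rng = np.random.default_rng(0)

# Illustrative separable data: labels are folded into the rows (x_i <- y_i x_i),
# so the loss is (1/n) * sum_i log(1 + exp(-w^T x_i)).
n, d = 100, 2
X = rng.normal(size=(n, d)) + 3.0  # all points on the positive side of w = (1, 1)

eta = 1.0 / np.linalg.norm(X, 2)   # eta = 1 / sigma_max(X), as quoted above
gamma = 0.9                        # momentum for the GDMO variant, as quoted

def grad(w):
    # Gradient of the average logistic loss: -(1/n) * X^T sigmoid(-Xw).
    return -(X.T @ expit(-(X @ w))) / n

# Plain gradient descent, initialized at the origin.
w = np.zeros(d)
for t in range(100_000):
    w -= eta * grad(w)

# Gradient descent with momentum (GDMO), also from the origin.
w_m, v = np.zeros(d), np.zeros(d)
for t in range(100_000):
    v = gamma * v - eta * grad(w_m)
    w_m += v

for name, vec in [("GD", w), ("GDMO", w_m)]:
    print(name, "direction:", vec / np.linalg.norm(vec), "norm:", np.linalg.norm(vec))
```

Both variants should report nearly the same direction: per the paper, w(t)/‖w(t)‖ converges to the L2 max-margin separator while ‖w(t)‖ itself diverges logarithmically, which is why accuracy saturates long before the loss stops moving.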
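
The Dataset Splits row quotes the paper's warning that the validation loss keeps growing even as the direction converges. Here is a hedged illustration of that claim in the same setup; the held-out set V and the deliberately misclassified point are assumptions made for the sketch.

```python
# Sketch of the quoted claim: L_val(w(t)) = sum_{x in V} l(w(t)^T x) grows
# roughly logarithmically in t once some validation point stays misclassified,
# because ||w(t)|| ~ log t multiplies that point's negative margin.
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)
n, d = 100, 2
X = rng.normal(size=(n, d)) + 3.0   # training set (as in the sketch above)
V = rng.normal(size=(20, d)) + 3.0  # assumed held-out validation set
V[0] = (-0.5, -0.5)                 # one point no separator of X classifies correctly

eta = 1.0 / np.linalg.norm(X, 2)
w = np.zeros(d)

def val_loss(w):
    # Sum over V of log(1 + exp(-w^T x)), computed stably.
    return np.logaddexp(0.0, -(V @ w)).sum()

for t in range(1, 100_001):
    w += eta * (X.T @ expit(-(X @ w))) / n  # gradient descent step
    if t in (10, 100, 1_000, 10_000, 100_000):
        print(f"t={t:>6d}  val loss={val_loss(w):.3f}")
```

The misclassified point's loss term scales with ‖w(t)‖ ≈ O(log t), so the printed values keep rising even as the training loss vanishes, matching the quoted caution about using the validation loss to decide on stopping.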