Surfing: Iterative Optimization Over Incrementally Trained Deep Networks

Authors: Ganlin Song, Zhou Fan, John Lafferty

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We present experiments to illustrate the performance of surfing over a sequence of networks during training compared with gradient descent over the final trained network. ... Table 1 shows the percentage of trials where the solutions x̂_T satisfy our criterion for successful recovery ‖x̂_T - x‖ < 0.01, for different models and over three different input dimensions k." (A sketch of this success tally appears after the table.)
Researcher Affiliation | Academia | Ganlin Song, Department of Statistics and Data Science, Yale University, ganlin.song@yale.edu; Zhou Fan, Department of Statistics and Data Science, Yale University, zhou.fan@yale.edu; John Lafferty, Department of Statistics and Data Science, Yale University, john.lafferty@yale.edu
Pseudocode | Yes | "Algorithm 1 Surfing ... Algorithm 2 Projected-gradient Surfing"
Open Source Code | No | The paper does not provide any statement or link regarding the public availability of its source code.
Open Datasets | Yes | "We mainly use the Fashion-MNIST dataset to carry out the simulations, which is similar to MNIST in many characteristics, but is more difficult to train." (A loading sketch with the dataset's standard split appears after the table.)
Dataset Splits | No | The paper does not provide specific details on train/validation/test splits for the Fashion-MNIST dataset, nor does it refer to standard splits for reproduction.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments.
Software Dependencies | No | The paper mentions methods such as Adam and batch normalization and generative model architectures such as VAE, DCGAN, WGAN, and WGAN-GP, but does not name specific software libraries or version numbers.
Experiment Setup | Yes | "We run surfing by taking a sequence of parameters θ_0, θ_1, ..., θ_T for T = 100, where θ_0 are the initial random parameters and the intermediate θ_t's are taken every 40 training steps, and we use Adam (Kingma and Ba, 2014) to carry out gradient descent in x over each network G_{θ_t}. ... The total number of iterations for networks G_{θ_0}, ..., G_{θ_{T-1}} is set as the 75th-percentile of the iteration count required for convergence of regular Adam. These are split across the networks proportional to a deterministic schedule that allots more steps to the earlier networks, where the landscape of G(x) changes more rapidly, and fewer steps to later networks, where this landscape stabilizes."
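
The surfing schedule quoted in the Experiment Setup row can be summarized in a short sketch. This is a minimal illustration, not the authors' code: the checkpoint format, learning rate, and the exact step-allocation weights are assumptions; the paper only states that Adam is run in x over each intermediate network G_{θ_t}, with more steps allotted to the earlier networks.

# Minimal sketch of the surfing loop, assuming `checkpoints` holds the saved
# parameter snapshots θ_0, ..., θ_T, `G` is the generator module, and `y` is the
# target observation. The decaying step-allocation weights are illustrative only.
import torch

def surfing(G, checkpoints, y, k, total_steps, lr=1e-2):
    x = torch.zeros(k, requires_grad=True)          # optimization variable in input space
    T = len(checkpoints) - 1
    weights = torch.tensor([1.0 / (t + 1) for t in range(T)])   # more steps early, fewer late (assumed)
    steps = (total_steps * weights / weights.sum()).round().long()
    for t in range(T):                              # surf over G_{θ_0}, ..., G_{θ_{T-1}}
        G.load_state_dict(checkpoints[t])
        opt = torch.optim.Adam([x], lr=lr)          # Adam state reset per network (assumption)
        for _ in range(int(steps[t])):
            opt.zero_grad()
            loss = ((G(x) - y) ** 2).sum()          # squared-error objective ||G_θ(x) - y||^2
            loss.backward()
            opt.step()
        # x is carried over (warm-started) to the next network in the training sequence
    G.load_state_dict(checkpoints[T])               # finish on the final trained network
    return x.detach()

In practice the final network G_{θ_T} would then be optimized to convergence from the surfed x, which the paper compares against running Adam on the final trained network from a random start.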
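For the recovery criterion quoted in the Research Type row, the reported percentages are the fraction of random trials whose final solution falls within the 0.01 threshold of the true input. A hypothetical tally (the variable names are assumptions; only the ‖x̂_T - x‖ < 0.01 threshold comes from the paper):

# Hypothetical success tally: only the 0.01 recovery threshold is taken from the paper.
import torch

def success_rate(x_hats, x_trues, tol=0.01):
    # Fraction of trials with ||x_hat_T - x_true|| < tol, reported as a percentage
    hits = [float(torch.norm(xh - xt).item() < tol) for xh, xt in zip(x_hats, x_trues)]
    return 100.0 * sum(hits) / len(hits)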
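Since the paper names Fashion-MNIST but gives no split details, reproduction would most likely fall back on the dataset's standard distribution, which ships with a fixed 60,000-image training set and 10,000-image test set. A loading sketch using torchvision (the transform and root directory are arbitrary choices):

# Standard Fashion-MNIST loaders via torchvision; the 60k/10k train/test split
# is the dataset's own convention, not something specified in the paper.
from torchvision import datasets, transforms

transform = transforms.ToTensor()
train_set = datasets.FashionMNIST(root="./data", train=True, download=True, transform=transform)
test_set = datasets.FashionMNIST(root="./data", train=False, download=True, transform=transform)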