The Implicit Bias of AdaGrad on Separable Data

Authors: Qian Qian, Xiaoyuan Qian

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We prove that the directions of AdaGrad iterates, with a sufficiently small constant step size, always converge. We formulate the asymptotic direction as the solution of a quadratic optimization problem (sketched after the table below). This yields a theoretical characterization of the implicit bias of AdaGrad and provides insight into why and how the factors involved, such as certain intrinsic properties of the dataset, the initialization, and the learning rate, affect that bias. We introduce a novel approach to studying the bias of AdaGrad, based mainly on a geometric estimate of the directions of the updates that does not depend on any convergence-rate calculation.
Researcher Affiliation | Academia | Qian Qian, Department of Statistics, Ohio State University, Columbus, OH 43210, USA (qian.216@osu.edu); Xiaoyuan Qian, School of Mathematical Sciences, Dalian University of Technology, Dalian, Liaoning 116024, China (xyqian@dlut.edu.cn)
Pseudocode | No | The paper presents mathematical equations for the AdaGrad iterates (e.g., equation (1) on page 3) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository.
Open Datasets | No | The paper mentions a training dataset ("Let {(x_n, y_n) : n = 1, ..., N} be a training dataset") but does not provide access information (e.g., a specific name, link, or citation) for a publicly available or open dataset.
Dataset Splits | No | The paper does not provide specific information about training, validation, or test dataset splits.
Hardware Specification | No | The paper conducts "numerical simulations" but does not specify any hardware details (e.g., CPU or GPU models, memory) used for them.
Software Dependencies | No | The paper does not list specific software dependencies (e.g., programming languages, libraries, or solvers) with version numbers that would be needed to replicate the simulations.
Experiment Setup | Yes | "Given two hyperparameters ε, η > 0 and an initial point w(0) ∈ R^p, we consider the diagonal AdaGrad iterates..." Numerical simulations also reveal the differences among the asymptotic directions of AdaGrad iterates with various learning rates, as shown in Figure 2, with η = 0.1 and 0.5, respectively (reproduced in spirit by the first sketch below).
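
The setup row above contains enough detail to reconstruct the simulation in spirit. Below is a minimal sketch of the diagonal AdaGrad iterates on a hypothetical linearly separable dataset, comparing the limit directions for the two learning rates reported for Figure 2 (η = 0.1 and 0.5). The exponential loss, the toy data, and the step budget are assumptions of this sketch; the paper publishes neither simulation code nor data.

```python
import numpy as np

def adagrad_direction(X, y, eta, eps=1e-8, steps=20000, w0=None):
    """Diagonal AdaGrad (the paper's eq. (1)) on the exponential loss
    L(w) = sum_n exp(-y_n <w, x_n>).  The loss is an assumption of this
    sketch; the paper's setting covers losses with exponential-type tails.
    Returns the direction w / ||w|| of the final iterate."""
    n, p = X.shape
    w = np.zeros(p) if w0 is None else np.asarray(w0, dtype=float).copy()
    acc = np.zeros(p)  # running sum of squared per-coordinate gradients
    for _ in range(steps):
        margins = y * (X @ w)
        grad = -(X.T @ (y * np.exp(-margins)))  # gradient of the exponential loss
        acc += grad ** 2
        w -= eta * grad / np.sqrt(eps + acc)    # per-coordinate adaptive step
    return w / np.linalg.norm(w)

# Hypothetical separable toy data: two well-separated Gaussian clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(20, 2)) + [3.0, 0.0],
               rng.normal(size=(20, 2)) - [3.0, 0.0]])
y = np.concatenate([np.ones(20), -np.ones(20)])

for eta in (0.1, 0.5):  # the two learning rates reported for Figure 2
    print(f"eta = {eta}: limit direction ~ {adagrad_direction(X, y, eta)}")
```

If the two printed directions differ, that is consistent with the Figure 2 observation quoted above: different learning rates can produce different asymptotic directions.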
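The quadratic characterization in the Research Type row can also be made concrete. As a sketch only: assuming the program takes the weighted max-margin form min ½ wᵀ diag(d) w subject to y_n⟨x_n, w⟩ ≥ 1 for all n (the paper's actual weighting depends on the AdaGrad trajectory; d here is a hypothetical stand-in), a generic constrained solver recovers the predicted direction. With uniform weights d = 1 the program reduces to the hard-margin SVM, the known implicit bias of plain gradient descent on separable data.

```python
import numpy as np
from scipy.optimize import minimize

def weighted_max_margin_direction(X, y, d):
    """Solve  min 0.5 * w^T diag(d) w  s.t.  y_n <x_n, w> >= 1  for all n,
    and return the normalized solution.  The diagonal weights d stand in
    for the paper's trajectory-dependent matrix (an assumed form)."""
    p = X.shape[1]
    constraints = [
        {"type": "ineq", "fun": lambda w, i=i: y[i] * (X[i] @ w) - 1.0}
        for i in range(len(y))
    ]
    res = minimize(
        fun=lambda w: 0.5 * w @ (d * w),
        jac=lambda w: d * w,
        x0=np.ones(p),                 # SLSQP tolerates an infeasible start
        method="SLSQP",
        constraints=constraints,
    )
    return res.x / np.linalg.norm(res.x)

# Tiny separable example; with d = 1 this is the hard-margin SVM direction.
X = np.array([[2.0, 1.0], [3.0, -1.0], [-2.0, -1.0], [-3.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(weighted_max_margin_direction(X, y, d=np.ones(2)))
```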