How Does Adaptive Optimization Impact Local Neural Network Geometry?
Authors: Kaiqi Jiang, Dhruv Malik, Yuanzhi Li
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments on language models where adaptive algorithms converge faster than vanilla gradient methods like SGD, we show that adaptive methods such as Adam bias the trajectories towards regions where $R^{\text{med}}_{\text{Adam}}$ is small... We complement these empirical observations with a theoretical result that provably demonstrates this phenomenon in the simplified setting of a two-layer linear network. |
| Researcher Affiliation | Academia | Kaiqi Jiang Department of Electrical and Computer Engineering Princeton University Princeton, NJ 08540 kaiqij@princeton.edu Dhruv Malik Machine Learning Department Carnegie Mellon University Pittsburgh, PA 15213 dhruvm@andrew.cmu.edu Yuanzhi Li Machine Learning Department Carnegie Mellon University Pittsburgh, PA 15213 yuanzhil@andrew.cmu.edu |
| Pseudocode | No | The paper presents the update rules for SGD+M and Adam as mathematical formulas, but does not provide structured pseudocode or algorithm blocks. (A hedged sketch of these standard update rules is given below the table.) |
| Open Source Code | No | The paper does not include any explicit statements about releasing source code or links to a code repository. |
| Open Datasets | Yes | We fine-tuned BERT-small [TCLT19, BDR21] on the IMDB dataset [MDP+11]... We trained a Seq2Seq network that uses Transformer to solve a machine translation task on Multi30k [EFSS16]... We trained a transformer model to solve this problem on Wikitext-2 [MXBS17] |
| Dataset Splits | No | The paper specifies training duration in epochs and iterations (e.g., '10 epochs (2500 iterations)' or '60 epochs (1800 iterations)'), but does not provide specific dataset split information (percentages, counts, or methodology) for training, validation, or test sets. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU or CPU models, processor types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions PyTorch (e.g., 'default initialization in Pytorch'), but does not specify version numbers for PyTorch or any other software dependencies, which are required for reproducibility. |
| Experiment Setup | Yes | The momentum parameter β in SGD was set as 0.9. The two momentum parameters (β1, β2) of Adam were set as (0.9, 0.999). We trained the model using linearly decreasing learning rates for 10 epochs (2500 iterations). The initial learning rates of SGD+M and Adam were 0.001 and 5e-5, respectively. ... We used large batch sizes to make the training procedure stable. When computing Hessian, we also used large batch sizes. Due to the extremely large dimension, we did the computation on some uniformly selected coordinates, more precisely, 200 coordinates per layer. (A hedged configuration sketch based on these reported values is given below the table.) |
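
Since the paper describes the SGD+M and Adam updates only as formulas, the following is a minimal NumPy sketch of the standard update rules those formulas correspond to. The exact formulation (PyTorch-style momentum buffer, Adam bias correction, the ε constant) is an assumption for illustration, not taken from the paper.

```python
import numpy as np

def sgdm_step(x, grad, buf, lr=1e-3, beta=0.9):
    """One SGD+M step (PyTorch-style momentum buffer; assumed formulation)."""
    buf = beta * buf + grad           # accumulate momentum
    x = x - lr * buf                  # parameter update
    return x, buf

def adam_step(x, grad, m, v, t, lr=5e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step with bias correction (standard formulation; t is 1-indexed)."""
    m = beta1 * m + (1 - beta1) * grad           # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                 # bias-corrected second moment
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)  # coordinate-wise adaptive update
    return x, m, v
```

The β values shown as defaults match those reported in the Experiment Setup row: β = 0.9 for SGD+M and (β1, β2) = (0.9, 0.999) for Adam.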
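
The hyperparameters quoted in the Experiment Setup row can be translated into the following PyTorch sketch. The placeholder model, the LambdaLR scheduler choice, and the training-loop skeleton are illustrative assumptions; only the learning rates, momentum/beta values, and the 2500-iteration linear decay come from the paper.

```python
import torch

# Placeholder model; the paper fine-tunes BERT-small and trains Transformer models.
model = torch.nn.Linear(10, 2)
total_iters = 2500  # "10 epochs (2500 iterations)" for the IMDB fine-tuning run

# The two optimizers correspond to separate training runs in the paper.
sgdm = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
adam = torch.optim.Adam(model.parameters(), lr=5e-5, betas=(0.9, 0.999))

# Linearly decreasing learning rate over the whole run (scheduler choice is assumed).
def linear_decay(step):
    return max(0.0, 1.0 - step / total_iters)

scheduler = torch.optim.lr_scheduler.LambdaLR(adam, lr_lambda=linear_decay)

for step in range(total_iters):
    # loss = criterion(model(inputs), targets)  # large batches, per the paper
    # loss.backward()
    adam.step()        # no-op here since no gradients are computed; shown for structure
    adam.zero_grad()
    scheduler.step()
```

The paper additionally reports estimating the Hessian on 200 uniformly selected coordinates per layer, which is not reproduced in this sketch.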