Kernelized Wasserstein Natural Gradient
Authors: M Arbel, A Gretton, W Li, G Montufar
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We verify its accuracy on simple examples, and show the advantage of using such an estimator in classification tasks on Cifar10 and Cifar100 empirically. This section presents an empirical evaluation of (KWNG) based on (19). Figure 3 shows the training and test accuracy at each epoch on Cifar10 in both (WC) and (IC) cases. |
| Researcher Affiliation | Academia | Michael Arbel, Arthur Gretton Gatsby Computational Neuroscience Unit University College London {michael.n.arbel,arthur.gretton}@gmail.com Wuchen Li University of California, Los Angeles wcli@math.ucla.edu Guido Montúfar University of California, Los Angeles, and Max Planck Institute for Mathematics in the Sciences montufar@mis.mpg.de |
| Pseudocode | No | The paper describes the proposed method but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code for the experiments is available at https://github.com/MichaelArbel/KWNG. |
| Open Datasets | Yes | We consider a classification task on two datasets Cifar10 and Cifar100 with a Residual Network He et al. (2015). |
| Dataset Splits | No | The paper mentions training and test accuracy but does not explicitly provide details about train/validation/test dataset splits or proportions. |
| Hardware Specification | No | The paper does not specify the hardware (e.g., CPU, GPU models) used for running the experiments. |
| Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers (e.g., 'PyTorch 1.9', 'CUDA 11.1'). |
| Experiment Setup | Yes | For all methods, we used a batch size of 128. The optimal step-size γ was selected in {10, 1, 10⁻¹, 10⁻², 10⁻³, 10⁻⁴} for each method. For SGD with momentum, we used a momentum parameter of 0.9 and a weight decay of either 0 or 5×10⁻⁴. For KFAC and EKFAC, we used a damping coefficient of 10⁻³ and a reparametrization frequency of 100 updates. For KWNG we set M = 5 and λ = 0, while ϵ is initialized to 10⁻⁵ and adjusted using an adaptive scheme based on the Levenberg-Marquardt dynamics, as in (Martens and Grosse, 2015, Section 6.5). More precisely, after every 5 iterations of the optimizer, ϵ is updated as ϵ ← ωϵ if r > 3/4, and ϵ ← ω⁻¹ϵ if r < 1/4. Here, r is the reduction ratio, computed from the decrease of the loss L(θ_t) − L(θ_{t+1}) over the window t_{k−1} ≤ t ≤ t_k, where (t_k)_k are the times when the updates occur, and ω is the decay constant, set to ω = 0.85. |
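The adaptive damping rule quoted in the Experiment Setup row can be sketched as below. This is an illustrative reading of the Levenberg-Marquardt-style heuristic, not the authors' code: the function name `adapt_epsilon`, the threshold constants 3/4 and 1/4, and the call structure are assumptions based on the quoted description, with ω = 0.85 as reported.

```python
def adapt_epsilon(eps, reduction_ratio, omega=0.85):
    """Levenberg-Marquardt-style damping adjustment (illustrative sketch).

    Shrink eps when the loss is decreasing well (r > 3/4),
    grow it when progress stalls (r < 1/4), otherwise leave it unchanged.
    The paper applies this update every 5 optimizer iterations.
    """
    if reduction_ratio > 0.75:
        return omega * eps        # good progress: trust the step, damp less
    if reduction_ratio < 0.25:
        return eps / omega        # poor progress: damp more
    return eps                    # in-between: keep eps as is


# Usage sketch: eps starts at 1e-5 as in the paper, then is adjusted
# from the observed reduction ratio every 5 iterations.
eps = 1e-5
for r in [0.9, 0.5, 0.1]:
    eps = adapt_epsilon(eps, r)
```

The multiplicative form means ϵ drifts geometrically toward whatever damping level keeps the reduction ratio in the middle band, which is the usual behavior of Levenberg-Marquardt schedules.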