EigenGame Unloaded: When playing games is better than optimizing

Authors: Ian Gemp, Brian McWilliams, Claire Vernade, Thore Graepel

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate its performance with extensive experiments including dimensionality reduction of massive data sets and clustering a large social network graph.
Researcher Affiliation | Industry | Ian Gemp, Brian McWilliams, Claire Vernade & Thore Graepel, DeepMind, London, UK. {imgemp,bmcw,vernade}@deepmind.com, thoregraepel@gmail.com
Pseudocode | Yes | Algorithm 1 presents pseudocode for µ-EigenGame where computation is parallelized over the k players. (A JAX sketch of this style of update appears after the table.)
Open Source Code | No | For the sake of reproducibility we have included pseudocode in Jax. We use the Optax optimization library (Hessel et al., 2020) and the Jaxline training framework.
Open Datasets | Yes | We compare µ-EigenGame against α-EigenGame, GHA (Sanger, 1989), Matrix Krasulina (Tang, 2019), and Oja's algorithm (Allen-Zhu and Li, 2017) on the MNIST dataset. ... The dataset consists of a subset of the 40 billion words used to train the transformer-based Meena language model (Adiwardana et al., 2020). ... The Facebook graph consists of 134,833 nodes, 1,380,293 edges, and 8 connected components... (Leskovec and Krevl, 2014; Rozemberczki et al., 2019).
Dataset Splits | No | For MNIST, it states 'Learning rates were chosen from {10^-3, ..., 10^-6} on 10 held out runs,' which implies hyperparameter tuning, but it does not specify explicit training/validation/test dataset split percentages or sample counts. It refers to a 'training set' but provides no details on how it was split.
Hardware Specification | Yes | Specifically we consider the parallel framework specified by TPUv3 available in Google Cloud... We use minibatches of size 4,096 in each TPU. We do model parallelism across 4 TPUs... The experiment was run on a single CPU.
Software Dependencies | No | The paper mentions using the 'Optax optimization library' and 'Jaxline training framework' but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | We use minibatches of size 4,096 in each TPU. We compute and apply updates using SGD with a learning rate of 5 × 10^-5 and Nesterov momentum with a factor of 0.9. ... Learning rates were chosen from {10^-3, ..., 10^-6} on 10 held out runs. (An Optax configuration sketch appears after the table.)
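
The Pseudocode row notes that Algorithm 1 parallelizes the µ-EigenGame computation over the k players. Below is a minimal JAX sketch of that style of update, assuming the µ-EigenGame utility u_i = ⟨v_i, M v_i⟩ − Σ_{j<i} ⟨v_i, M v_j⟩⟨v_i, v_j⟩ with unit-norm players; the function names (player_grad, mu_eigengame_step) and the vmap-based parallelization are illustrative choices, not the paper's released pseudocode.

```python
# Sketch of a µ-EigenGame-style parallel player update (illustrative, not the
# authors' code). V holds k unit-norm eigenvector estimates as rows; M is a
# (minibatch estimate of a) symmetric d x d matrix.
import jax
import jax.numpy as jnp


def player_grad(V, M, i):
    """Riemannian gradient of player i's utility, parents j < i held fixed."""
    v_i = V[i]                               # (d,) current estimate for player i
    MV = M @ V.T                             # (d, k): column j is M v_j
    mask = (jnp.arange(V.shape[0]) < i).astype(V.dtype)  # 1 for parents j < i
    inner = V @ v_i                          # (k,): <v_j, v_i>
    inner_m = MV.T @ v_i                     # (k,): <v_i, M v_j> (M symmetric)
    reward = 2.0 * MV[:, i]                  # gradient of <v_i, M v_i>
    penalty = MV @ (mask * inner) + V.T @ (mask * inner_m)
    grad = reward - penalty
    return grad - jnp.dot(grad, v_i) * v_i   # project onto the sphere's tangent space


@jax.jit
def mu_eigengame_step(V, M, lr):
    """One ascent step for all k players in parallel, then retract to unit norm."""
    grads = jax.vmap(lambda i: player_grad(V, M, i))(jnp.arange(V.shape[0]))
    V_new = V + lr * grads
    return V_new / jnp.linalg.norm(V_new, axis=1, keepdims=True)
```

In practice M would be replaced by unbiased minibatch estimates and the vmap could be swapped for device-level parallelism over players, but those details are beyond this sketch.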
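
The Experiment Setup row reports SGD with a learning rate of 5 × 10^-5 and Nesterov momentum of 0.9, applied through Optax. The snippet below is a hedged sketch of that configuration; the parameter shapes, the apply_step helper, and the unit-sphere retraction are illustrative assumptions, not the authors' Jaxline training script.

```python
# Sketch of the reported optimizer settings using Optax (assumed wiring only).
import jax
import jax.numpy as jnp
import optax

LEARNING_RATE = 5e-5  # reported value; chosen from {10^-3, ..., 10^-6} on held-out runs
optimizer = optax.sgd(LEARNING_RATE, momentum=0.9, nesterov=True)

# Placeholder parameters: k unit-norm eigenvector estimates of dimension d.
k, d = 8, 784
V = jax.random.normal(jax.random.PRNGKey(0), (k, d))
V = V / jnp.linalg.norm(V, axis=1, keepdims=True)
opt_state = optimizer.init(V)


def apply_step(V, opt_state, utility_grads):
    """One optimizer step; utility_grads would come from the EigenGame utilities."""
    # Optax applies descent-style updates, so negate the (ascent) utility gradients.
    updates, opt_state = optimizer.update(-utility_grads, opt_state, V)
    V = optax.apply_updates(V, updates)
    # Retract back to the unit sphere after each step.
    V = V / jnp.linalg.norm(V, axis=1, keepdims=True)
    return V, opt_state
```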