On Solving Minimax Optimization Locally: A Follow-the-Ridge Approach

Authors: Yuanhao Wang*, Guodong Zhang*, Jimmy Ba

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, FR solves toy minimax problems and improves the convergence of GAN training compared to the recent minimax optimization algorithms.
Researcher Affiliation | Academia | Yuanhao Wang (1), Guodong Zhang (2,3), Jimmy Ba (2,3); (1) IIIS, Tsinghua University, (2) University of Toronto, (3) Vector Institute
Pseudocode | Yes | Algorithm 1: Follow-the-Ridge (FR). Differences from gradient descent-ascent are shown in blue.
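The report does not reproduce Algorithm 1 itself. As a minimal sketch of the FR update rule the paper describes (gradient descent for the leader, gradient ascent for the follower plus a ridge-following Hessian correction), here is an illustration on a hypothetical 2-D quadratic game. The objective f, step size eta, and iteration count are illustrative assumptions, not settings from the paper.

```python
import numpy as np

# Toy minimax problem min_x max_y f(x, y) = x*y - 0.5*y**2.
# Its local minimax point is (0, 0), and d^2f/dy^2 = -1 is non-singular,
# matching the paper's invertible-Hessian assumption.
def grad_x(x, y):
    return y          # df/dx
def grad_y(x, y):
    return x - y      # df/dy
H_yy, H_yx = -1.0, 1.0  # constant second derivatives for this quadratic f

def follow_the_ridge(x, y, eta=0.1, steps=1000):
    for _ in range(steps):
        gx = grad_x(x, y)
        # Leader: plain gradient descent on x.
        x_new = x - eta * gx
        # Follower: gradient ascent on y plus the FR correction term
        # eta * H_yy^{-1} H_yx grad_x f, which keeps y near the ridge.
        y_new = y + eta * grad_y(x, y) + eta * (1.0 / H_yy) * H_yx * gx
        x, y = x_new, y_new
    return x, y

x, y = follow_the_ridge(1.0, 0.5)
print(x, y)  # both iterates approach the local minimax point (0, 0)
```

On this toy problem the FR iteration contracts toward (0, 0); in the paper the same correction term is what removes the rotational behaviour of plain gradient descent-ascent.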
Open Source Code | Yes | Our code is made public at: https://github.com/gd-zhang/Follow-the-Ridge
Open Datasets | Yes | We use the standard MNIST dataset (LeCun et al., 1998)
Dataset Splits | No | No specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) for training, validation, and testing was explicitly provided. For MNIST, the paper states "For each class, we take 4,800 training examples. Overall, we have 9,800 examples." but does not detail how the remaining data might be split for validation or testing.
Hardware Specification | No | No specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running the experiments were mentioned in the paper.
Software Dependencies | No | No specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiments were provided. The paper mentions "RMSprop" and "conjugate gradient" as methods, but not software implementations with version numbers.
Experiment Setup | Yes | To satisfy the non-singular Hessian assumption, we add L2 regularization (0.0002) to the discriminator. For both the generator and the discriminator, we use a 2-hidden-layer MLP with 64 hidden units per layer, where the tanh activation is used. By default, RMSprop (Tieleman and Hinton, 2012) is used in all our experiments, while the learning rate is tuned for GDA... For both the generator and the discriminator, we use learning rate 0.0002. In terms of network architectures, we use a 2-hidden-layer MLP with 512 hidden units in each layer for both the discriminator and the generator. For the discriminator, we use a Sigmoid activation in the output layer. We use RMSprop as our base optimizer in the experiments with batch size 2,000. We run both GDA and FR for 100,000 iterations.
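The MNIST discriminator described above (2 hidden layers of 512 units, sigmoid output, batch size 2,000) can be sketched as a plain NumPy forward pass. This is an assumption-laden illustration, not the authors' implementation: the hidden activation is assumed to be tanh (the paper's quote only specifies tanh for the smaller toy MLPs), and the weight initialization scheme is not given in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    # Small Gaussian init; the paper does not specify an init scheme (assumption).
    return [(rng.standard_normal((m, n)) * 0.01, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def discriminator_forward(params, x):
    h = x
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)  # hidden activation assumed to be tanh
    W, b = params[-1]
    # Sigmoid output layer, as stated for the discriminator.
    return 1.0 / (1.0 + np.exp(-(h @ W + b)))

# Architecture quoted for the MNIST experiment: 784 -> 512 -> 512 -> 1.
disc = init_mlp([784, 512, 512, 1])
batch = rng.standard_normal((2000, 784))  # batch size 2,000, as in the paper
scores = discriminator_forward(disc, batch)  # shape (2000, 1), values in (0, 1)
```

In a full reproduction, these parameters would be trained with RMSprop at learning rate 0.0002 and L2 regularization 0.0002 on the discriminator, per the quoted setup.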