Efficient Riemannian Optimization on the Stiefel Manifold via the Cayley Transform

Authors: Jun Li, Fuxin Li, Sinisa Todorovic

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments for CNN training demonstrate that both algorithms: (a) Use less running time per iteration relative to existing approaches that enforce orthonormality of CNN parameters; and (b) Achieve faster convergence rates than the baseline SGD and ADAM algorithms without compromising the performance of the CNN.
Researcher Affiliation | Academia | Jun Li, Li Fuxin, Sinisa Todorovic, School of EECS, Oregon State University, Corvallis, OR 97331, {liju2,lif,sinisa}@oregonstate.edu
Pseudocode | Yes | Algorithm 1: Cayley SGD with Momentum; Algorithm 2: Cayley ADAM (see the update-step sketch after this table).
Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository.
Open Datasets | Yes | Datasets: We evaluate Cayley SGD or Cayley ADAM in image classification on the CIFAR10 and CIFAR100 datasets (Krizhevsky & Hinton, 2009). CIFAR10 and CIFAR100 consist of 50,000 training images and 10,000 test images, and have 10 and 100 mutually exclusive classes.
Dataset Splits | Yes | Pixel-by-pixel MNIST: ...we select 5,000 out of the 60,000 training examples for the early-stopping validation (see the split sketch after this table).
Hardware Specification | Yes | All algorithms are run on one TITAN Xp GPU.
Software Dependencies | No | The paper does not specify software versions for its implementation (e.g., Python, PyTorch, TensorFlow versions are not listed).
Experiment Setup | Yes | Training Strategies: We use different learning rates l_e and l_st for weights in Euclidean space and on the Stiefel manifold, respectively. We set the weight decay to 0.0005, momentum to 0.9, and minibatch size to 128. The initial learning rates are set as l_e = 0.01 and l_st = 0.2 for Cayley SGD, and l_e = 0.01 and l_st = 0.4 for Cayley ADAM. During training, we reduce the learning rates by a factor of 0.2 at 60, 120, and 160 epochs. The total number of epochs in training is 200. In training, the data samples are normalized using the mean and variance of the training set and augmented by randomly flipping training images. (A configuration sketch follows the table.)
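To make the pseudocode row concrete, below is a minimal NumPy sketch of the kind of update step Algorithm 1 (Cayley SGD with Momentum) describes: a momentum step on the Euclidean gradient, construction of a skew-symmetric matrix W so that W X lies in the tangent space of the Stiefel manifold at X, and a fixed-point iteration that approximates the Cayley transform Y = (I - lr/2 W)^{-1}(I + lr/2 W) X without forming a matrix inverse. This is a sketch of those ingredients, not a faithful transcription of the authors' Algorithm 1; the function name, default arguments, and number of inner iterations are illustrative choices.

```python
import numpy as np

def cayley_sgd_step(X, grad, M, lr=0.2, beta=0.9, inner_iters=2):
    """Illustrative Cayley-transform update with momentum (sketch, not the paper's exact Algorithm 1).

    X    : (n, p) parameter with orthonormal columns, X.T @ X = I.
    grad : (n, p) Euclidean gradient of the loss at X.
    M    : (n, p) momentum buffer.
    """
    # Heavy-ball momentum on the Euclidean gradient.
    M = beta * M - grad

    # Skew-symmetric matrix W built from the momentum; because W is
    # skew-symmetric, W @ X lies in the tangent space at X.
    W_hat = M @ X.T - 0.5 * X @ (X.T @ M @ X.T)
    W = W_hat - W_hat.T

    # Keep the momentum buffer in the tangent space for the next step.
    M = W @ X

    # Fixed-point iteration approximating the Cayley retraction
    #   Y = (I - lr/2 * W)^{-1} (I + lr/2 * W) @ X,
    # so no explicit matrix inverse is ever computed.
    Y = X + lr * M
    for _ in range(inner_iters):
        Y = X + (lr / 2.0) * (W @ (X + Y))
    return Y, M
```

Because the fixed-point iteration is only approximate, one might occasionally re-orthonormalize X (e.g., with np.linalg.qr) to control numerical drift away from the manifold; that housekeeping is omitted here.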
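For the dataset-splits row, the pixel-by-pixel MNIST protocol quoted above (5,000 of the 60,000 training examples held out for early stopping) could be reproduced with a split like the following. torchvision and the fixed seed are assumptions on our part; the paper does not name its data pipeline.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

# Hold out 5,000 of the 60,000 MNIST training examples as a validation set
# for early stopping, as described in the paper.
full_train = datasets.MNIST(root="data", train=True, download=True,
                            transform=transforms.ToTensor())
train_set, val_set = random_split(
    full_train, [55_000, 5_000],
    generator=torch.Generator().manual_seed(0),  # illustrative fixed seed
)
```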
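Finally, here is one way the training strategy in the experiment-setup row could be wired up. PyTorch is an assumption (the software-dependencies row notes that no framework is named), the two-layer model and the dimension-based parameter split are placeholders, and torch.optim.SGD merely stands in for the Cayley optimizers; only the hyperparameter values (l_e = 0.01, l_st = 0.2, momentum 0.9, weight decay 0.0005, batch size 128, decay by 0.2 at epochs 60/120/160, 200 epochs) come from the paper.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import MultiStepLR

# Stand-in model; any nn.Module works for illustrating the optimizer configuration.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Flatten(), nn.Linear(16 * 32 * 32, 10))

# Illustrative split: multi-dimensional weights form the Stiefel-constrained
# group; biases and other vectors stay in the ordinary Euclidean group.
stiefel_params = [p for p in model.parameters() if p.dim() > 1]
euclidean_params = [p for p in model.parameters() if p.dim() <= 1]

# Reported Cayley SGD hyperparameters: l_e = 0.01, l_st = 0.2,
# momentum 0.9, weight decay 0.0005. torch.optim.SGD is only a stand-in;
# the Stiefel group would actually use the Cayley step sketched earlier.
optimizer = torch.optim.SGD(
    [{"params": euclidean_params, "lr": 0.01},
     {"params": stiefel_params, "lr": 0.2}],
    momentum=0.9,
    weight_decay=5e-4,
)

# Decay both learning rates by a factor of 0.2 at epochs 60, 120, and 160;
# train for 200 epochs with minibatches of size 128.
scheduler = MultiStepLR(optimizer, milestones=[60, 120, 160], gamma=0.2)

for epoch in range(200):
    ...  # one training epoch over minibatches of size 128 goes here
    scheduler.step()
```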