Efficient Riemannian Optimization on the Stiefel Manifold via the Cayley Transform
Authors: Jun Li, Fuxin Li, Sinisa Todorovic
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments for CNN training demonstrate that both algorithms: (a) Use less running time per iteration relative to existing approaches that enforce orthonormality of CNN parameters; and (b) Achieve faster convergence rates than the baseline SGD and ADAM algorithms without compromising the performance of the CNN. |
| Researcher Affiliation | Academia | Jun Li, Li Fuxin, Sinisa Todorovic; School of EECS, Oregon State University, Corvallis, OR 97331; {liju2,lif,sinisa}@oregonstate.edu |
| Pseudocode | Yes | Algorithm 1 Cayley SGD with Momentum; Algorithm 2 Cayley ADAM. (A sketch of the Cayley retraction these algorithms build on follows the table.) |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | Datasets: We evaluate Cayley SGD or Cayley ADAM in image classification on the CIFAR10 and CIFAR100 datasets (Krizhevsky & Hinton, 2009). CIFAR10 and CIFAR100 consist of 50,000 training images and 10,000 test images, and have 10 and 100 mutually exclusive classes, respectively. |
| Dataset Splits | Yes | Pixel-by-pixel MNIST: ...we select 5,000 out of the 60,000 training examples for the early stopping validation. |
| Hardware Specification | Yes | All algorithms are run on one TITAN Xp GPU. |
| Software Dependencies | No | The paper does not specify software versions for its implementation (e.g., Python, PyTorch, TensorFlow versions are not listed). |
| Experiment Setup | Yes | Training Strategies: We use different learning rates l_e and l_st for weights in the Euclidean space and on the Stiefel manifold, respectively. We set the weight decay as 0.0005, momentum as 0.9, and minibatch size as 128. The initial learning rates are set as l_e = 0.01 and l_st = 0.2 for Cayley SGD, and l_e = 0.01 and l_st = 0.4 for Cayley ADAM. During training, we reduce the learning rates by a factor of 0.2 at 60, 120, and 160 epochs. The total number of epochs in training is 200. In training, the data samples are normalized using the mean and variance of the training set, and augmented by randomly flipping training images. (A configuration sketch summarizing these values follows the table.) |
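
The paper's Algorithm 1 (Cayley SGD with Momentum) is not reproduced here, but the update it builds on is the Cayley-transform retraction on the Stiefel manifold. Below is a minimal NumPy sketch of one such momentum update, assuming the standard closed-form Cayley transform (a matrix solve); the paper itself replaces this solve with a cheap fixed-point approximation, and the function name and sign conventions here are illustrative rather than the authors' exact formulation.

```python
import numpy as np

def cayley_sgd_step(X, G, M, lr=0.2, momentum=0.9):
    """One hypothetical Cayley-SGD-with-momentum update on the Stiefel manifold.

    X : (n, p) matrix with orthonormal columns (X.T @ X = I_p).
    G : Euclidean gradient of the loss w.r.t. X, same shape as X.
    M : momentum buffer, same shape as X.
    Returns the updated (X, M).
    """
    # Heavy-ball momentum accumulated in the ambient (Euclidean) space.
    M = momentum * M + G

    # Build a skew-symmetric matrix W from the momentum so that the Cayley
    # transform of W maps X to another point with orthonormal columns.
    W_hat = M @ X.T - 0.5 * X @ (X.T @ M @ X.T)
    W = W_hat - W_hat.T                      # skew-symmetric: W.T = -W

    # Descent form of the Cayley retraction:
    #   X_new = (I + lr/2 W)^{-1} (I - lr/2 W) X,
    # whose first-order direction is -W X, i.e. minus the tangent projection of M.
    n = X.shape[0]
    I = np.eye(n)
    X_new = np.linalg.solve(I + 0.5 * lr * W, (I - 0.5 * lr * W) @ X)
    return X_new, M

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, _ = np.linalg.qr(rng.standard_normal((8, 3)))   # random point on St(8, 3)
    G = rng.standard_normal(X.shape)                   # placeholder gradient
    M = np.zeros_like(X)
    X, M = cayley_sgd_step(X, G, M)
    print(np.allclose(X.T @ X, np.eye(3)))             # columns stay orthonormal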
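
For reference, the training strategy quoted under Experiment Setup can be collected into a small configuration sketch. The names below are hypothetical (the paper releases no code); only the numeric values come from the quoted text.

```python
# Hyperparameters as reported in the paper; key names are my own.
config = {
    "epochs": 200,
    "batch_size": 128,
    "weight_decay": 5e-4,
    "momentum": 0.9,
    "lr_euclidean": 0.01,                                   # l_e
    "lr_stiefel": {"cayley_sgd": 0.2, "cayley_adam": 0.4},  # l_st
    "lr_decay_factor": 0.2,
    "lr_decay_epochs": (60, 120, 160),
}

def lr_at_epoch(base_lr, epoch, cfg=config):
    """Step schedule: multiply the base rate by 0.2 at epochs 60, 120, and 160."""
    n_drops = sum(epoch >= e for e in cfg["lr_decay_epochs"])
    return base_lr * cfg["lr_decay_factor"] ** n_drops

# Example: the Stiefel learning rate for Cayley SGD at epoch 130
# has been dropped twice, so 0.2 * 0.2**2 = 0.008.
print(lr_at_epoch(config["lr_stiefel"]["cayley_sgd"], 130))
```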