Scaling Convex Neural Networks with Burer-Monteiro Factorization
Authors: Arda Sahiner, Tolga Ergen, Batu Ozturkler, John M. Pauly, Morteza Mardani, Mert Pilanci
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Our experiments with image classification task indicate that this BM factorization enables layerwise training of convex CNNs, which allows for convex networks for the first time to match the performance of multi-layer end-to-end trained non-convex CNNs." (see also Section 4, EXPERIMENTAL RESULTS) |
| Researcher Affiliation | Collaboration | Arda Sahiner (Arcus Inc.; Stanford University), Tolga Ergen (LG AI Research), Batu Ozturkler (Stanford University), John Pauly (Stanford University), Morteza Mardani (NVIDIA Corporation), Mert Pilanci (Stanford University) |
| Pseudocode | No | The paper describes methods and processes through mathematical formulations and prose, but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any concrete access to source code for the methodology described, such as a specific repository link, an explicit code release statement, or code in supplementary materials. |
| Open Datasets | Yes | We apply this procedure to the CIFAR-10 (Krizhevsky et al., 2009) and Fashion-MNIST (Xiao et al., 2017) datasets |
| Dataset Splits | No | The paper mentions 'test accuracy' and refers to 'training' on CIFAR-10 and Fashion-MNIST, but does not explicitly provide specific dataset split information (exact percentages, sample counts, or citations to predefined splits) needed to reproduce the data partitioning. |
| Hardware Specification | Yes | Our layerwise training procedure was trained on a single NVIDIA 1080 Ti GPU |
| Software Dependencies | No | The paper mentions 'Pytorch (Paszke et al., 2019)' but does not provide specific version numbers for PyTorch or any other ancillary software components used in the experiments. |
| Experiment Setup | Yes | In our experiments, we keep all network and optimization parameters the same, aside from replacing the non-convex CNN at each stage with our convex CNN objective (23). We then apply the Burer-Monteiro factorization with m ∈ [1, 2, 4] to this architecture to make it tractable for layerwise learning as described in the main paper. At each stage, we randomly subsample P̂ = 256 hyperplane arrangements. We further use gated ReLU rather than ReLU activations for simplicity, which can work as well as ReLU in practice (Fiat et al., 2019). ...a batch size of 128, weight decay parameter of β = 5e-4, along with stochastic gradient descent (SGD) with momentum fixed to 0.9, 50 epochs per stage, and learning rate decay by a factor of 0.2 every 15 epochs. ...The chosen learning rates were [10^-1, 10^-2, 10^-3, 10^-2, 10^-2] for CIFAR-10. For Fashion-MNIST, we empirically observed the training loss was better optimized with slightly higher learning rates, so we used [2×10^-1, 5×10^-2, 5×10^-3]. (A hedged sketch of these optimization settings follows the table.) |
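
For concreteness, below is a minimal PyTorch sketch of the per-stage optimization settings quoted in the Experiment Setup row (batch size 128, SGD with momentum 0.9, weight decay 5e-4, 50 epochs per stage, learning rate decayed by 0.2 every 15 epochs, stage-wise learning rates for CIFAR-10). It is not the authors' released code: `build_stage_model` is a placeholder for one convex-CNN stage, the convex objective (23), Burer-Monteiro factorization, hyperplane-arrangement subsampling, gated ReLU activations, and the handoff of features between stages are not reproduced, and the `CrossEntropyLoss` choice is an assumption.

```python
# Hedged sketch of the reported layerwise training settings; the convex
# (Burer-Monteiro factorized) stage model itself is only a placeholder.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Hyperparameters quoted in the Experiment Setup row (CIFAR-10).
STAGE_LRS_CIFAR10 = [1e-1, 1e-2, 1e-3, 1e-2, 1e-2]  # one LR per stage
EPOCHS_PER_STAGE = 50
BATCH_SIZE = 128
WEIGHT_DECAY = 5e-4
MOMENTUM = 0.9
LR_DECAY_FACTOR = 0.2
LR_DECAY_EVERY = 15  # epochs


def build_stage_model(stage: int) -> nn.Module:
    """Placeholder for one convex-CNN stage; the paper's objective (23) with
    Burer-Monteiro factorization and gated ReLU is not reconstructed here."""
    return nn.Sequential(nn.Flatten(), nn.LazyLinear(10))


train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=transforms.ToTensor())
train_loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)

criterion = nn.CrossEntropyLoss()  # assumed loss, not stated in the quote

for stage, lr in enumerate(STAGE_LRS_CIFAR10):
    model = build_stage_model(stage)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                momentum=MOMENTUM, weight_decay=WEIGHT_DECAY)
    scheduler = torch.optim.lr_scheduler.StepLR(
        optimizer, step_size=LR_DECAY_EVERY, gamma=LR_DECAY_FACTOR)

    for epoch in range(EPOCHS_PER_STAGE):
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        scheduler.step()
    # In the paper's layerwise procedure, the trained stage would feed the
    # next stage's inputs; that step is omitted in this sketch.
```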