Scaling-up Diverse Orthogonal Convolutional Networks by a Paraunitary Framework
Authors: Jiahao Su, Wonmin Byeon, Furong Huang
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the experiments, we achieve the following goals. (1) We demonstrate in Section 6.1 that our separable complete factorization (SC-Fac) achieves precise orthogonality (up to machine precision), resulting in more accurate orthogonal designs than previous ones (Sedghi et al., 2019; Li et al., 2019b; Trockman & Kolter, 2021). (2) Despite the differences in preciseness, we show in Section 6.2 that different realizations of paraunitary systems only have a minor impact on the adversarial robustness of Lipschitz networks. (3) Due to the versatility of our convolutional layers and architectures, in Section 6.3, we explore the best strategy to scale Lipschitz networks to wider/deeper architectures. (4) In Appendix F, we further demonstrate a successful application of orthogonal convolutions in residual flows (Chen et al., 2019). Training details are provided in Appendix E.1. |
| Researcher Affiliation | Collaboration | 1 University of Maryland, College Park, MD USA 2 NVIDIA Research, NVIDIA Corporation, Santa Clara, CA USA |
| Pseudocode | Yes | We include the pseudo-code for separable complete factorization (Section 2) in Algorithm 1 and diverse orthogonal convolutions (Section 3) in Algorithm 2. The pseudo-code in Algorithm 1 consists of three parts: (1) First, we obtain orthogonal matrices from skew-symmetric matrices using the matrix exponential; we use the GeoTorch library (Lezcano Casado, 2019) for the function matrix_exp in our implementation; (2) Subsequently, we construct two 1D paraunitary systems using these orthogonal matrices; (3) Lastly, we compose the two 1D paraunitary systems to obtain one 2D paraunitary system. The pseudo-code in Algorithm 2 consists of two parts: (1) first, we reshape each paraunitary system into an orthogonal convolution depending on the stride; and (2) second, we concatenate the orthogonal kernels for different groups and return the output. (A hedged sketch of these steps appears as the first code block after this table.) |
| Open Source Code | Yes | Our code will be publicly available at https://github.com/umd-huang-lab/ortho-conv. |
| Open Datasets | Yes | We use the CIFAR-10 dataset for all our experiments. We normalize all input images to [0, 1] followed by standard augmentation, including random cropping and horizontal flipping. We use the Adam optimizer with a maximum learning rate of 10⁻² coupled with a piece-wise triangular learning rate scheduler. We initialize all our SC-Fac layers as permutation matrices: (1) we select the number of columns for each pair U^(ℓ), U^(−ℓ) uniformly from {1, ..., T} at initialization (the number is fixed during training); (2) for ℓ > 0, we sample the entries in U^(ℓ) uniformly with respect to the Haar measure; (3) for ℓ < 0, we set U^(ℓ) = QU^(−ℓ) according to Proposition D.1. |
| Dataset Splits | No | The paper states it uses CIFAR-10 and MNIST datasets, which have standard splits, but it does not explicitly specify the training/validation/test splits (e.g., percentages or sample counts) used for reproducibility. |
| Hardware Specification | Yes | Missing numbers in Figure 4 and Table 7 (Appendix E) are due to the large memory requirement (on Tesla V100 32G). |
| Software Dependencies | No | The paper mentions using the 'GeoTorch library (Lezcano Casado, 2019)' but does not provide specific version numbers for this library or any other software dependencies (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | We use the Adam optimizer with a maximum learning rate of 10⁻² coupled with a piece-wise triangular learning rate scheduler. We initialize all our SC-Fac layers as permutation matrices: (1) we select the number of columns for each pair U^(ℓ), U^(−ℓ) uniformly from {1, ..., T} at initialization (the number is fixed during training); (2) for ℓ > 0, we sample the entries in U^(ℓ) uniformly with respect to the Haar measure; (3) for ℓ < 0, we set U^(ℓ) = QU^(−ℓ) according to Proposition D.1. For each model, we perform a grid search on different margins ε₀ ∈ {1×10⁻³, 2×10⁻³, 5×10⁻³, 1×10⁻², 2×10⁻², 5×10⁻², 0.1, 0.2, 0.5} and report the best performance in terms of robust accuracy. (A hedged sketch of the learning-rate schedule appears as the second code block after this table.) |
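
The Pseudocode row above outlines the three steps of Algorithm 1 (SC-Fac) and the reshaping/concatenation steps of Algorithm 2. The following is a minimal, hypothetical PyTorch sketch of those steps, not the authors' released code: an orthogonal matrix is produced from a skew-symmetric parameterization via `torch.matrix_exp` (the GeoTorch-style trick the paper cites), a 1D paraunitary FIR system is built from degree-one projection factors of the assumed form P + z⁻¹(I − P), and two 1D systems are composed into a separable 2D kernel. The function names, the factor form, and the dimensions are illustrative assumptions.

```python
import torch

def orthogonal_from_skew(A):
    """Map an unconstrained square matrix to an orthogonal matrix via the
    matrix exponential of its skew-symmetric part (GeoTorch-style)."""
    skew = A - A.transpose(-1, -2)
    return torch.matrix_exp(skew)

def paraunitary_1d(Q, projections):
    """Compose an orthogonal matrix Q with degree-one paraunitary factors
    P + z^{-1} (I - P); returns FIR taps of shape (num_taps, C, C)."""
    C = Q.shape[0]
    eye = torch.eye(C, dtype=Q.dtype)
    taps = Q.unsqueeze(0)                      # single-tap system H(z) = Q
    for P in projections:
        T = taps.shape[0]
        new_taps = torch.zeros(T + 1, C, C, dtype=Q.dtype)
        new_taps[:T] += taps @ P               # z^0 part of the factor
        new_taps[1:] += taps @ (eye - P)       # z^{-1} part of the factor
        taps = new_taps
    return taps

def separable_2d(taps_h, taps_w):
    """Separable 2D system H(z1, z2) = H1(z1) H2(z2): kernel[i, j] = H1[i] @ H2[j]."""
    return torch.einsum('iab,jbc->ijac', taps_h, taps_w)

# Example: a 3x3 orthogonal convolution kernel over C = 4 channels.
C = 4
Q = orthogonal_from_skew(torch.randn(C, C, dtype=torch.float64))
projections = []
for _ in range(2):                             # two factors -> 3 taps per dimension
    U = orthogonal_from_skew(torch.randn(C, C, dtype=torch.float64))[:, :2]
    projections.append(U @ U.T)                # rank-2 orthogonal projection
taps = paraunitary_1d(Q, projections)          # shape (3, C, C)
kernel_2d = separable_2d(taps, taps)           # shape (3, 3, C, C)

# Sanity check (zero-lag paraunitary condition): sum_t K[t] K[t]^T should be I.
print(torch.dist(torch.einsum('iab,icb->ac', taps, taps),
                 torch.eye(C, dtype=torch.float64)))
```

The printed distance should be near machine precision for float64, which is consistent with the row's claim of orthogonality up to machine precision; note that the zero-lag identity is only a necessary condition for paraunitarity, included here as a quick check.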
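
The Experiment Setup row reports Adam with a maximum learning rate of 10⁻² and a piece-wise triangular learning-rate scheduler. Below is a hypothetical sketch of such a schedule using `torch.optim.lr_scheduler.LambdaLR`; the model, the total step count, and the loop body are placeholders and are not taken from the paper.

```python
import torch

model = torch.nn.Linear(32, 10)                            # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)  # peak LR from the paper

total_steps = 10_000                                       # assumed; not from the paper

def triangular(step):
    """Linear warm-up to the peak LR at mid-training, then linear decay to zero."""
    half = total_steps / 2
    return step / half if step < half else max(0.0, (total_steps - step) / half)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=triangular)

for step in range(total_steps):
    # ... forward/backward pass on a CIFAR-10 batch would go here ...
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```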