Reparameterization through Spatial Gradient Scaling
Authors: Alexander Detkov, Mohammad Salameh, Muhammad Fetrat Qharabagh, Jialin Zhang, Robin Luwei, Shangling Jui, Di Niu
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on CIFAR-10, CIFAR-100, and ImageNet show that without searching for reparameterized structures, our proposed scaling method outperforms the state-of-the-art reparameterization strategies at a lower computational cost. The code is available at https://github.com/Ascend-Research/Reparameterization. |
| Researcher Affiliation | Collaboration | Alexander Detkov¹, Mohammad Salameh², Muhammad Fetrat Qharabagh¹*, Jialin Zhang³, Wei Lui², Shangling Jui³, Di Niu¹ (¹University of Alberta, ²Huawei Technologies, ³Huawei Kirin Solutions) |
| Pseudocode | Yes | An overview of the SGS framework is given as pseudo-code in Appendix A.5, and details can be found in the corresponding open-source code. |
| Open Source Code | Yes | The code is available at https://github.com/Ascend-Research/Reparameterization. |
| Open Datasets | Yes | Experiments on CIFAR-10, CIFAR-100, and ImageNet show that without searching for reparameterized structures, our proposed scaling method outperforms the state-of-the-art reparameterization strategies at a lower computational cost. |
| Dataset Splits | Yes | We search for k on CIFAR100 and use the optimal for experiments on CIFAR10 and ImageNet. We perform a grid search on CIFAR100 and VGG-16 over k ∈ {2, 3, 4, 5, 6, 7} using 20% of the training set for validation. |
| Hardware Specification | Yes | Training is done on a single NVIDIA Tesla V100 GPU. ... on 8 NVIDIA Tesla V100 GPUs. |
| Software Dependencies | No | The paper mentions 'PyTorch defaults' for optimizer settings but does not specify version numbers for PyTorch or any other software libraries or dependencies used in the experiments. |
| Experiment Setup | Yes | We train VGG-16 on CIFAR-{10,100} for 600 epochs with a batch size of 128, cosine annealing scheduler with an initial learning rate of 0.1, and SGD optimizer with momentum 0.9 and weight decay 1e-4. We update our spatial gradient scalings every 30 epochs using 20 random batches from the training set. We add a 1 epoch warm-up period at the start of training before generating our gradient scalings. |
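The Dataset Splits row above quotes a 20% hold-out of the CIFAR-100 training set used to validate the grid search over the kernel size k. The following is a minimal sketch of such a split with PyTorch/torchvision; the random seed and the exact splitting mechanism are assumptions, since the paper does not publish the split code, and `k_grid` simply lists the values reported as searched.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

# CIFAR-100 training set; 20% is held out for validating the grid search over k.
transform = transforms.ToTensor()
full_train = datasets.CIFAR100(root="./data", train=True, download=True,
                               transform=transform)

val_size = int(0.2 * len(full_train))          # 10,000 of 50,000 images
train_size = len(full_train) - val_size
train_set, val_set = random_split(
    full_train, [train_size, val_size],
    generator=torch.Generator().manual_seed(0))  # fixed seed is an assumption

k_grid = [2, 3, 4, 5, 6, 7]  # kernel sizes searched on CIFAR-100 with VGG-16
```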
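The Experiment Setup row above lists the reported hyperparameters for VGG-16 on CIFAR-{10,100}. The sketch below reconstructs that training schedule under stated assumptions: torchvision's `vgg16` stands in for the paper's CIFAR variant, the data augmentation is the standard CIFAR recipe (not given in the quote), and the spatial gradient scaling update itself is left as a placeholder comment, since that procedure is the paper's own method and lives in the linked repository.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

# Hyperparameters reported for VGG-16 on CIFAR-{10,100}.
EPOCHS, BATCH_SIZE = 600, 128
LR, MOMENTUM, WEIGHT_DECAY = 0.1, 0.9, 1e-4
SCALING_UPDATE_PERIOD, WARMUP_EPOCHS = 30, 1   # SGS update schedule from the paper

device = "cuda" if torch.cuda.is_available() else "cpu"

transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),      # standard CIFAR augmentation (assumed)
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=transform)
train_loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True,
                          num_workers=4)

# torchvision's VGG-16 as a stand-in; the paper uses its own CIFAR variant.
model = models.vgg16(num_classes=10).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR, momentum=MOMENTUM,
                            weight_decay=WEIGHT_DECAY)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)

for epoch in range(EPOCHS):
    if epoch >= WARMUP_EPOCHS and (epoch - WARMUP_EPOCHS) % SCALING_UPDATE_PERIOD == 0:
        # Placeholder: here the SGS framework would recompute its spatial
        # gradient scalings from 20 random training batches (see the paper's
        # open-source code for the actual procedure).
        pass
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```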