Reparameterization through Spatial Gradient Scaling

Authors: Alexander Detkov, Mohammad Salameh, Muhammad Fetrat, Jialin Zhang, Robin Luwei, Shangling Jui, Di Niu

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on CIFAR-10, CIFAR-100, and ImageNet show that without searching for reparameterized structures, our proposed scaling method outperforms the state-of-the-art reparameterization strategies at a lower computational cost. The code is available at https://github.com/Ascend-Research/Reparameterization.
Researcher Affiliation | Collaboration | Alexander Detkov (1), Mohammad Salameh (2), Muhammad Fetrat Qharabagh (1,*), Jialin Zhang (3), Wei Lui (2), Shangling Jui (3), Di Niu (1); (1) University of Alberta, (2) Huawei Technologies, (3) Huawei Kirin Solutions
Pseudocode | Yes | An overview of the SGS framework is given as pseudo-code in Appendix A.5, and details can be found in the corresponding open-source code.
Open Source Code | Yes | The code is available at https://github.com/Ascend-Research/Reparameterization.
Open Datasets | Yes | Experiments on CIFAR-10, CIFAR-100, and ImageNet show that without searching for reparameterized structures, our proposed scaling method outperforms the state-of-the-art reparameterization strategies at a lower computational cost.
Dataset Splits | Yes | We search for k on CIFAR-100 and use the optimal k for experiments on CIFAR-10 and ImageNet. We perform a grid search on CIFAR-100 and VGG-16 over k ∈ {2, 3, 4, 5, 6, 7} using 20% of the training set for validation.
Hardware Specification | Yes | Training is done on a single NVIDIA Tesla V100 GPU. ... on 8 NVIDIA Tesla V100 GPUs.
Software Dependencies | No | The paper mentions 'PyTorch defaults' for optimizer settings but does not specify version numbers for PyTorch or any other software libraries or dependencies used in the experiments.
Experiment Setup | Yes | We train VGG-16 on CIFAR-{10,100} for 600 epochs with a batch size of 128, cosine annealing scheduler with an initial learning rate of 0.1, and SGD optimizer with momentum 0.9 and weight decay 1e-4. We update our spatial gradient scalings every 30 epochs using 20 random batches from the training set. We add a 1 epoch warm-up period at the start of training before generating our gradient scalings.
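
The paper defers the exact SGS procedure to its Appendix A.5 pseudo-code and the released repository (see the Pseudocode and Open Source Code rows above). Purely as an illustration of the general idea, and assuming that spatial gradient scaling amounts to an elementwise, per-kernel-position rescaling of a convolution's weight gradient, a minimal PyTorch sketch could look like the following; the function name and the scaling values are hypothetical and not taken from the paper or its code.

```python
import torch
import torch.nn as nn

def attach_spatial_gradient_scaling(conv: nn.Conv2d, scaling: torch.Tensor) -> None:
    """Rescale the gradient of a conv kernel per spatial position.

    `scaling` is assumed to have shape (kH, kW) and is broadcast over the
    (out_channels, in_channels, kH, kW) weight gradient. Hypothetical
    stand-in for the SGS update, not the authors' implementation.
    """
    assert scaling.shape == conv.weight.shape[-2:], "scaling must match the kernel size"

    def scale_grad(grad: torch.Tensor) -> torch.Tensor:
        # Elementwise rescaling of the spatial gradient during backprop.
        return grad * scaling.to(grad.device, grad.dtype)

    conv.weight.register_hook(scale_grad)

# Example: emphasize the kernel center of a 3x3 convolution (illustrative values).
conv = nn.Conv2d(16, 32, kernel_size=3, padding=1)
center_heavy = torch.tensor([[0.5, 0.5, 0.5],
                             [0.5, 2.0, 0.5],
                             [0.5, 0.5, 0.5]])
attach_spatial_gradient_scaling(conv, center_heavy)
```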
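
The Dataset Splits row quotes a grid search for the hyperparameter k over {2, ..., 7} on CIFAR-100, with 20% of the training set held out for validation. A minimal sketch of that split and search loop, assuming a hypothetical helper train_and_validate that is not part of the released code:

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

# Hold out 20% of the CIFAR-100 training set to validate the grid search over k.
train_full = datasets.CIFAR100(root="./data", train=True, download=True,
                               transform=transforms.ToTensor())
n_val = int(0.2 * len(train_full))
train_set, val_set = random_split(
    train_full, [len(train_full) - n_val, n_val],
    generator=torch.Generator().manual_seed(0))

best_k, best_acc = None, 0.0
for k in [2, 3, 4, 5, 6, 7]:
    # train_and_validate is a hypothetical helper standing in for a full
    # VGG-16 + SGS training run followed by evaluation on the held-out split.
    acc = train_and_validate(k, train_set, val_set)
    if acc > best_acc:
        best_k, best_acc = k, acc
print(f"selected k = {best_k} (validation accuracy {best_acc:.2%})")
```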
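
The Experiment Setup row lists the training recipe: SGD with momentum 0.9 and weight decay 1e-4, an initial learning rate of 0.1 under cosine annealing, 600 epochs at batch size 128, and spatial gradient scalings regenerated every 30 epochs from 20 random training batches after a 1-epoch warm-up. A minimal sketch of that schedule, using the stock torchvision VGG-16 as a stand-in for the paper's CIFAR variant and a hypothetical update_spatial_gradient_scalings in place of the authors' SGS routine:

```python
import torch
from itertools import islice
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from torchvision.models import vgg16

EPOCHS, BATCH_SIZE, SCALING_PERIOD, WARMUP_EPOCHS, SCALING_BATCHES = 600, 128, 30, 1, 20

# Stand-ins for the paper's CIFAR-100 pipeline and VGG-16 variant.
train_set = datasets.CIFAR100(root="./data", train=True, download=True,
                              transform=transforms.ToTensor())
train_loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)
model = vgg16(num_classes=100)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(EPOCHS):
    # After a 1-epoch warm-up, refresh the spatial gradient scalings every
    # 30 epochs from 20 random training batches.
    if epoch >= WARMUP_EPOCHS and (epoch - WARMUP_EPOCHS) % SCALING_PERIOD == 0:
        batches = list(islice(iter(train_loader), SCALING_BATCHES))
        update_spatial_gradient_scalings(model, batches)  # hypothetical, not the released API

    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```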