Principled Architecture-aware Scaling of Hyperparameters

Authors: Wuyang Chen, Junru Wu, Zhangyang Wang, Boris Hanin

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We verify our principles with comprehensive experiments. More importantly, our strategy further sheds light on advancing current benchmarks for architecture design. A fair comparison of AutoML algorithms requires accurate network rankings. However, we demonstrate that network rankings can be easily changed by better training networks in benchmarks with our architecture-aware learning rates and initialization.
Researcher Affiliation | Collaboration | University of Texas at Austin; Google Research; Princeton University
Pseudocode | No | The paper contains mathematical derivations and definitions (e.g., Section 3.1, Appendix A, B, C) but no explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/VITA-Group/principled_scaling_lr_init.
Open Datasets | Yes | We study our principles on MLPs, CNNs, and networks with advanced architectures from NAS-Bench-201 (Dong & Yang, 2020). All our experiments are repeated for three random runs. We also include more results in the Appendix (E.1 for ImageNet, E.2 for the GeLU activation).
Dataset Splits | No | The paper mentions using NAS-Bench-201 and adopting its training protocol, which implies predefined splits. However, it does not explicitly state the specific dataset split percentages or sample counts used for training, validation, or testing within the paper's main text for its own experiments.
Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU or CPU models, or cloud computing specifications beyond general mentions.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x).
Experiment Setup | Yes | We adapt our principles as follows: 1. Find the base maximal learning rate of the base architecture with the smallest depth (L = 1 in our case: input → hidden → output): conduct a grid search over a range of learning rates, and find the maximal learning rate which achieves the smallest training loss at the end of one epoch... We randomly sample architectures from NAS-Bench-201, and adopt the same training protocol as Dong & Yang (2020) (batch size, warm-up, learning rate decay, etc.).
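
To make the quoted setup concrete, the sketch below illustrates step 1 only: the grid search for the base maximal learning rate of the shallowest base architecture (L = 1, input → hidden → output). It is a minimal illustration assuming a PyTorch workflow, not the authors' released code: the toy MLP widths, the SGD optimizer, the learning-rate grid, the tolerance, and the `loader` argument are all assumptions made for this example; only the selection rule (train for one epoch at each candidate rate and keep the largest rate whose final training loss matches the smallest observed) follows the quoted protocol.

```python
import torch
import torch.nn as nn

def one_epoch_loss(lr, loader, in_dim=3 * 32 * 32, hidden=256, num_classes=10):
    """Train the shallowest base MLP (input -> hidden -> output) for one epoch
    with SGD at learning rate `lr`; return the training loss at the end."""
    torch.manual_seed(0)  # identical initialization for every candidate rate
    model = nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, num_classes),
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    loss = torch.tensor(float("inf"))
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x.flatten(1)), y)  # flatten images to vectors
        loss.backward()
        optimizer.step()
    return loss.item()

def base_maximal_lr(loader, lr_grid=tuple(2.0 ** -k for k in range(12)), tol=1e-3):
    """Among candidate rates whose end-of-epoch loss is (within `tol`) the
    smallest observed, return the largest -- the 'base maximal learning rate'."""
    losses = {lr: one_epoch_loss(lr, loader) for lr in lr_grid}
    best = min(losses.values())
    return max(lr for lr, l in losses.items() if l <= best + tol)

# Usage (train_loader is any torch DataLoader of image batches and labels):
# base_lr = base_maximal_lr(train_loader)
```

The later steps of the quoted protocol (scaling this base rate and the initialization to the deeper architectures sampled from NAS-Bench-201) are elided by the "..." in the excerpt above and are not reproduced here.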