AUTOMATA: Gradient Based Data Subset Selection for Compute-Efficient Hyper-parameter Tuning

Authors: Krishnateja Killamsetty, Guttu Sai Abhishek, Aakriti Lnu, Ganesh Ramakrishnan, Alexandre Evfimievski, Lucian Popa, Rishabh Iyer

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically evaluate the effectiveness of AUTOMATA in hyper-parameter tuning through several experiments on real-world datasets in the text, vision, and tabular domains. Our experiments show that using gradient-based data subsets for hyper-parameter tuning achieves significantly faster turnaround times and speedups of 3x-30x while achieving comparable performance to the hyper-parameters found using the entire dataset.
Researcher Affiliation | Collaboration | 1: The University of Texas at Dallas, USA; 2: Indian Institute of Technology Bombay, India; 3: IBM Research, USA
Pseudocode | Yes | Detailed pseudocode of the AUTOMATA algorithm is provided in Appendix C due to space constraints in the main paper.
Open Source Code | No | The paper states that it uses existing open-source libraries such as PyTorch, Ray-tune, and CORDS. However, it does not explicitly state that the source code for the AUTOMATA framework itself is released, nor does it provide a link to an implementation.
Open Datasets | Yes | Text datasets include SST2 [47], SST5 [47], glue-SST2 [51], and TREC6 [35, 18]. Image datasets include CIFAR10 [28], CIFAR100 [28], and Street View House Numbers (SVHN) [39]. Tabular datasets include DNA, SATIMAGE, LETTER, and CONNECT-4 from LIBSVM, a library for Support Vector Machines (SVMs) [7]. (A brief loading sketch for two of the image datasets is shown after this table.)
Dataset Splits | Yes | We give more details on dataset sizes and splits in Appendix G.2.
Hardware Specification | Yes | A single training run using a relatively simple model class of Residual Networks [16] for 300 epochs on a V100 GPU takes around 4 hours.
Software Dependencies | No | The paper mentions using "the popular deep learning framework [41]" (referring to PyTorch), Ray-tune [36], and CORDS [22]. However, it does not specify version numbers for these software dependencies, which reproducibility requires. (A sketch of recording exact versions follows the table.)
Experiment Setup | Yes | For text datasets, we train the LSTM model for 20 epochs while choosing subsets (except for FULL) every 5 epochs. The hyper-parameter space includes the learning rate, the hidden size and number of layers of the LSTM, and the batch size for training. Some experiments (with TPE as the search algorithm) use 27 configurations in the hyper-parameter space, while others use 54. More details on the hyper-parameter search space for text datasets are given in Appendix G.4.1. For image datasets, we train the ResNet [16] model for 300 epochs while choosing subsets (except for FULL) every 20 epochs, i.e., R = 20. We use a Stochastic Gradient Descent (SGD) optimizer with momentum set to 0.9 and weight decay factor set to 0.0005. The hyper-parameter search space consists of a choice between the Momentum method and the Nesterov Accelerated Gradient method, a choice of learning rate scheduler and its corresponding parameters, and four different group-wise learning rates. We use 27 configurations in the hyper-parameter space for image datasets. More details on the hyper-parameter search space for image datasets are given in Appendix G.4.2. For tabular datasets, we train a multi-layer perceptron with 2 hidden layers for 200 epochs while choosing subsets every 10 epochs. The hyper-parameter search space consists of a choice between the SGD optimizer and the Adam optimizer, the learning rate, the learning rate scheduler, the sizes of the two hidden layers, and the batch size for training. We use 27 configurations in the hyper-parameter space for tabular datasets. More details on the hyper-parameter search space for tabular datasets are provided in Appendix G.4.3. (A hedged Ray Tune sketch of the image search space is shown after this table.)
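As referenced in the Open Datasets row, the image datasets cited there are publicly downloadable. The snippet below is a hedged illustration using torchvision for CIFAR10 and SVHN; the paper does not mandate this particular loader, and the root directory is an arbitrary choice.

```python
# Minimal sketch: fetch two of the public image datasets used in the paper.
# torchvision downloads them on first use; "./data" is an illustrative path.
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()
cifar10_train = datasets.CIFAR10(root="./data", train=True, download=True, transform=to_tensor)
cifar10_test = datasets.CIFAR10(root="./data", train=False, download=True, transform=to_tensor)
svhn_train = datasets.SVHN(root="./data", split="train", download=True, transform=to_tensor)
```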
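The Software Dependencies row notes that library versions are not pinned. A minimal sketch of one way to record them is shown below; the package list (torch, ray, cords) is an assumption about the stack named in that row, not something prescribed by the paper.

```python
# Hedged sketch: write pinned "package==version" lines next to experiment
# outputs so the environment can be reconstructed later.
from importlib import metadata

# Assumed package names; adjust to the actual environment.
PACKAGES = ["torch", "ray", "cords"]


def dump_versions(path="environment_versions.txt"):
    """Record the installed version of each package, one per line."""
    with open(path, "w") as f:
        for pkg in PACKAGES:
            try:
                f.write(f"{pkg}=={metadata.version(pkg)}\n")
            except metadata.PackageNotFoundError:
                f.write(f"{pkg}==<not installed>\n")


if __name__ == "__main__":
    dump_versions()
```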
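The Experiment Setup row describes the image-dataset search space (Momentum vs. Nesterov, learning rate scheduler and its parameters, four group-wise learning rates, 27 configurations). Below is a minimal sketch of how such a space could be declared with Ray Tune; it is not the authors' released code, and the concrete value lists and parameter names are illustrative assumptions (the exact grids are in the paper's Appendix G.4.2).

```python
# Hedged sketch of the image-dataset hyper-parameter space with Ray Tune.
from ray import tune

image_search_space = {
    # SGD variant: plain Momentum vs. Nesterov Accelerated Gradient
    # (momentum = 0.9 and weight decay = 0.0005 are fixed, per the paper).
    "nesterov": tune.choice([False, True]),
    # Learning-rate scheduler and an illustrative scheduler parameter.
    "lr_scheduler": tune.choice(["cosine", "step"]),
    "step_gamma": tune.choice([0.1, 0.5]),  # only read when lr_scheduler == "step"
    # Four group-wise learning rates for the ResNet parameter groups.
    "lr_group1": tune.choice([0.1, 0.05, 0.01]),
    "lr_group2": tune.choice([0.1, 0.05, 0.01]),
    "lr_group3": tune.choice([0.1, 0.05, 0.01]),
    "lr_group4": tune.choice([0.1, 0.05, 0.01]),
}


def train_resnet(config):
    """Stub trainable: the real one trains a ResNet for 300 epochs on a
    gradient-selected data subset refreshed every R = 20 epochs."""
    ...


# Sampling 27 configurations, matching the count reported for image datasets:
# tuner = tune.Tuner(train_resnet, param_space=image_search_space,
#                    tune_config=tune.TuneConfig(num_samples=27))
# results = tuner.fit()
```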