Early Time Classification with Accumulated Accuracy Gap Control

Authors: Liran Ringel, Regev Cohen, Daniel Freedman, Michael Elad, Yaniv Romano

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Numerical experiments demonstrate the effectiveness, applicability, and usefulness of our method. We show that our proposed early stopping mechanism reduces up to 94% of timesteps used for classification while achieving rigorous accuracy gap control.
Researcher Affiliation | Collaboration | (1) Department of Computer Science, Technion - Israel Institute of Technology, Haifa, Israel; (2) Verily AI, Israel; (3) Department of Electrical and Computer Engineering, Technion - Israel Institute of Technology, Haifa, Israel. Correspondence to: Liran Ringel <liranringel@cs.technion.ac.il>.
Pseudocode | Yes | Algorithm 1 Candidate Screening (Stage 1), Algorithm 2 Testing (Stage 2), Algorithm A.3 Fixed sequence testing for marginal risk control. (An illustrative sketch of fixed-sequence testing appears below the table.)
Open Source Code | Yes | A software package implementing the proposed methods is publicly available at GitHub: https://github.com/liranringel/etc
Open Datasets | Yes | We test the applicability of our methods on five datasets: Tiselac (Ienco, 2017), Electric Devices (Chen et al., 2015), Pen Digits (Alpaydin & Alimoglu, 1998), Crop (Tan et al., 2017), and Walking Sitting Standing (Reyes-Ortiz et al., 2012). These datasets are publicly available via the aeon toolkit. (...) The QuALITY dataset (Pang et al., 2022) (...) The QuAIL dataset (Rogers et al., 2020). (A dataset-loading sketch appears below the table.)
Dataset Splits | Yes | For the calibration of the early stopping rule, we employ 3073 labeled samples to form Dcal while reserving the remaining 1536 samples for testing. (...) To implement and evaluate our methods, we partition each dataset into four distinct sets: 80% of the samples are allocated for model fitting, while the remaining samples are equally divided to form Dcal-1, Dcal-2, and Dtest. (...) We allocate 1/8 of the training samples to a validation set and optimize the model on the remaining 7/8 of the samples. Training continues until there is no improvement in the loss on the validation set for 30 epochs. The model with the best validation set loss is then saved. (A splitting sketch appears below the table.)
Hardware Specification | No | The paper mentions using an 'LSTM model' and notes that models like 'Vicuna-13B' and 'Llama 2 70B' were used, accessible via Hugging Face. However, it does not specify the underlying hardware (e.g., GPU models, CPU types, memory) used to run or train these models for their experiments.
Software Dependencies | No | The paper mentions using the 'vLLM framework' and 'Adam' optimizer, along with 'LSTM' models. However, it does not provide specific version numbers for these or other software dependencies (e.g., Python, PyTorch, TensorFlow, specific libraries) required for reproducibility.
Experiment Setup | Yes | In all experiments, we set the target accuracy gap level to α = 10%, with δ = 1% and = 0.01. (...) We used a standard LSTM for feature extraction with one recurrent layer with a hidden size of 32, except for Walking Sitting Standing where we used 2 recurrent layers, each with a hidden size of 256. (...) We set the hyperparameter γ to 0.2 in all experiments. (...) The optimizer used to minimize the objective function is Adam (Kingma & Ba, 2014), with a learning rate of 0.001, and a batch size of 64. (...) Training continues until there is no improvement in the loss on the validation set for 30 epochs. (A model and training-configuration sketch appears below the table.)
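
The pseudocode entry above cites a fixed-sequence testing routine (Algorithm A.3) for marginal risk control. The sketch below is a generic illustration of that idea rather than the paper's exact algorithm: candidate stopping thresholds are tested in a pre-specified order, each with a Hoeffding-style p-value for a [0, 1]-bounded risk, and testing stops at the first non-rejection. The function names, the p-value choice, and the ordering convention are assumptions for illustration.

```python
import numpy as np

def hoeffding_p_value(emp_risk, n, alpha):
    """One-sided Hoeffding p-value for H0: true risk > alpha, given the
    empirical mean emp_risk of a [0, 1]-bounded loss over n samples."""
    return float(np.exp(-2.0 * n * max(0.0, alpha - emp_risk) ** 2))

def fixed_sequence_testing(lambdas, emp_risks, n, alpha, delta):
    """Test candidate thresholds in their pre-specified order (most to
    least conservative); keep rejecting H0 while the p-value is at or
    below delta and stop at the first failure."""
    validated = []
    for lam, r_hat in zip(lambdas, emp_risks):
        if hoeffding_p_value(r_hat, n, alpha) <= delta:
            validated.append(lam)
        else:
            break  # fixed-sequence rule: no further hypotheses are tested
    return validated
```

A deployment would then use the least conservative threshold in the validated set, which under the standard fixed-sequence argument controls the risk at level alpha with probability at least 1 - delta.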
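For the time-series benchmarks, the report states the datasets are available via the aeon toolkit. A minimal loading sketch follows; the dataset identifiers and the load_classification call are assumptions that may vary across aeon versions.

```python
# Dataset identifiers follow the aeon naming convention and are assumptions;
# the exact strings may differ from those used in the paper.
from aeon.datasets import load_classification

for name in ["Tiselac", "ElectricDevices", "PenDigits", "Crop"]:
    X, y = load_classification(name)  # X: (n_cases, n_channels, n_timepoints)
    print(name, X.shape, len(set(y)))
```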
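The dataset splits described above (80% for model fitting, with 1/8 of that held out for validation, and the remaining 20% divided equally into Dcal-1, Dcal-2, and Dtest) can be reproduced along the lines of the sketch below; the use of scikit-learn and the fixed random seed are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

def split_dataset(X, y, seed=0):
    # 80% of the samples for model fitting, 20% held back.
    X_fit, X_rest, y_fit, y_rest = train_test_split(
        X, y, train_size=0.8, random_state=seed)
    # 1/8 of the fitting samples form the validation set used for early stopping.
    X_train, X_val, y_train, y_val = train_test_split(
        X_fit, y_fit, test_size=1 / 8, random_state=seed)
    # The remaining 20% is cut into three equal parts: Dcal-1, Dcal-2, Dtest.
    X_cal1, X_tmp, y_cal1, y_tmp = train_test_split(
        X_rest, y_rest, test_size=2 / 3, random_state=seed)
    X_cal2, X_test, y_cal2, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.5, random_state=seed)
    return ((X_train, y_train), (X_val, y_val),
            (X_cal1, y_cal1), (X_cal2, y_cal2), (X_test, y_test))
```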
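Finally, a hedged PyTorch sketch of the reported training setup: an LSTM feature extractor (one recurrent layer with hidden size 32, or two layers of size 256 for Walking Sitting Standing) trained with Adam at a learning rate of 0.001 and a batch size of 64, with early stopping after 30 epochs without validation-loss improvement. The per-timestep linear head and the example input/output sizes are assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class EarlyTimeLSTM(nn.Module):
    """LSTM feature extractor with a per-timestep classification head
    (the head is an assumption for illustration)."""
    def __init__(self, n_features, n_classes, hidden_size=32, num_layers=1):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, n_classes)

    def forward(self, x):        # x: (batch, time, n_features)
        h, _ = self.lstm(x)      # hidden state at every timestep
        return self.head(h)      # per-timestep class logits

# Example sizes are placeholders.
model = EarlyTimeLSTM(n_features=10, n_classes=9)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr from the setup
# Training would iterate over batches of 64 and stop once the validation loss
# fails to improve for 30 consecutive epochs, keeping the best checkpoint.
```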