Post-Hoc Reversal: Are We Selecting Models Prematurely?

Authors: Rishabh Ranjan, Saurabh Garg, Mrigank Raman, Carlos Guestrin, Zachary Lipton

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we challenge this practice with an extensive empirical study. In particular, we demonstrate a phenomenon that we call post-hoc reversal, where performance trends are reversed after applying post-hoc transforms. This phenomenon is especially prominent in high-noise settings. For example, while base models overfit badly early in training, both ensembling and SWA favor base models trained for more epochs. Post-hoc reversal can also prevent the appearance of double descent and mitigate mismatches between test loss and test error seen in base models. Preliminary analyses suggest that these transforms induce reversal by suppressing the influence of mislabeled examples, exploiting differences in their learning dynamics from those of clean examples. Based on our findings, we propose post-hoc selection, a simple technique whereby post-hoc metrics inform model development decisions such as early stopping, checkpointing, and broader hyperparameter choices. Our experiments span real-world vision, language, tabular and graph datasets. (An illustrative sketch of post-hoc selection appears below the table.)
Researcher Affiliation | Academia | Rishabh Ranjan1, Saurabh Garg2, Mrigank Raman2, Carlos Guestrin1,3, Zachary Lipton2; 1Stanford University, 2Carnegie Mellon University, 3Chan Zuckerberg Biohub. {ranjanr,guestrin}@stanford.edu, {sgarg2,mrigankr,zlipton}@cmu.edu
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/rishabh-ranjan/post-hoc-reversal.
Open Datasets | Yes | We focus on the CIFAR-N dataset [74]. CIFAR-10-N uses the same images as CIFAR-10 but provides multiple human-annotated label sets, allowing the study of realistic noise patterns of varying levels in a controlled manner. [...] FMoW [9, 34]. This is the version of the original FMoW dataset [9] as used in the WILDS benchmark [34]. [...] Guanaco [12]. This is a subset of the OASST1 dataset [37] containing only the highest-rated paths in the conversation tree. [...] Yelp [5]. This is a subset of the Yelp Dataset Challenge 2015 dataset with 25k reviews in the train set and 5k reviews each in the validation and test sets. [...] Folktables [14]. Folktables consists of 5 classification tasks based on the US Census: Income, Employment, Health, Travel Time and Public Coverage. [...] Collab and Reddit [52, 82]. These datasets are from TUDataset [52], and were originally introduced by Yanardag and Vishwanathan [82].
Dataset Splits | Yes | Training, validation and test sets are drawn i.i.d. from the data distribution D. [...] Table 2: Dataset Details (CIFAR-10 row): Train Size 40000, Val Size 5000, Test Size 5000, Classes 10, Input Size 3×32×32. (An illustrative split sketch appears below the table.)
Hardware Specification | Yes | We trained our models under these hyperparameters on 48 GB A6000 GPUs in a single-GPU setup, except for LLaMA-2-7B fine-tuning on Guanaco, for which we used 80 GB A100 GPUs.
Software Dependencies | No | The paper mentions using 'torchcal' and 'pytorch-minimize' but does not provide specific version numbers for these or other key software components like Python, PyTorch, or CUDA.
Experiment Setup | Yes | Table 3: Training Details.
C-10/100-N: ResNet18-D [22], pretrained, SGD, LR 0.1, weight decay 5e-4, cosine schedule, 100 epochs, batch size 500.
FMoW: DenseNet121 [26], pretrained, Adam, LR 1e-4, weight decay 0, constant schedule, 50 epochs, batch size 64.
Guanaco: LLaMA-2-7B [70], pretrained, Adam, LR 2e-4, weight decay 0, constant schedule, 6 epochs, batch size 16.
Yelp: BERT [13], pretrained, AdamW, LR 5e-5, weight decay 1e-2, linear schedule, 25 epochs, batch size 16.
Folktables: MLP, not pretrained, Adam, LR 0.01, weight decay 0, exponential schedule, 50 epochs, batch size 256.
Collab: GIN [80], not pretrained, Adam, LR 0.01, weight decay 0, exponential schedule, 500 epochs, batch size 128.
Reddit: GCN [33], not pretrained, Adam, LR 0.01, weight decay 0, exponential schedule, 500 epochs, batch size 128.
(A configuration sketch for the C-10/100-N row appears below the table.)
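
To make the post-hoc selection idea quoted in the Research Type row concrete, here is a minimal sketch. It is not the authors' released implementation (see the GitHub repository linked above); the helper `post_hoc_selection`, the `nll` metric, and the use of a K-model prediction ensemble as the post-hoc transform are illustrative assumptions. The point it demonstrates is the paper's proposal: choose the stopping epoch by the validation metric of the post-hoc transform rather than by the base model's own validation metric.

```python
# Sketch only: post-hoc selection picks the stopping epoch using the validation
# metric of the post-hoc transform (here, a prediction ensemble over K
# independently trained base models), not an individual base model's metric.
import numpy as np

def nll(probs, labels):
    """Mean negative log-likelihood of the true labels under predicted probs."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def post_hoc_selection(val_probs, val_labels):
    """val_probs: array of shape (K, T, n_val, n_classes) holding softmax
    outputs of K base models at each of T checkpointed epochs."""
    K, T = val_probs.shape[:2]
    # Base metric: average validation NLL of the individual models at epoch t.
    base_scores = [np.mean([nll(val_probs[k, t], val_labels) for k in range(K)])
                   for t in range(T)]
    # Post-hoc metric: validation NLL of the ensembled prediction at epoch t.
    ens_scores = [nll(val_probs[:, t].mean(axis=0), val_labels)
                  for t in range(T)]
    # "Premature" selection stops at argmin of base_scores;
    # post-hoc selection stops at argmin of ens_scores instead.
    return int(np.argmin(base_scores)), int(np.argmin(ens_scores))

# Toy usage with random predictions (illustration only).
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=200)
probs = rng.dirichlet(np.ones(10), size=(3, 20, 200))  # K=3 models, T=20 epochs
base_epoch, post_hoc_epoch = post_hoc_selection(probs, labels)
print(f"epoch chosen by base metric: {base_epoch}; by post-hoc metric: {post_hoc_epoch}")
```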
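
The 40000/5000/5000 CIFAR-10 split quoted in the Dataset Splits row can be reproduced in spirit with a few lines of PyTorch. This is an assumption-laden sketch: the seed and the exact procedure the authors used to carve validation and test sets out of the 50k CIFAR-10 training images are not taken from the paper.

```python
import torch
from torch.utils.data import random_split
from torchvision import transforms
from torchvision.datasets import CIFAR10

# Standard 50k-image CIFAR-10 training set (CIFAR-10-N reuses these images
# with human-annotated noisy label sets).
full_train = CIFAR10(root="data", train=True, download=True,
                     transform=transforms.ToTensor())

# Carve out train/val/test i.i.d.; the seed is an assumption, not from the paper.
generator = torch.Generator().manual_seed(0)
train_set, val_set, test_set = random_split(full_train, [40000, 5000, 5000],
                                            generator=generator)
print(len(train_set), len(val_set), len(test_set))  # 40000 5000 5000
```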
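
Finally, a hedged configuration sketch of the C-10/100-N row of Table 3 (SGD, learning rate 0.1, weight decay 5e-4, cosine schedule, 100 epochs, batch size 500). torchvision's plain ResNet-18 stands in for the paper's ResNet18-D variant, and the momentum value is a common default rather than a figure reported in the table.

```python
from torch import nn, optim
from torchvision.models import resnet18

EPOCHS, BATCH_SIZE = 100, 500  # from Table 3 (C-10/100-N row)

model = resnet18(num_classes=10)  # stand-in for ResNet18-D; Table 3 marks this model as pretrained
optimizer = optim.SGD(model.parameters(), lr=0.1,
                      momentum=0.9,          # assumption: momentum is not listed in Table 3
                      weight_decay=5e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)
criterion = nn.CrossEntropyLoss()

for epoch in range(EPOCHS):
    # ... one pass over a CIFAR-10/100-N training loader with batch size 500 ...
    scheduler.step()
```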