Adaptive Labeling for Efficient Out-of-distribution Model Evaluation
Authors: Daksh Mittal, Yuanzhe Ma, Shalmali Joshi, Hongseok Namkoong
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On synthetic and real datasets, we empirically demonstrate that even a one-step lookahead policy substantially outperforms active learning-inspired heuristics. We demonstrate the effectiveness of our planning framework on both synthetic and real datasets by focusing on the simplest planning algorithm: 1-step lookaheads. In Figures 4 and 5, we present results for the simulated data. In Figures 6 and 7, we observe similar findings on the eICU data. |
| Researcher Affiliation | Academia | Daksh Mittal, Yuanzhe Ma, Shalmali Joshi, Hongseok Namkoong, Columbia University. {dm3766, ym2865, sj3261, hn2369}@columbia.edu |
| Pseudocode | Yes | Algorithm 1: Autodiff 1-lookahead; Algorithm 2: One-step look-ahead policy gradient; Algorithm 3: Soft K-subset sampling algorithm; Algorithm 4: Weighted Gaussian process regression. A hedged sketch of the soft K-subset relaxation appears after the table. |
| Open Source Code | Yes | Our codebase is available at https://github.com/namkoong-lab/adaptive-labeling. |
| Open Datasets | Yes | On synthetic and real datasets, we empirically demonstrate that even a one-step lookahead policy substantially outperforms active learning-inspired heuristics. We simulate selection bias from the eICU dataset [30], which contains real-world patient data with in-hospital mortality outcomes. [30] T. J. Pollard, A. E. Johnson, J. D. Raffa, L. A. Celi, R. G. Mark, and O. Badawi. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Scientific Data, 5(1):1–13, 2018. |
| Dataset Splits | Yes | Our goal is to sequentially label batches of data to accurately estimate model performance over P_X, and we therefore assume access to a set of inputs X_eval ~ P_X. The setup includes 100 initial labeled data points, 500 pool points, and 285 evaluation points used to estimate the objective. |
| Hardware Specification | No | Due to computational constraints, we test Ensemble+ in a toy setting to demonstrate the generalizability of our framework. This requires significant computational resources, in sharp contrast to the GPs, where the posteriors are in closed form and can be readily updated and differentiated. The paper discusses computational resources in terms of efficiency and constraints but does not specify hardware. |
| Software Dependencies | No | For policy optimization in each horizon, we use the Adam optimizer with a learning rate of 0.1 to perform policy gradient steps over 100 epochs. To differentiate through the argmin operation, we employ the differentiable optimizer from the torchopt package, specifically the MetaAdam optimizer with a learning rate of 0.1. No version numbers are given for common libraries. A usage sketch of the torchopt pattern appears after the table. |
| Experiment Setup | Yes | For soft K-subset sampling (see Algorithm 3), we set τ = 0.1. To evaluate the objective Var(g(f)), we take 100 samples of f(X_eval) from the posterior state µ_a(θ) plus posterior noise (see Algorithm 2). For policy optimization in each horizon, we use the Adam optimizer with a learning rate of 0.1 to perform policy gradient steps over 100 epochs. We set ℓ = 1, σ_f² = 0.69, and σ² = 0.01. We tune λ to 0.1 and use the Adam optimizer with a tuned learning rate of 0.1. Each model is trained for 50 iterations. A sketch of the Var(g(f)) Monte Carlo estimate appears after the table. |
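
The paper's Algorithm 3 is only named in the rows above. For orientation, here is a minimal sketch of one standard differentiable K-subset relaxation (successive Gumbel-softmax draws, in the spirit of Xie and Ermon, 2019) at the quoted temperature τ = 0.1. The function name and the exact relaxation are assumptions; the paper's Algorithm 3 may differ in detail.

```python
import torch

def soft_k_subset(logits: torch.Tensor, k: int, tau: float = 0.1) -> torch.Tensor:
    """Relaxed K-subset sample: soft inclusion weights in [0, 1] summing to ~k.

    A sketch of a standard relaxation, not necessarily the paper's Algorithm 3.
    """
    # Gumbel perturbation turns the relaxed top-K into a sample, not an argmax.
    gumbel = -torch.log(-torch.log(torch.rand_like(logits)))
    scores = logits + gumbel
    weights = torch.zeros_like(logits)
    for _ in range(k):
        probs = torch.softmax(scores / tau, dim=-1)
        weights = weights + probs
        # Down-weight mass already selected so the next draw favors new items.
        scores = scores + torch.log1p(-probs.clamp(max=1.0 - 1e-6))
    return weights

# Example: softly select 10 of 500 pool points; gradients flow to the logits.
w = soft_k_subset(torch.randn(500, requires_grad=True), k=10)
```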
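The dependencies row cites torchopt's MetaAdam without pinning a version. The sketch below shows the differentiate-through-the-inner-optimizer pattern the quote describes; the inner model, data, and losses are hypothetical stand-ins, not the paper's pipeline.

```python
import torch
import torch.nn as nn
import torchopt

net = nn.Linear(4, 1)                       # hypothetical inner model
inner_opt = torchopt.MetaAdam(net, lr=0.1)  # differentiable Adam, as quoted

x, y = torch.randn(32, 4), torch.randn(32, 1)

# Each inner step stays on the autograd graph, so the outer objective can
# later be differentiated through the whole inner optimization trajectory.
for _ in range(5):
    inner_opt.step(nn.functional.mse_loss(net(x), y))

# Outer objective; backward() sends gradients to the pre-adaptation (leaf)
# parameters, MAML-style.
outer_loss = nn.functional.mse_loss(net(x), y)
outer_loss.backward()
```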
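The setup row estimates the objective Var(g(f)) from 100 GP posterior samples with ℓ = 1, σ_f² = 0.69, and σ² = 0.01. Below is a minimal Monte Carlo sketch under exactly those hyperparameters; the metric g and the weighting in Algorithm 4 (weighted GP regression) are not reproduced here, so treat both as assumptions.

```python
import torch

def rbf_kernel(A, B, ell=1.0, sigma_f2=0.69):
    # Squared-exponential kernel with the hyperparameters quoted above.
    return sigma_f2 * torch.exp(-0.5 * torch.cdist(A, B).pow(2) / ell**2)

def var_of_g(X_train, y_train, X_eval, g, n_samples=100, sigma_n2=0.01):
    """Monte Carlo estimate of Var(g(f)) under the GP posterior at X_eval."""
    K = rbf_kernel(X_train, X_train) + sigma_n2 * torch.eye(len(X_train))
    Ks = rbf_kernel(X_train, X_eval)
    sol = torch.linalg.solve(K, torch.cat([y_train.unsqueeze(1), Ks], dim=1))
    mu = Ks.T @ sol[:, 0]                                   # posterior mean
    cov = rbf_kernel(X_eval, X_eval) - Ks.T @ sol[:, 1:]    # posterior covariance
    L = torch.linalg.cholesky(cov + 1e-6 * torch.eye(len(X_eval)))
    # Draw f(X_eval) = mu + L z for each posterior sample, then evaluate g.
    samples = mu + torch.randn(n_samples, len(X_eval)) @ L.T
    return torch.stack([g(f) for f in samples]).var()

# Example with a hypothetical metric g (mean of a sigmoid readout):
# v = var_of_g(X_lab, y_lab, X_eval, g=lambda f: torch.sigmoid(f).mean())
```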