Adaptive Labeling for Efficient Out-of-distribution Model Evaluation
Authors: Daksh Mittal, Yuanzhe Ma, Shalmali Joshi, Hongseok Namkoong
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On synthetic and real datasets, we empirically demonstrate that even a one-step lookahead policy substantially outperforms active learning-inspired heuristics. We demonstrate the effectiveness of our planning framework on both synthetic and real datasets by focusing on the simplest planning algorithm: 1-step lookaheads. In Figures 4 and 5, we present results for the simulated data. In Figures 6 and 7, we observe similar findings on the eICU data. |
| Researcher Affiliation | Academia | Daksh Mittal, Yuanzhe Ma, Shalmali Joshi, Hongseok Namkoong, Columbia University. {dm3766, ym2865, sj3261, hn2369}@columbia.edu |
| Pseudocode | Yes | Algorithm 1: Autodiff 1-lookahead; Algorithm 2: One-step look-ahead policy gradient; Algorithm 3: Soft K-subset sampling algorithm; Algorithm 4: Weighted Gaussian process regression. A hedged sketch of the soft K-subset relaxation appears after the table. |
| Open Source Code | Yes | Our codebase is available at https://github.com/namkoong-lab/adaptive-labeling. |
| Open Datasets | Yes | On synthetic and real datasets, we empirically demonstrate that even a one-step lookahead policy substantially outperforms active learning-inspired heuristics. We simulate selection bias from the eICU dataset [30], which contains real-world patient data with in-hospital mortality outcomes. [30] T. J. Pollard, A. E. Johnson, J. D. Raffa, L. A. Celi, R. G. Mark, and O. Badawi. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Scientific Data, 5(1):1–13, 2018. |
| Dataset Splits | Yes | Our goal is to sequentially label batches of data to accurately estimate model performance over P_X, and we therefore assume access to a set of inputs X_eval ~ P_X. The setup includes 100 initial labeled data points, 500 pool points, and 285 evaluation points used to estimate the objective. |
| Hardware Specification | No | Due to computational constraints, we test Ensemble+ in a toy setting to demonstrate the generalizability of our framework. This requires significant computational resources, in sharp contrast to the GPs, where the posteriors are in closed form and can be readily updated and differentiated. The paper discusses computational resources in terms of efficiency and constraints but does not specify hardware. |
| Software Dependencies | No | For policy optimization in each horizon, we use the Adam optimizer with a learning rate of 0.1 to perform policy gradient steps over 100 epochs. To differentiate through the argmin operation, we employ the differentiable optimizer from the torchopt package, specifically the MetaAdam optimizer with a learning rate of 0.1. No version numbers are given for common libraries. A usage sketch of the torchopt pattern appears after the table. |
| Experiment Setup | Yes | For soft K-subset sampling (see Algorithm 3), we set τ = 0.1. To evaluate the objective Var(g(f)), we take 100 samples of f(X_eval) from the posterior state µ_a(θ) plus posterior noise (see Algorithm 2). For policy optimization in each horizon, we use the Adam optimizer with a learning rate of 0.1 to perform policy gradient steps over 100 epochs. We set ℓ = 1, σ_f² = 0.69, and σ² = 0.01. We tune λ to 0.1 and use the Adam optimizer with a tuned learning rate of 0.1. Each model is trained for 50 iterations. A sketch of the Var(g(f)) Monte Carlo estimate appears after the table. |
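
The paper's Algorithm 3 is only named in the rows above. For orientation, here is a minimal sketch of one standard differentiable K-subset relaxation (successive Gumbel-softmax draws, in the spirit of Xie and Ermon, 2019) at the quoted temperature τ = 0.1. The function name and the exact relaxation are assumptions; the paper's Algorithm 3 may differ in detail.

```python
import torch

def soft_k_subset(logits: torch.Tensor, k: int, tau: float = 0.1) -> torch.Tensor:
    """Relaxed K-subset sample: soft inclusion weights in [0, 1] summing to ~k.

    A sketch of a standard relaxation, not necessarily the paper's Algorithm 3.
    """
    # Gumbel perturbation turns the relaxed top-K into a sample, not an argmax.
    gumbel = -torch.log(-torch.log(torch.rand_like(logits)))
    scores = logits + gumbel
    weights = torch.zeros_like(logits)
    for _ in range(k):
        probs = torch.softmax(scores / tau, dim=-1)
        weights = weights + probs
        # Down-weight mass already selected so the next draw favors new items.
        scores = scores + torch.log1p(-probs.clamp(max=1.0 - 1e-6))
    return weights

# Example: softly select 10 of 500 pool points; gradients flow to the logits.
w = soft_k_subset(torch.randn(500, requires_grad=True), k=10)
```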
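The dependencies row cites torchopt's MetaAdam without pinning a version. The sketch below shows the differentiate-through-the-inner-optimizer pattern the quote describes; the inner model, data, and losses are hypothetical stand-ins, not the paper's pipeline.

```python
import torch
import torch.nn as nn
import torchopt

net = nn.Linear(4, 1)                       # hypothetical inner model
inner_opt = torchopt.MetaAdam(net, lr=0.1)  # differentiable Adam, as quoted

x, y = torch.randn(32, 4), torch.randn(32, 1)

# Each inner step stays on the autograd graph, so the outer objective can
# later be differentiated through the whole inner optimization trajectory.
for _ in range(5):
    inner_opt.step(nn.functional.mse_loss(net(x), y))

# Outer objective; backward() sends gradients to the pre-adaptation (leaf)
# parameters, MAML-style.
outer_loss = nn.functional.mse_loss(net(x), y)
outer_loss.backward()
```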
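The setup row estimates the objective Var(g(f)) from 100 GP posterior samples with ℓ = 1, σ_f² = 0.69, and σ² = 0.01. Below is a minimal Monte Carlo sketch under exactly those hyperparameters; the metric g and the weighting in Algorithm 4 (weighted GP regression) are not reproduced here, so treat both as assumptions.

```python
import torch

def rbf_kernel(A, B, ell=1.0, sigma_f2=0.69):
    # Squared-exponential kernel with the hyperparameters quoted above.
    return sigma_f2 * torch.exp(-0.5 * torch.cdist(A, B).pow(2) / ell**2)

def var_of_g(X_train, y_train, X_eval, g, n_samples=100, sigma_n2=0.01):
    """Monte Carlo estimate of Var(g(f)) under the GP posterior at X_eval."""
    K = rbf_kernel(X_train, X_train) + sigma_n2 * torch.eye(len(X_train))
    Ks = rbf_kernel(X_train, X_eval)
    sol = torch.linalg.solve(K, torch.cat([y_train.unsqueeze(1), Ks], dim=1))
    mu = Ks.T @ sol[:, 0]                                   # posterior mean
    cov = rbf_kernel(X_eval, X_eval) - Ks.T @ sol[:, 1:]    # posterior covariance
    L = torch.linalg.cholesky(cov + 1e-6 * torch.eye(len(X_eval)))
    # Draw f(X_eval) = mu + L z for each posterior sample, then evaluate g.
    samples = mu + torch.randn(n_samples, len(X_eval)) @ L.T
    return torch.stack([g(f) for f in samples]).var()

# Example with a hypothetical metric g (mean of a sigmoid readout):
# v = var_of_g(X_lab, y_lab, X_eval, g=lambda f: torch.sigmoid(f).mean())
```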