Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Detecting Extrapolation with Local Ensembles
Authors: David Madras, James Atwood, Alexander D'Amour
ICLR 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, we show that our method is capable of detecting when a pretrained model is extrapolating on test data, with applications to out-of-distribution detection, detecting spurious correlates, and active learning. |
| Researcher Affiliation | Collaboration | David Madras, University of Toronto, Vector Institute, EMAIL; James Atwood, Google Brain, EMAIL; Alex D'Amour, Google Brain, EMAIL |
| Pseudocode | Yes | B.1 LANCZOS ALGORITHM CODE SNIPPET. Figure 9: Example Python implementation of Lanczos algorithm for tridiagonalizing an implicit matrix M. |
| Open Source Code | Yes | Code for running the local ensembles method can be found at https://github.com/dmadras/local-ensembles. |
| Open Datasets | Yes | Boston (Harrison Jr & Rubinfeld, 1978) and Diabetes (Efron et al., 2004). These datasets were loaded from Scikit-Learn (Pedregosa et al., 2011). Abalone (Nash et al., 1994). This dataset was downloaded from the UCI repository (Dua & Graff, 2017) at http://archive.ics.uci.edu/ml/datasets/Abalone. Wine Quality (Cortez et al., 2009). This dataset was downloaded from the UCI repository (Dua & Graff, 2017) at http://archive.ics.uci.edu/ml/datasets/Wine+Quality. We use MNIST (LeCun et al., 2010) and Fashion MNIST (Xiao et al., 2017) for our active learning experiments. We use the CelebA dataset (Liu et al., 2015) of celebrity faces |
| Dataset Splits | Yes | We sample the validation set randomly as 20% of the training set. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions software like Python, NumPy, Scikit-learn, and TensorFlow Datasets, but it does not provide specific version numbers for these dependencies to ensure reproducibility. |
| Experiment Setup | Yes | We train a two-layer neural network with 3 hidden units in each layer and tanh units. We train for 400 optimization steps using minibatch size 32. We use batch size 64, patience 100 and a 100-step running average window for estimating current performance. For the Lanczos iteration, we run up to 2000 iterations. We use batch size 32, patience 100 steps, and a 100-step running average window for estimating current performance. We use two convolutional layers with 16 and 32 layers, stride size 5, and a dense layer on top with 64 units. We trained all models with mean squared error loss. |
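
The Pseudocode row refers to a Lanczos routine for tridiagonalizing an implicit matrix M. As a rough illustration of that idea (not the paper's Figure 9 code and not taken from the linked repository), the following minimal NumPy sketch runs Lanczos using only matrix-vector products. The function name `lanczos_tridiagonalize`, its parameters, and the choice of full reorthogonalization are assumptions made for this example.

```python
import numpy as np


def lanczos_tridiagonalize(matvec, dim, num_iters, seed=0):
    """Hypothetical helper: Lanczos on an implicit symmetric matrix M.

    matvec: callable returning M @ v for a vector v of length `dim`.
    Returns (alphas, betas, Q): the diagonal and off-diagonal of the
    tridiagonal matrix T, and the Lanczos vectors as columns of Q.
    """
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(dim)
    q /= np.linalg.norm(q)

    vectors = [q]
    alphas, betas = [], []
    for k in range(num_iters):
        w = matvec(vectors[-1])
        alphas.append(vectors[-1] @ w)
        # Full reorthogonalization (an assumption of this sketch):
        # remove components along every previous Lanczos vector.
        for v in vectors:
            w = w - (v @ w) * v
        beta = np.linalg.norm(w)
        # Stop at the iteration budget or when the Krylov space is exhausted.
        if k == num_iters - 1 or beta < 1e-10:
            break
        betas.append(beta)
        vectors.append(w / beta)

    return np.array(alphas), np.array(betas), np.stack(vectors, axis=1)


if __name__ == "__main__":
    # Small explicit symmetric matrix standing in for the implicit M.
    rng = np.random.default_rng(1)
    B = rng.standard_normal((6, 6))
    A = (B + B.T) / 2
    alphas, betas, Q = lanczos_tridiagonalize(lambda v: A @ v, dim=6, num_iters=6)
    T = np.diag(alphas) + np.diag(betas, 1) + np.diag(betas, -1)
    print(np.allclose(Q.T @ A @ Q, T, atol=1e-6))
```

In a setting like the one the table describes, `matvec` would presumably be a Hessian-vector product rather than an explicit matrix multiply; that substitution is not shown here.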