Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Training independent subnetworks for robust prediction
Authors: Marton Havasi, Rodolphe Jenatton, Stanislav Fort, Jeremiah Zhe Liu, Jasper Snoek, Balaji Lakshminarayanan, Andrew M. Dai, Dustin Tran
ICLR 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We observe a significant improvement in negative log-likelihood, accuracy, and calibration error on CIFAR10, CIFAR100, ImageNet, and their out-of-distribution variants compared to previous methods. |
| Researcher Affiliation | Collaboration | Marton Havasi, Department of Engineering, University of Cambridge; Rodolphe Jenatton, Google Research; Stanislav Fort, Stanford University; Jeremiah Zhe Liu, Google Research & Harvard University; Jasper Snoek, Google Research; Balaji Lakshminarayanan, Google Research; Andrew M. Dai, Google Research; Dustin Tran, Google Research |
| Pseudocode | Yes | Algorithm 1 Train(X) ... Algorithm 2 Evaluate(x) (a hedged sketch of this train/evaluate scheme follows the table) |
| Open Source Code | Yes | MIMO's code is open-sourced. https://github.com/google/edward2/tree/master/experimental/mimo |
| Open Datasets | Yes | We observe a significant improvement in negative log-likelihood, accuracy, and calibration error on CIFAR10, CIFAR100, ImageNet, and their out-of-distribution variants compared to previous methods. |
| Dataset Splits | No | The paper explicitly mentions 'training' and 'test' sets for the synthetic example and refers to 'CIFAR10', 'CIFAR100', and 'ImageNet' for the main experiments, but it does not specify a distinct validation split or a split methodology for those experiments. |
| Hardware Specification | Yes | To measure computational cost, we look at how long it takes to evaluate the model on a TPUv2 core, measured in ms per example. |
| Software Dependencies | No | The paper mentions using the 'Uncertainty Baselines' framework but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, or TensorFlow versions). |
| Experiment Setup | Yes | For the ResNet28-10/CIFAR models, we use a batch-size of 512, a decaying learning rate of 0.1 (decay rate 0.1) and L2 regularization 2e-4. The Deterministic, Dropout and Ensemble models are trained for 200 epochs while BatchEnsemble, Naive multihead and TreeNet are trained for 250 epochs. For MIMO, we use the hyperparameters of the baseline implementations wherever possible. For the ResNet28-10/CIFAR models, we use a batch-size of 512 with decaying learning rate of 0.1 (decay rate 0.1), L2 regularization 3e-4, 250 training epochs, and a batch repetition of 4. For the ResNet50/ImageNet models, we use a batch-size of 4096 with decaying learning rate of 0.1 (decay rate 0.1), L2 regularization 1e-4, 150 training epochs, and batch repetition of 2. (The MIMO settings are collected in a config sketch below the table.) |
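
The pseudocode row above refers to the paper's Algorithm 1 (Train) and Algorithm 2 (Evaluate). Below is a minimal PyTorch sketch of that multi-input multi-output scheme, not the authors' open-sourced edward2 implementation linked above; the ensemble size `M`, the `backbone` module, `feat_dim`, and `NUM_CLASSES` are illustrative assumptions, and the paper's batch-repetition and input-repetition details are omitted.

```python
# Minimal sketch of MIMO training/evaluation (assumptions noted above);
# not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

M = 3             # ensemble size (number of subnetworks) -- an assumption
NUM_CLASSES = 10  # e.g. CIFAR-10

class MIMOWrapper(nn.Module):
    """Wraps a backbone so it consumes M stacked inputs and emits M heads."""
    def __init__(self, backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.backbone = backbone  # assumed to map (B, M*C, H, W) -> (B, feat_dim)
        self.head = nn.Linear(feat_dim, M * NUM_CLASSES)

    def forward(self, x):                  # x: (B, M, C, H, W)
        b = x.shape[0]
        x = x.flatten(1, 2)                # concatenate the M inputs channel-wise
        feats = self.backbone(x)
        return self.head(feats).view(b, M, NUM_CLASSES)

def train_step(model, opt, xs, ys):
    """Algorithm 1 (Train): each subnetwork gets an independent (x, y) pair."""
    # xs: (B, M, C, H, W), ys: (B, M) -- M independently shuffled examples
    logits = model(xs)                     # (B, M, NUM_CLASSES)
    loss = F.cross_entropy(logits.flatten(0, 1), ys.flatten())
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

@torch.no_grad()
def evaluate(model, x):
    """Algorithm 2 (Evaluate): repeat one input M times, average the M heads."""
    xs = x.unsqueeze(1).expand(-1, M, -1, -1, -1)  # (B, M, C, H, W)
    probs = model(xs).softmax(-1)          # (B, M, NUM_CLASSES)
    return probs.mean(dim=1)               # ensemble-averaged prediction
```

The point the sketch captures is the asymmetry between the two phases: training feeds M independently sampled examples per forward pass so the subnetworks decorrelate, while evaluation feeds M copies of the same input and averages the M predictions as an implicit ensemble.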
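For quick reference, the MIMO hyperparameters quoted in the Experiment Setup row can be collected as a plain config mapping. This is only a restatement of the reported values; the key names are assumptions, and anything not listed (e.g., learning-rate decay milestones) is unspecified in the quote.

```python
# MIMO hyperparameters as quoted in the Experiment Setup row.
# Key names are assumptions; only the numeric values come from the paper.
MIMO_CONFIGS = {
    "resnet28-10/cifar": {
        "batch_size": 512,
        "base_lr": 0.1,
        "lr_decay_rate": 0.1,
        "l2_regularization": 3e-4,
        "train_epochs": 250,
        "batch_repetition": 4,
    },
    "resnet50/imagenet": {
        "batch_size": 4096,
        "base_lr": 0.1,
        "lr_decay_rate": 0.1,
        "l2_regularization": 1e-4,
        "train_epochs": 150,
        "batch_repetition": 2,
    },
}
```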