The Neural Testbed: Evaluating Joint Predictions

Authors: Ian Osband, Zheng Wen, Seyed Mohammad Asghari, Vikranth Dwaracherla, Xiuyuan Lu, Morteza Ibrahimi, Dieterich Lawson, Botao Hao, Brendan O'Donoghue, Benjamin Van Roy

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate a range of agents using a simple neural network data generating process. Our results indicate that some popular Bayesian deep learning agents do not fare well with joint predictions, even when they can produce accurate marginal predictions.
Researcher Affiliation | Industry | DeepMind, Efficient Agent Team, Mountain View
Pseudocode | Yes | Figure 3: Algorithm 1 KL-Loss Estimation (a Monte Carlo sketch of this estimator appears after this table)
Open Source Code | Yes | Together with this conceptual contribution, we open-source code in Appendix A. This consists of highly optimized evaluation code, reference agent implementations and automated reproducible analysis.
Open Datasets | No | The Neural Testbed works by generating random classification problems using a neural-network-based generative process (sketched after this table). The paper emphasizes using a generative model to produce unlimited data, rather than relying on a fixed, publicly available dataset with concrete access information or a formal citation.
Dataset Splits | No | The paper states: 'The testbed splits data into a training set and testing set, allows a deep learning agent to train on the training set, and then evaluates the quality of the predictions on the testing set.' It does not explicitly mention a separate validation set or describe validation splits.
Hardware Specification | No | The paper states: 'Our experiments make extensive use of parallel computation to facilitate hyperparameter sweeps. Nevertheless, the overall computational cost is relatively low by modern deep learning standards and relies only on standard CPUs.' It does not provide specific models or detailed specifications for the hardware used.
Software Dependencies | No | The paper mentions 'The testbed uses JAX internally (Bradbury et al., 2018), but can be used to evaluate any python agent.' However, it does not specify version numbers for JAX or other key software components used in their experiments.
Experiment Setup | Yes | Table 1 lists agents that we study and compare as well as hyperparameters that we tune. In our experiments, we optimize these hyperparameters via grid search (a generic grid-search sketch appears after this table).
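
To make the pseudocode row above more concrete, the following is a minimal sketch of a Monte Carlo KL-loss estimate for a joint prediction over tau test points, in the spirit of Algorithm 1. The function name, the argument shapes, and the assumption that both the true environment and the agent expose logits are illustrative choices, not the testbed's actual API.

    import jax
    import jax.numpy as jnp

    def joint_kl_estimate(true_logits, agent_logit_samples, y):
        """One Monte Carlo term of the KL loss on a batch of tau test points.

        true_logits:         [tau, num_classes] logits from the true environment.
        agent_logit_samples: [num_samples, tau, num_classes] logits, one set per
                             sampled epistemic index (assumed agent output).
        y:                   [tau] integer labels sampled from the true environment.
        """
        tau = y.shape[0]
        # Log-likelihood of the sampled labels under the true environment.
        true_ll = jax.nn.log_softmax(true_logits)[jnp.arange(tau), y].sum()
        # The agent's joint likelihood averages the product of per-point
        # likelihoods over epistemic samples; work in log space for stability.
        per_sample_ll = jax.nn.log_softmax(agent_logit_samples)[
            :, jnp.arange(tau), y].sum(axis=-1)
        agent_ll = jax.scipy.special.logsumexp(per_sample_ll) - jnp.log(
            per_sample_ll.shape[0])
        # Averaging (true_ll - agent_ll) over many sampled problems and test
        # batches yields a KL-loss estimate of this form.
        return true_ll - agent_ll

The same estimator applies to marginal predictions with tau = 1; the gap between the marginal and joint settings is what separates agents in the quoted result above.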
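
The 'Open Datasets' row describes a neural-network-based generative process rather than a fixed dataset. Below is a minimal sketch of such a process, assuming Gaussian inputs, a randomly initialized two-layer MLP as the true environment, and a softmax temperature; these specific choices are assumptions for illustration, and the open-sourced testbed defines its own generative model and default settings.

    import jax
    import jax.numpy as jnp

    def sample_classification_problem(key, num_train=100, num_test=1000,
                                      input_dim=2, hidden=50, num_classes=2,
                                      temperature=0.1):
        """Sample one random classification problem from a random MLP."""
        k_env1, k_env2, k_x, k_y = jax.random.split(key, 4)
        # A randomly initialized MLP acts as the "true" environment.
        w1 = jax.random.normal(k_env1, (input_dim, hidden)) / jnp.sqrt(input_dim)
        w2 = jax.random.normal(k_env2, (hidden, num_classes)) / jnp.sqrt(hidden)

        def true_logits(x):
            return jax.nn.relu(x @ w1) @ w2 / temperature

        # Sample inputs and labels, then split into train and test sets.
        x = jax.random.normal(k_x, (num_train + num_test, input_dim))
        y = jax.random.categorical(k_y, true_logits(x))
        train = (x[:num_train], y[:num_train])
        test = (x[num_train:], y[num_train:])
        return train, test, true_logits

Because a fresh problem can be sampled for every evaluation seed, the testbed does not depend on a fixed public dataset, which is the point the row above makes.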
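
Since the 'Software Dependencies' and 'Experiment Setup' rows note that any Python agent can be evaluated and that hyperparameters are tuned via grid search, here is a small sketch of a generic grid-search loop over an agent factory. The fit method, the evaluate_kl_loss helper, and the example hyperparameters are assumptions made for illustration; they are not the testbed's API or the contents of Table 1.

    import itertools

    def grid_search(agent_factory, param_grid, evaluate_kl_loss, problems):
        """Pick the hyperparameters with the lowest mean KL-loss estimate.

        agent_factory:    maps a hyperparameter dict to a fresh, untrained agent.
        param_grid:       dict of hyperparameter name -> list of candidate values.
        evaluate_kl_loss: callable (agent, problem) -> scalar KL-loss estimate.
        problems:         list of sampled testbed problems (train/test data).
        """
        best_params, best_loss = None, float('inf')
        names, values = zip(*param_grid.items())
        for combo in itertools.product(*values):
            params = dict(zip(names, combo))
            losses = []
            for problem in problems:
                agent = agent_factory(params)
                agent.fit(*problem.train)  # assumed training hook on the agent
                losses.append(evaluate_kl_loss(agent, problem))
            mean_loss = sum(losses) / len(losses)
            if mean_loss < best_loss:
                best_params, best_loss = params, mean_loss
        return best_params, best_loss

    # Hypothetical usage, with made-up hyperparameter names and values:
    # best, loss = grid_search(make_ensemble_agent,
    #                          {'learning_rate': [1e-3, 1e-2],
    #                           'num_ensemble': [10, 30, 100]},
    #                          evaluate_kl_loss, sampled_problems)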