Sequential Learning of Neural Networks for Prequential MDL

Authors: Jorg Bornschein, Yazhe Li, Marcus Hutter

ICLR 2023

Reproducibility assessment (variable, result, and LLM response for each):
Research Type: Experimental
LLM Response: In this study, we evaluate approaches for computing prequential description lengths for image classification datasets with neural networks. Considering the computational cost, we find that online-learning with rehearsal has favorable performance compared to the previously widely used block-wise estimation.

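For readers unfamiliar with the term, the prequential (predictive sequential) description length is the cumulative log-loss a model incurs on each example before that example is added to its training data. A minimal formulation, with notation assumed here rather than copied from the paper:

    L_preq(y_{1:n} | x_{1:n}) = - \sum_{t=1}^{n} \log p_{\theta_{t-1}}(y_t \mid x_t)

where \theta_{t-1} denotes the parameters obtained after training on the first t-1 labeled examples. The block-wise estimation mentioned in the abstract approximates this sum by re-training only at a small number of cut points and scoring whole blocks of data with each resulting model.
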
Researcher Affiliation: Industry
LLM Response: Jorg Bornschein (bornschein@deepmind.com), Yazhe Li (yazhe@deepmind.com), Marcus Hutter (mhutter@deepmind.com).

Pseudocode: Yes
LLM Response: Algorithm 1, "Mini-batch Incremental Training with Replay Streams".

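Algorithm 1 itself is not reproduced on this page. Below is a minimal, hypothetical Python sketch of the general pattern the algorithm's name suggests: score each incoming mini-batch with the current model before training on it, then interleave an update on the new batch with updates on batches replayed from previously seen data. All identifiers (stream, model, update_fn, log_loss_fn, num_replay_streams) are illustrative, not the paper's API; the exact replay scheduling is specified only by Algorithm 1 in the paper.

    def prequential_train(stream, model, update_fn, log_loss_fn,
                          num_replay_streams=4, batch_size=32):
        """Sketch: mini-batch incremental training with replay from past data."""
        past = []                                   # all examples seen so far
        cursors = [0] * num_replay_streams          # one read position per replay stream
        total_codelength = 0.0

        for batch in stream:                        # batches arrive in a fixed order
            # 1) Prequential step: score the new batch with the current model
            #    before it influences the parameters.
            total_codelength += log_loss_fn(model, batch)

            # 2) One gradient update on the new batch.
            update_fn(model, batch)

            # 3) Replay: each stream sweeps over the past data at its own position.
            past.extend(batch)
            for i in range(num_replay_streams):
                if past:
                    start = cursors[i] % len(past)
                    replay_batch = past[start:start + batch_size]
                    update_fn(model, replay_batch)
                    cursors[i] += batch_size

        return total_codelength

Consistent with the abstract quoted above, the appeal of such replay-based online learning is favorable description lengths at lower computational cost than repeated block-wise re-training.
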
Open Source Code: No
LLM Response: The paper does not provide an explicit statement or link to its own open-source code.

Open Datasets: Yes
LLM Response: We use MNIST (LeCun et al., 2010), EMNIST, CIFAR-10, CIFAR-100 (Krizhevsky, 2009) and ImageNet (Russakovsky et al., 2015) and randomly shuffle each into a fixed sequence of examples.

Dataset Splits: Yes
LLM Response: At each stage we split the data D_{<s_k} into a 90% training and a 10% calibration set. Conceptually, we could perform post-calibration by first training the network to convergence and then, with all parameters frozen, replacing the output layer softmax(h) with the calibrated output layer softmax(softplus(β)h), where β is a scalar parameter chosen to minimize the loss on calibration data.

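The post-calibration described in that quote amounts to rescaling the frozen network's logits by a single learned scalar, similar in spirit to temperature scaling. Below is a minimal sketch, assuming precomputed logits (the pre-softmax outputs h) and integer labels from the 10% calibration split; the function names and the use of scipy are choices made for this sketch, not the paper's.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def softplus(b):
        return np.log1p(np.exp(b))

    def calibration_nll(beta, logits, labels):
        """Negative log-likelihood of softmax(softplus(beta) * h) on the calibration split."""
        z = softplus(beta) * logits
        z = z - z.max(axis=1, keepdims=True)          # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    def fit_beta(logits, labels):
        """Pick the scalar beta that minimizes the loss on calibration data."""
        result = minimize_scalar(calibration_nll, args=(logits, labels),
                                 bounds=(-10.0, 10.0), method="bounded")
        return result.x

Only β is fit in this step; all network parameters stay frozen, exactly as the quote describes.
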
Hardware Specification: No
LLM Response: The paper does not explicitly describe the hardware used for the experiments, such as GPU or CPU models; it implicitly refers to training on a GPU and in a data center but gives no further detail.

Software Dependencies: No
LLM Response: The paper mentions software such as the AdamW and RMSProp optimizers and RandAugment for data augmentation, but it does not specify version numbers for any software dependencies (e.g., PyTorch 1.9 or Python 3.8).

Experiment Setup: Yes
LLM Response: The hyperparameter intervals depend on the data and are detailed in Appendix B. We sample learning rate, EMA step size, batch size, weight decay; but crucially also number of epochs (or, correspondingly, number of replay streams for MI/RS) and an overall scaling of the model width (number of channels).

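To make the search procedure concrete, here is a hypothetical single draw of the sampled hyperparameters. The ranges below are placeholders chosen for illustration; the actual, data-dependent intervals are given in the paper's Appendix B and are not reproduced here.

    import random

    def sample_hyperparameters(rng=random):
        """One random-search draw; ranges are illustrative only (see Appendix B of the paper)."""
        return {
            "learning_rate": 10 ** rng.uniform(-4, -2),        # log-uniform draw (assumed)
            "ema_step_size": 10 ** rng.uniform(-4, -1),
            "batch_size": rng.choice([32, 64, 128, 256]),
            "weight_decay": 10 ** rng.uniform(-6, -2),
            "epochs_or_replay_streams": rng.choice([1, 2, 4, 8, 16]),
            "width_scale": rng.choice([0.25, 0.5, 1.0, 2.0]),
        }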