On Masked Pre-training and the Marginal Likelihood

Authors: Pablo Moreno-Muñoz, Pol Garcia Recasens, Søren Hauberg

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we confirm the developed theory and explore the main learning principles of masked pre-training in large language models. ... The results in Fig. 1 and Tab. 1 indicate that as long as we average over more random masking patterns, the cumulative MPT loss approximates the LML of the model very well. ... We provide the code and details for every figure in the public repository at https://github.com/pmorenoz/MPT-LML/. (A toy sketch of this masking-average idea is given after the table.)
Researcher Affiliation | Collaboration | Pablo Moreno-Muñoz, Section for Cognitive Systems, Technical University of Denmark (DTU), pabmo@dtu.dk; Pol G. Recasens, CROMAI, Barcelona Supercomputing Center and Universitat Politècnica de Catalunya (UPC), pol.garcia@bsc.es; Søren Hauberg, Section for Cognitive Systems, Technical University of Denmark (DTU), sohau@dtu.dk
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | All the empirical studies and results are reproducible. We provide the code and details for every figure in the public repository at https://github.com/pmorenoz/MPT-LML/.
Open Datasets | Yes | For the mini-dataset with MNIST samples... Data consist of subsets of MNIST and FMNIST. ... For the study of the curves in LLMs, we used four datasets from the General Language Understanding Evaluation (GLUE) (Wang et al., 2019). ... samples from three different test image datasets (FASHION-MNIST, CIFAR-100 and TINY-IMAGENET).
Dataset Splits | No | The paper mentions using well-known datasets like MNIST, FMNIST, GLUE datasets, CIFAR-100, and TINY-IMAGENET, but it does not specify the exact percentages or counts for training, validation, or test splits, nor does it refer to a predefined split with a citation that includes authors and year.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments. It does not mention any cloud or cluster specifications with hardware details.
Software Dependencies | No | The paper mentions models like BERT and VIT-MAE and implies the use of libraries like HuggingFace for pre-trained checkpoints, but it does not specify version numbers for any software dependencies, programming languages, or libraries used for the experiments.
Experiment Setup | No | The paper does not provide specific experimental setup details such as hyperparameter values (e.g., learning rate, batch size, number of epochs) or explicit optimizer settings within the main text. It mentions using 'standard MPT' or 'pre-trained checkpoints' but not their specific configurations during the experiments described.
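The Research Type row quotes the paper's central empirical claim: averaging the cumulative masked pre-training (MPT) loss over random masking patterns approximates the model's log marginal likelihood (LML). The sketch below is not the authors' code (their experiments live at https://github.com/pmorenoz/MPT-LML/); it is a minimal, self-contained toy illustration of the underlying chain-rule identity on a small tabular joint distribution. For any order in which masked variables are revealed one at a time, the cumulative conditional log-probabilities sum to log p(x); a trained MPT model only approximates these conditionals, which is consistent with the quoted observation that averaging over more masking patterns brings the cumulative loss closer to the LML. All names here (e.g. conditional_logprob) are hypothetical.

```python
# Toy sketch (not the authors' code): cumulative masked log-predictions,
# averaged over random reveal orders, recover log p(x) for an exact model.
import numpy as np

rng = np.random.default_rng(0)

# Toy joint distribution over D binary variables, stored as an explicit table.
D = 4
logits = rng.normal(size=(2,) * D)
joint = np.exp(logits) / np.exp(logits).sum()      # p(x_1, ..., x_D)

def conditional_logprob(x, target, observed):
    """log p(x[target] | x[observed]), by marginalising the joint table."""
    idx = [slice(None)] * D
    for j in observed:
        idx[j] = x[j]                               # clamp observed variables
    table = joint[tuple(idx)]                       # joint restricted to observed values
    free = [j for j in range(D) if j not in observed]
    axes = tuple(k for k, j in enumerate(free) if j != target)
    marg = table.sum(axis=axes)                     # p(x_target, x_observed), shape (2,)
    return np.log(marg[x[target]] / marg.sum())

x = (1, 0, 1, 1)                                    # a single data point
exact_lml = np.log(joint[x])                        # exact log p(x)

# Cumulative masked log-prediction, averaged over random reveal orders
# (the MPT loss would be the negative of this sum).
S = 200
estimates = []
for _ in range(S):
    order = rng.permutation(D)
    total, observed = 0.0, []
    for target in order:                            # reveal one masked variable at a time
        total += conditional_logprob(x, target, observed)
        observed.append(target)
    estimates.append(total)

print(f"exact log p(x)                 : {exact_lml:.4f}")
print(f"avg cumulative log-prediction  : {np.mean(estimates):.4f}")
# With exact conditionals every ordering matches log p(x) by the chain rule;
# with a learned model the orderings disagree, and averaging over many random
# masking patterns is what makes the cumulative loss approach the LML.
```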