Towards Inferential Reproducibility of Machine Learning Research

Authors: Michael Hagmann, Philipp Meier, Stefan Riezler

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We exemplify the methods introduced above on an NLP example from the paperswithcode.com open resource, namely the BART+R3F fine-tuning algorithm presented by Aghajanyan et al. (2021) for the task of text summarization, evaluated on the CNN/Daily Mail (Hermann et al., 2015) and Reddit TIFU (Kim et al., 2019) datasets."
Researcher Affiliation | Academia | "Michael Hagmann¹, Philipp Meier¹, Stefan Riezler¹,²; ¹Computational Linguistics and ²IWR, Heidelberg University, Germany; {hagmann,meier,riezler}@cl.uni-heidelberg.de"
Pseudocode | No | "The general form of an LMEM is Y = Xβ + Zb + ϵ (1), where X is an (N × k)-matrix and Z is an (N × m)-matrix, called model- or design-matrices, which relate the unobserved vectors β and b to Y. β is a k-vector of fixed effects and b is an m-dimensional random vector called the random effects vector. ϵ is an N-dimensional vector called the error component. The random vectors are assumed to have the following distributions: b ∼ N(0, ψ_θ), ϵ ∼ N(0, Λ_θ) (2)." (A fitting sketch for this model follows the table.)
Open Source Code | Yes | "Code (R and Python) for the toolkit and sample applications are publicly available."
Open Datasets | Yes | "We exemplify the methods introduced above on an NLP example from the paperswithcode.com open resource, namely the BART+R3F fine-tuning algorithm presented by Aghajanyan et al. (2021) for the task of text summarization, evaluated on the CNN/Daily Mail (Hermann et al., 2015) and Reddit TIFU (Kim et al., 2019) datasets."
Dataset Splits | No | "The paper gives detailed meta-parameter settings for the text summarization experiments, but reports final results as maxima over training runs started from 10 unknown random seeds. Furthermore, the regularization parameter is specified as a choice of λ ∈ [0.001, 0.01, 0.1], and the noise type as a choice from [U, N]. Using the given settings, we started the BART+R3F code from 5 new random seeds and the BART-large baseline from 18 random seeds on 4 Nvidia Tesla V100 GPUs, each with 32 GB RAM and an update frequency of 8. All models were trained for 20-30 epochs using a loss-based stopping criterion."
Hardware Specification | Yes | "Using the given settings, we started the BART+R3F code from 5 new random seeds and the BART-large baseline from 18 random seeds on 4 Nvidia Tesla V100 GPUs, each with 32 GB RAM and an update frequency of 8."
Software Dependencies | No | "Code (R and Python) for the toolkit and sample applications are publicly available."
Experiment Setup | Yes | "The paper gives detailed meta-parameter settings for the text summarization experiments, but reports final results as maxima over training runs started from 10 unknown random seeds. Furthermore, the regularization parameter is specified as a choice of λ ∈ [0.001, 0.01, 0.1], and the noise type as a choice from [U, N]. Using the given settings, we started the BART+R3F code from 5 new random seeds and the BART-large baseline from 18 random seeds on 4 Nvidia Tesla V100 GPUs, each with 32 GB RAM and an update frequency of 8. All models were trained for 20-30 epochs using a loss-based stopping criterion." (See the configuration-grid sketch after the table.)
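
The Pseudocode row above quotes the paper's linear mixed effects model (LMEM), Y = Xβ + Zb + ϵ, which separates fixed effects (e.g., the system under comparison) from random effects (e.g., random seeds). Below is a minimal sketch of how such a model can be fit in Python, assuming a hypothetical results table with columns score, system, and seed; these column names and the file results.csv are illustrative and not part of the released toolkit.

    # Fit an LMEM: "system" as a fixed effect, a random intercept per seed.
    # Hypothetical input: results.csv with columns score, system, seed.
    import pandas as pd
    import statsmodels.formula.api as smf

    data = pd.read_csv("results.csv")

    # Y = Xβ + Zb + ϵ: the per-seed intercepts play the role of the
    # random-effects vector b ~ N(0, ψ_θ); the residuals are ϵ ~ N(0, Λ_θ).
    model = smf.mixedlm("score ~ system", data, groups=data["seed"])
    result = model.fit()
    print(result.summary())

In R, an analogous lme4 call would be lmer(score ~ system + (1 | seed), data); the paper's exact model specifications should be taken from its released code.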
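
The Dataset Splits and Experiment Setup rows describe the replication's meta-parameter space: λ in [0.001, 0.01, 0.1], noise type in [U, N], and several fresh random seeds. The sketch below only illustrates how such a grid of configurations can be enumerated before the resulting scores are analyzed with an LMEM; the seed values are placeholders, not the seeds actually used.

    # Enumerate the meta-parameter grid described in the table above.
    from itertools import product

    lambdas = [0.001, 0.01, 0.1]
    noise_types = ["U", "N"]   # noise type as quoted: U (uniform) or N (normal)
    seeds = [1, 2, 3, 4, 5]    # placeholder values for the 5 new random seeds

    for lam, noise, seed in product(lambdas, noise_types, seeds):
        config = {"lambda": lam, "noise": noise, "seed": seed}
        print(config)          # in practice: launch one fine-tuning run per config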