Model Performance Scaling with Multiple Data Sources

Authors: Tatsunori Hashimoto

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we show that the rational function approximation is a promising approach, and that the resulting predictions are accurate and hold under extrapolation. On the Amazon review prediction dataset (Mansour et al., 2009), we can learn to predict model performance nearly perfectly (r² = 0.96) from a small dataset of 1200 examples across 3 sources and extrapolate to predict the model error on datasets of up to 4000 examples. We show this high accuracy continues to hold on a real-world task-oriented dialogue system (r² = 0.89), a multi-domain machine translation system (r² = 0.83), and boolean question answering with weak supervision (r² = 0.85). [See the r² sketch below the table.]
Researcher Affiliation | Industry | Work done while author was at Microsoft Semantic Machines. Correspondence to: Tatsunori Hashimoto <vhashimotot@microsoft.com>.
Pseudocode | No | The paper describes the steps of its method but does not provide structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the methodology described, nor does it include a link to a code repository.
Open Datasets | Yes | On the Amazon review prediction dataset (Mansour et al., 2009), we can learn to predict model performance nearly perfectly (r² = 0.96)... We perform this analysis on a real world task-oriented dialogue system that the SMCalFlow dataset and model (Andreas et al., 2020) is based on... Our task is the standard multi-domain machine translation dataset from Koehn & Knowles (2017)... The target task is the BoolQ question answering dataset, and we train this model using a combination of 4 data sources: the MNLI entailment task (Williams et al. (2018)...), STS sentence similarity judgment task (Cer et al. (2017)...), MRPC paraphrasing task (Dolan et al. (2004)...), and the BoolQ training set (Clark et al. (2019)...). [See the dataset-loading sketch below the table.]
Dataset Splits | No | The paper describes training and testing sets, and implies some internal validation when fitting V(n, q) (e.g., tuning the Adagrad learning rate via 'goodness-of-fit on a held out set'), but does not provide explicit, reproducible train/validation/test splits (e.g., specific percentages or sample counts for all three partitions).
Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU/CPU models, memory, or specific cloud computing instances.
Software Dependencies | No | The paper mentions software such as sacrebleu and the Jiant package but does not provide version numbers for any software dependencies.
Experiment Setup | Yes | We fit V(n, q) with 4 terms for C(q) by minimizing the squared loss with respect to log-error on models containing 0-1200 examples total with ϵ = 0... We fit this using the Adagrad (Duchi et al., 2010) optimizer with 20000 steps and learning rate set over the interval [0.005, 0.5] via goodness-of-fit on a held out set. We re-parametrize the weights λ by log-transforming them for numerical stability, and initialize it with a Xavier initialization. This prevents degeneracies near λ = 0 and we empirically found the optimization process to be stable over the cross-validation range we used. We fixed the number of factors in the rational approximation (M) to one greater than the number of data sources to reduce the number of hyperparameters to tune. We found ϵ = 0 to work well on the regression and classification datasets, and we use this value throughout. [See the fitting sketch below the table.]
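
The r² values quoted in the Research Type row are coefficients of determination between the predicted and observed model errors. A minimal way to compute them (the standard definition, not code from the paper):

    import numpy as np

    def r_squared(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """Coefficient of determination between observed and predicted values."""
        ss_res = float(np.sum((y_true - y_pred) ** 2))          # residual sum of squares
        ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total sum of squares
        return 1.0 - ss_res / ss_tot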
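
The four sources named under Open Datasets for the BoolQ experiment are all publicly available. A hypothetical loading snippet using the Hugging Face datasets library; the paper does not say how the data were obtained, and the hub identifiers below are assumptions:

    from datasets import load_dataset

    # Hub identifiers are assumptions; the paper names the datasets, not a loader.
    boolq = load_dataset("boolq")          # target task (Clark et al., 2019)
    mnli = load_dataset("glue", "mnli")    # entailment source (Williams et al., 2018)
    stsb = load_dataset("glue", "stsb")    # sentence similarity source (Cer et al., 2017)
    mrpc = load_dataset("glue", "mrpc")    # paraphrasing source (Dolan et al., 2004)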
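
The Experiment Setup row pins down most of the optimization recipe: Adagrad, 20000 steps, a learning rate tuned in [0.005, 0.5] on a held-out set, log-transformed weights λ with Xavier initialization, M set to one more than the number of data sources, and a squared loss on log-error. Below is a minimal PyTorch sketch of that recipe; the exact functional form is an assumption here, taken as V(n, q) = Σ_m λ_m / (ϵ + n · C_m(q)) with C_m(q) a positive linear function of the source proportions q:

    import torch
    import torch.nn.functional as F

    def fit_rational_approximation(n, q, err, eps=0.0, lr=0.05, steps=20000):
        """Fit V(n, q) = sum_m lambda_m / (eps + n * C_m(q)) to observed errors.

        n:   (batch,) total example counts (assumed > 0 when eps = 0)
        q:   (batch, num_sources) per-source mixture proportions
        err: (batch,) observed model errors (assumed positive)
        """
        num_sources = q.shape[1]
        M = num_sources + 1  # one more factor than data sources, per the paper

        # Log-transformed weights lambda for numerical stability, with Xavier
        # initialization, both per the paper (xavier_uniform_ needs a 2-D tensor).
        log_lam = torch.nn.init.xavier_uniform_(torch.empty(1, M)).requires_grad_()
        # Assumed parametrization of C(q): kept positive via a softplus.
        C = torch.randn(M, num_sources, requires_grad=True)

        opt = torch.optim.Adagrad([log_lam, C], lr=lr)  # lr tuned on a held-out set in the paper
        for _ in range(steps):
            opt.zero_grad()
            rate = F.softplus(q @ C.T)                          # (batch, M), C_m(q) > 0
            pred = (log_lam.exp() / (eps + n.unsqueeze(1) * rate)).sum(dim=1)
            loss = ((pred.log() - err.log()) ** 2).mean()       # squared loss on log-error
            loss.backward()
            opt.step()
        return log_lam, C

Under this sketch, the extrapolation reported above (e.g., to 4000-example datasets) amounts to evaluating the fitted V(n, q) at larger n than was seen during fitting.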