Measuring Compositional Generalization: A Comprehensive Method on Realistic Data
Authors: Daniel Keysers, Nathanael Schärli, Nathan Scales, Hylke Buisman, Daniel Furrer, Sergii Kashubin, Nikola Momchev, Danila Sinopalnikov, Lukasz Stafiniak, Tibor Tihon, Dmitry Tsarkov, Xiao Wang, Marc van Zee, Olivier Bousquet
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a large and realistic natural language question answering dataset that is constructed according to this method, and we use it to analyze the compositional generalization ability of three machine learning architectures. We find that they fail to generalize compositionally and that there is a surprisingly strong negative correlation between compound divergence and accuracy. |
| Researcher Affiliation | Industry | Daniel Keysers, Nathanael Schärli, Nathan Scales, Hylke Buisman, Daniel Furrer, Sergii Kashubin, Nikola Momchev, Danila Sinopalnikov, Lukasz Stafiniak, Tibor Tihon, Dmitry Tsarkov, Xiao Wang, Marc van Zee & Olivier Bousquet Google Research, Brain Team {keysers,schaerli,nkscales,hylke,danielfurrer,sergik,nikola,sinopalnikov, lukstafi,ttihon,tsar,wangxiao,marcvanzee,obousquet}@google.com |
| Pseudocode | No | The paper describes the generation algorithm in detail in Section K but does not present it as formal pseudocode or a clearly labeled algorithm block. |
| Open Source Code | Yes | We present the Compositional Freebase Questions (CFQ)1, a simple yet realistic and large NLU dataset that is specifically designed to measure compositional generalization using the DBCA method, and we describe how to construct such a dataset (Section 3). 1Available at https://github.com/google-research/google-research/tree/master/cfq |
| Open Datasets | Yes | We present the Compositional Freebase Questions (CFQ)1, a simple yet realistic and large NLU dataset that is specifically designed to measure compositional generalization using the DBCA method, and we describe how to construct such a dataset (Section 3). 1Available at https://github.com/google-research/google-research/tree/master/cfq and CFQ contains 239,357 English question-answer pairs that are answerable using the public Freebase data. |
| Dataset Splits | Yes | All of these experiments are based on the same train and validation/test sizes of 40% and 10% of the whole set, respectively. For CFQ, this corresponds to about 96k train and 12k validation and test examples, whereas for SCAN, it corresponds to about 8k train and 1k validation and test examples. |
| Hardware Specification | No | The paper mentions running experiments using the Tensor2Tensor framework, but it does not specify any particular hardware details such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper states that experiments were run using the 'tensor2tensor framework' and that its implementation is on 'tensorflow/tensor2tensor', but it does not specify version numbers for these software dependencies. |
| Experiment Setup | Yes | We tune the hyperparameters using a CFQ random split, and we keep the hyperparameters fixed for both CFQ and SCAN (listed in Appendix E). In particular the number of training steps is kept constant to remove this factor of variation. We train a fresh model for each experiment, and we replicate each experiment 5 times and report the resulting mean accuracy with 95% confidence intervals. and The hyperparameters used are summarized in Table 6. |
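The quoted evidence for Research Type mentions a "strong negative correlation between compound divergence and accuracy." In the paper's DBCA method, divergence between a train and a test distribution is measured via the Chernoff coefficient C_alpha(P‖Q) = Σ_k p_k^α q_k^(1−α), with α = 0.1 for compounds and α = 0.5 for atoms. The sketch below illustrates that computation on made-up toy distributions; the "compounds" here are invented strings, not the paper's actual rule compounds.

```python
from collections import Counter

def chernoff_coefficient(p, q, alpha):
    """Chernoff coefficient C_alpha(P||Q) = sum_k p_k^alpha * q_k^(1-alpha)."""
    keys = set(p) | set(q)
    # 0.0 ** positive_alpha is 0.0 in Python, so missing keys contribute nothing.
    return sum(p.get(k, 0.0) ** alpha * q.get(k, 0.0) ** (1 - alpha) for k in keys)

def divergence(p, q, alpha):
    """Divergence as defined in the DBCA method: 1 - C_alpha(P||Q)."""
    return 1.0 - chernoff_coefficient(p, q, alpha)

def frequency_distribution(items):
    """Normalize raw occurrence counts into a probability distribution."""
    counts = Counter(items)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

# Toy "compound" occurrences in a hypothetical train and test split.
train = frequency_distribution(["A B", "A B", "A C", "B C"])
test = frequency_distribution(["A B", "B C", "B C", "C D"])

# The paper uses alpha = 0.1 for compound divergence, 0.5 for atom divergence.
print(round(divergence(train, test, alpha=0.1), 3))
print(round(divergence(train, test, alpha=0.5), 3))
```

The small α for compounds makes the measure forgiving of frequency differences as long as a compound occurs at all in both splits, which is what lets the method target occurrence-based (rather than frequency-based) compositional gaps.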
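The Experiment Setup row states that each experiment is replicated 5 times and reported as mean accuracy with a 95% confidence interval. The paper does not spell out the exact CI formula, so the sketch below assumes a standard t-distribution interval; the accuracy values are made up for illustration.

```python
import statistics

def mean_with_ci95(values):
    """Mean and 95% CI half-width via the t-distribution.

    The critical value is hard-coded for n = 5 replicates (4 degrees
    of freedom), matching the replication count described in the paper.
    """
    n = len(values)
    assert n == 5, "critical value below is hard-coded for 5 replicates"
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / n ** 0.5  # standard error of the mean
    t_crit = 2.776  # two-sided 95% critical value, 4 d.o.f.
    return mean, t_crit * sem

# Hypothetical accuracies from 5 replicate runs of one experiment.
accuracies = [0.42, 0.40, 0.45, 0.41, 0.43]
mean, half_width = mean_with_ci95(accuracies)
print(f"accuracy = {mean:.3f} +/- {half_width:.3f}")
```

With only 5 replicates, using the t-distribution rather than the normal approximation widens the interval noticeably (2.776 vs. 1.96 standard errors), which is the conservative choice for such small sample sizes.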