Measuring Compositional Generalization: A Comprehensive Method on Realistic Data
Authors: Daniel Keysers, Nathanael Schärli, Nathan Scales, Hylke Buisman, Daniel Furrer, Sergii Kashubin, Nikola Momchev, Danila Sinopalnikov, Lukasz Stafiniak, Tibor Tihon, Dmitry Tsarkov, Xiao Wang, Marc van Zee, Olivier Bousquet
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a large and realistic natural language question answering dataset that is constructed according to this method, and we use it to analyze the compositional generalization ability of three machine learning architectures. We find that they fail to generalize compositionally and that there is a surprisingly strong negative correlation between compound divergence and accuracy. |
| Researcher Affiliation | Industry | Daniel Keysers, Nathanael Schärli, Nathan Scales, Hylke Buisman, Daniel Furrer, Sergii Kashubin, Nikola Momchev, Danila Sinopalnikov, Lukasz Stafiniak, Tibor Tihon, Dmitry Tsarkov, Xiao Wang, Marc van Zee & Olivier Bousquet Google Research, Brain Team {keysers,schaerli,nkscales,hylke,danielfurrer,sergik,nikola,sinopalnikov, lukstafi,ttihon,tsar,wangxiao,marcvanzee,obousquet}@google.com |
| Pseudocode | No | The paper describes the generation algorithm in detail in Section K but does not present it as formal pseudocode or a clearly labeled algorithm block. |
| Open Source Code | Yes | We present the Compositional Freebase Questions (CFQ)1, a simple yet realistic and large NLU dataset that is specifically designed to measure compositional generalization using the DBCA method, and we describe how to construct such a dataset (Section 3). 1Available at https://github.com/google-research/google-research/tree/master/cfq |
| Open Datasets | Yes | We present the Compositional Freebase Questions (CFQ)1, a simple yet realistic and large NLU dataset that is specifically designed to measure compositional generalization using the DBCA method, and we describe how to construct such a dataset (Section 3). 1Available at https://github.com/google-research/google-research/tree/master/cfq and CFQ contains 239,357 English question-answer pairs that are answerable using the public Freebase data. |
| Dataset Splits | Yes | All of these experiments are based on the same train and validation/test sizes of 40% and 10% of the whole set, respectively. For CFQ, this corresponds to about 96k train and 12k validation and test examples, whereas for SCAN, it corresponds to about 8k train and 1k validation and test examples. |
| Hardware Specification | No | The paper mentions running experiments using the Tensor2Tensor framework, but it does not specify any particular hardware details such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper states that experiments were run using the 'tensor2tensor framework' and that its implementation is on 'tensorflow/tensor2tensor', but it does not specify version numbers for these software dependencies. |
| Experiment Setup | Yes | We tune the hyperparameters using a CFQ random split, and we keep the hyperparameters fixed for both CFQ and SCAN (listed in Appendix E). In particular the number of training steps is kept constant to remove this factor of variation. We train a fresh model for each experiment, and we replicate each experiment 5 times and report the resulting mean accuracy with 95% confidence intervals. and The hyperparameters used are summarized in Table 6. |
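The quoted evidence for Research Type mentions a "strong negative correlation between compound divergence and accuracy." In the paper's DBCA method, divergence between a train and a test distribution is measured via the Chernoff coefficient C_alpha(P‖Q) = Σ_k p_k^α q_k^(1−α), with α = 0.1 for compounds and α = 0.5 for atoms. The sketch below illustrates that computation on made-up toy distributions; the "compounds" here are invented strings, not the paper's actual rule compounds.

```python
from collections import Counter

def chernoff_coefficient(p, q, alpha):
    """Chernoff coefficient C_alpha(P||Q) = sum_k p_k^alpha * q_k^(1-alpha)."""
    keys = set(p) | set(q)
    # 0.0 ** positive_alpha is 0.0 in Python, so missing keys contribute nothing.
    return sum(p.get(k, 0.0) ** alpha * q.get(k, 0.0) ** (1 - alpha) for k in keys)

def divergence(p, q, alpha):
    """Divergence as defined in the DBCA method: 1 - C_alpha(P||Q)."""
    return 1.0 - chernoff_coefficient(p, q, alpha)

def frequency_distribution(items):
    """Normalize raw occurrence counts into a probability distribution."""
    counts = Counter(items)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

# Toy "compound" occurrences in a hypothetical train and test split.
train = frequency_distribution(["A B", "A B", "A C", "B C"])
test = frequency_distribution(["A B", "B C", "B C", "C D"])

# The paper uses alpha = 0.1 for compound divergence, 0.5 for atom divergence.
print(round(divergence(train, test, alpha=0.1), 3))
print(round(divergence(train, test, alpha=0.5), 3))
```

The small α for compounds makes the measure forgiving of frequency differences as long as a compound occurs at all in both splits, which is what lets the method target occurrence-based (rather than frequency-based) compositional gaps.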
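The Experiment Setup row states that each experiment is replicated 5 times and reported as mean accuracy with a 95% confidence interval. The paper does not spell out the exact CI formula, so the sketch below assumes a standard t-distribution interval; the accuracy values are made up for illustration.

```python
import statistics

def mean_with_ci95(values):
    """Mean and 95% CI half-width via the t-distribution.

    The critical value is hard-coded for n = 5 replicates (4 degrees
    of freedom), matching the replication count described in the paper.
    """
    n = len(values)
    assert n == 5, "critical value below is hard-coded for 5 replicates"
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / n ** 0.5  # standard error of the mean
    t_crit = 2.776  # two-sided 95% critical value, 4 d.o.f.
    return mean, t_crit * sem

# Hypothetical accuracies from 5 replicate runs of one experiment.
accuracies = [0.42, 0.40, 0.45, 0.41, 0.43]
mean, half_width = mean_with_ci95(accuracies)
print(f"accuracy = {mean:.3f} +/- {half_width:.3f}")
```

With only 5 replicates, using the t-distribution rather than the normal approximation widens the interval noticeably (2.776 vs. 1.96 standard errors), which is the conservative choice for such small sample sizes.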