Compositional Semantic Parsing with Large Language Models
Authors: Andrew Drozdov, Nathanael Schärli, Ekin Akyürek, Nathan Scales, Xinying Song, Xinyun Chen, Olivier Bousquet, Denny Zhou
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on two realistic benchmarks that, like SCAN, are designed to measure compositional generalization: CFQ (Keysers et al., 2020) and COGS (Kim & Linzen, 2020). On CFQ, our best performing method outperforms previous fully supervised finetuning approaches and achieves a new state-of-the-art accuracy of 95% (averaged across MCD splits), thereby reducing the error rate by about 45% compared to the previous best result while using about 1% of the training data as candidates for exemplars. |
| Researcher Affiliation | Collaboration | Andrew Drozdov (1,2,*), Nathanael Schärli (1,*), Ekin Akyürek (1,3), Nathan Scales (1), Xinying Song (1), Xinyun Chen (1), Olivier Bousquet (1), Denny Zhou (1). 1: Google Research; 2: UMass Amherst CICS; 3: MIT CSAIL. *Equal contribution. |
| Pseudocode | No | The paper describes methods in prose and refers to 'a few lines of Python code' for putting parts together, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states, 'Throughout our work we aim to provide exhaustive details about prompt design and exemplar selection, and we include all the prompts we use in the Appendix,' but it does not provide an explicit statement about releasing source code for the methodology described, nor a link to a code repository. |
| Open Datasets | Yes | We evaluate our approach on two realistic benchmarks that, like SCAN, are designed to measure compositional generalization: CFQ (Keysers et al., 2020) and COGS (Kim & Linzen, 2020). |
| Dataset Splits | Yes | CFQ (Keysers et al., 2020) has three maximum compound divergence splits (MCD1, MCD2, MCD3) for measuring compositional generalization, each with 95743/11968/11968 sentences in their train/validation/test splits. (See the loading sketch after the table.) |
| Hardware Specification | No | The paper mentions using 'code-davinci-002 hosted by OpenAI' for experiments, but it does not provide specific hardware details such as GPU models, CPU specifications, or memory. |
| Software Dependencies | No | The paper mentions using 'code-davinci-002' and 'Python code' but does not specify any software libraries or dependencies with version numbers. |
| Experiment Setup | Yes | Hyperparameters are summarized in Appendix D.3. ... Temperature = 0.7 when sampling, or 0.0 when greedy; Top-P = 1.0; Presence Penalty = 0; Frequency Penalty = 0. ... Number of Static Exemplars = 12 for CFQ, 28 for COGS; Number of Dynamic Exemplars = 4-35 for CFQ and 1-3 for COGS (determined automatically from the decomposition tree); Number of Exemplar Lists = 1; Number of Generations per List = 1; Generation Mode = Greedy. (See the API sketch after the table.) |
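
To make the Dataset Splits row concrete, here is a minimal sketch of loading one CFQ MCD split. It assumes the `cfq` dataset registered in TensorFlow Datasets (with `mcd1`/`mcd2`/`mcd3` configs), `train`/`validation`/`test` split names, and `question`/`query` feature keys; these names are assumptions to verify against the TFDS catalog, not details taken from the paper.

```python
# Minimal sketch: load a CFQ MCD split via TensorFlow Datasets.
# Assumed: the `cfq` TFDS dataset with `mcd1`/`mcd2`/`mcd3` configs,
# `train`/`validation`/`test` split names, and `question`/`query` features.
import tensorflow_datasets as tfds

train, validation, test = tfds.load(
    "cfq/mcd1", split=["train", "validation", "test"]
)

# Expected sizes per the paper: 95743 / 11968 / 11968 examples.
for name, ds in [("train", train), ("validation", validation), ("test", test)]:
    print(name, ds.cardinality().numpy())

# Each example pairs a natural-language question with its SPARQL query.
for example in train.take(1):
    print(example["question"].numpy().decode())
    print(example["query"].numpy().decode())
```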
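Similarly, for the Experiment Setup row, this sketch expresses the reported decoding settings as a completion request to `code-davinci-002`, assuming the legacy (pre-1.0) `openai` Python client through which the model was served. The prompt argument and the `max_tokens` budget are placeholders, not values from the paper.

```python
# Minimal sketch: the reported decoding hyperparameters as a completion
# request to code-davinci-002 via the legacy (pre-1.0) openai client.
# The prompt and max_tokens values are placeholders, not from the paper.
import openai

def predict_parse(prompt: str, greedy: bool = True) -> str:
    response = openai.Completion.create(
        model="code-davinci-002",
        prompt=prompt,
        temperature=0.0 if greedy else 0.7,  # greedy decoding vs. sampling
        top_p=1.0,
        presence_penalty=0.0,
        frequency_penalty=0.0,
        n=1,             # one generation per exemplar list
        max_tokens=512,  # placeholder output budget
    )
    return response["choices"][0]["text"]
```

Under greedy decoding, the mode reported for the main results, temperature 0.0 makes the completion deterministic up to service-side nondeterminism.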