Compositional Semantic Parsing with Large Language Models
Authors: Andrew Drozdov, Nathanael Schärli, Ekin Akyürek, Nathan Scales, Xinying Song, Xinyun Chen, Olivier Bousquet, Denny Zhou
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on two realistic benchmarks that, like SCAN, are designed to measure compositional generalization: CFQ (Keysers et al., 2020) and COGS (Kim & Linzen, 2020). On CFQ, our best performing method outperforms previous fully supervised finetuning approaches and achieves a new state-of-the-art accuracy of 95% (averaged across MCD splits), thereby reducing the error rate by about 45% compared to the previous best result while using about 1% of the training data as candidates for exemplars. |
| Researcher Affiliation | Collaboration | Andrew Drozdov (1,2,*), Nathanael Schärli (1,*), Ekin Akyürek (1,3), Nathan Scales (1), Xinying Song (1), Xinyun Chen (1), Olivier Bousquet (1), Denny Zhou (1). 1: Google Research; 2: UMass Amherst CICS; 3: MIT CSAIL. *Equal contribution. |
| Pseudocode | No | The paper describes methods in prose and refers to 'a few lines of Python code' for putting parts together, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states, 'Throughout our work we aim to provide exhaustive details about prompt design and exemplar selection, and we include all the prompts we use in the Appendix,' but it does not provide an explicit statement about releasing source code for the methodology described, nor a link to a code repository. |
| Open Datasets | Yes | We evaluate our approach on two realistic benchmarks that, like SCAN, are designed to measure compositional generalization: CFQ (Keysers et al., 2020) and COGS (Kim & Linzen, 2020). |
| Dataset Splits | Yes | CFQ (Keysers et al., 2020) has three maximum compound divergence splits (MCD1, MCD2, MCD3) for measuring compositional generalization, each with 95743/11968/11968 sentences in their train/validation/test splits. (See the loading sketch after the table.) |
| Hardware Specification | No | The paper mentions using 'code-davinci-002 hosted by OpenAI' for experiments, but it does not provide specific hardware details such as GPU models, CPU specifications, or memory. |
| Software Dependencies | No | The paper mentions using 'code-davinci-002' and 'Python code' but does not specify any software libraries or dependencies with version numbers. |
| Experiment Setup | Yes | Hyperparameters are summarized in Appendix D.3. ... Temperature = 0.7 when sampling, or 0.0 when greedy; Top-P = 1.0; Presence Penalty = 0; Frequency Penalty = 0. ... Number of Static Exemplars = 12 for CFQ, 28 for COGS; Number of Dynamic Exemplars = 4-35 for CFQ and 1-3 for COGS (determined automatically from the decomposition tree); Number of Exemplar Lists = 1; Number of Generations per List = 1; Generation Mode = Greedy. (See the API sketch after the table.) |
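
To make the Dataset Splits row concrete, here is a minimal sketch of loading one CFQ MCD split. It assumes the `cfq` dataset registered in TensorFlow Datasets (with `mcd1`/`mcd2`/`mcd3` configs), `train`/`validation`/`test` split names, and `question`/`query` feature keys; these names are assumptions to verify against the TFDS catalog, not details taken from the paper.

```python
# Minimal sketch: load a CFQ MCD split via TensorFlow Datasets.
# Assumed: the `cfq` TFDS dataset with `mcd1`/`mcd2`/`mcd3` configs,
# `train`/`validation`/`test` split names, and `question`/`query` features.
import tensorflow_datasets as tfds

train, validation, test = tfds.load(
    "cfq/mcd1", split=["train", "validation", "test"]
)

# Expected sizes per the paper: 95743 / 11968 / 11968 examples.
for name, ds in [("train", train), ("validation", validation), ("test", test)]:
    print(name, ds.cardinality().numpy())

# Each example pairs a natural-language question with its SPARQL query.
for example in train.take(1):
    print(example["question"].numpy().decode())
    print(example["query"].numpy().decode())
```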
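Similarly, for the Experiment Setup row, this sketch expresses the reported decoding settings as a completion request to `code-davinci-002`, assuming the legacy (pre-1.0) `openai` Python client through which the model was served. The prompt argument and the `max_tokens` budget are placeholders, not values from the paper.

```python
# Minimal sketch: the reported decoding hyperparameters as a completion
# request to code-davinci-002 via the legacy (pre-1.0) openai client.
# The prompt and max_tokens values are placeholders, not from the paper.
import openai

def predict_parse(prompt: str, greedy: bool = True) -> str:
    response = openai.Completion.create(
        model="code-davinci-002",
        prompt=prompt,
        temperature=0.0 if greedy else 0.7,  # greedy decoding vs. sampling
        top_p=1.0,
        presence_penalty=0.0,
        frequency_penalty=0.0,
        n=1,             # one generation per exemplar list
        max_tokens=512,  # placeholder output budget
    )
    return response["choices"][0]["text"]
```

Under greedy decoding, the mode reported for the main results, temperature 0.0 makes the completion deterministic up to service-side nondeterminism.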