Zero and Few-shot Semantic Parsing with Ambiguous Inputs

Authors: Elias Stengel-Eskin, Kyle Rawlins, Benjamin Van Durme

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Using AMP, we investigate how several few-shot text-to-code systems handle ambiguity, introducing three new metrics. We find that large pre-trained models perform poorly at capturing the distribution of possible meanings without deliberate instruction." See also Section 3 (Experiment 1: Zero-Shot Parsing) and Section 4 (Experiment 2: Few-Shot Parsing).
Researcher Affiliation | Academia | Elias Stengel-Eskin (UNC Chapel Hill), Kyle Rawlins (Johns Hopkins University), Benjamin Van Durme (Johns Hopkins University)
Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | Data and code: https://github.com/esteng/ambiguous_parsing (contact: esteng@cs.unc.edu). From the paper: "In order to further reproducibility, we release our dataset and our code. This includes the code for AMP, which can be extended to generate new ambiguities, as well as the code for running all experiments."
Open Datasets | Yes | "We use our framework to create a benchmark dataset we call AMP (Ambiguous Parsing)." and "In order to further reproducibility, we release our dataset and our code. This includes the code for AMP..." Data and code: https://github.com/esteng/ambiguous_parsing.
Dataset Splits | Yes | "We annotate a subset of our validation examples with human interpretations and confidence scores." and "For each ambiguity type except conjunctions, we randomly select 20 examples from our development splits." (A minimal sampling sketch appears after the table.)
Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, memory, or cloud instance types) are given for the experiments. The paper notes only that "All models above 350M were run at fp16 precision", which is a precision setting, not a hardware specification. (See the fp16 loading sketch after the table.)
Software Dependencies | No | The paper names models (CodeGen, Llama-13B, Vicuna-13B, gpt-3.5-turbo) and the BenchCLAMP framework, but provides no version numbers for the programming language, libraries, or frameworks needed for replication.
Experiment Setup | Yes | "Each prompt contains 10 input-LF pairs, and a different prompt is constructed for each test sentence." The number of LF0 sentences in the prompt is varied from 0 to 10 in increments of 1 (i.e., 0% to 100% LF0), and the prompt sentences are shuffled to ensure there is no positional bias. "We decode with beam search, using a beam of 5." (A prompt-construction sketch appears after the table.)
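
The sampling procedure quoted under Dataset Splits is concrete enough to sketch. Below is a minimal sketch, assuming the AMP development split is a list of dicts carrying an `ambiguity_type` field; the field name, function name, and data layout are assumptions, not taken from the released code:

```python
import random

def sample_annotation_subset(dev_examples, n_per_type=20, seed=0):
    """Randomly select n examples per ambiguity type, skipping conjunctions.

    `dev_examples` is assumed to be a list of dicts with an
    'ambiguity_type' key; this structure is hypothetical.
    """
    rng = random.Random(seed)
    by_type = {}
    for ex in dev_examples:
        by_type.setdefault(ex["ambiguity_type"], []).append(ex)
    subset = []
    for amb_type, examples in by_type.items():
        if amb_type == "conjunction":  # conjunctions are excluded in the paper
            continue
        subset.extend(rng.sample(examples, min(n_per_type, len(examples))))
    return subset
```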
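
For the fp16 note under Hardware Specification, the quoted precision setting corresponds to loading models at half precision. This is roughly how that is done with Hugging Face transformers; the checkpoint name and `device_map` setting are illustrative, since the paper does not show its loading code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint name is illustrative; the paper evaluates CodeGen, Llama-13B,
# and Vicuna-13B among others, but does not show its loading code.
checkpoint = "Salesforce/codegen-2B-mono"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.float16,  # "All models above 350M were run at fp16 precision"
    device_map="auto",          # place layers on available GPU(s); an assumption
)
```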
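
The Experiment Setup row describes the prompt-mixing procedure fully enough to sketch. A minimal sketch, assuming the LF0 and LF1 example pools are lists of (input, logical form) string pairs; the pair format and the `input:`/`LF:` template are assumptions, not from the paper:

```python
import random

def build_prompt(lf0_pairs, lf1_pairs, n_lf0, test_sentence, seed=0):
    """Build a 10-shot prompt with n_lf0 LF0 examples and 10 - n_lf0 LF1 examples.

    n_lf0 ranges over 0..10, i.e. 0% to 100% LF0. The (input, LF) pair
    format and the 'input:/LF:' template are assumptions.
    """
    assert 0 <= n_lf0 <= 10
    rng = random.Random(seed)
    shots = rng.sample(lf0_pairs, n_lf0) + rng.sample(lf1_pairs, 10 - n_lf0)
    rng.shuffle(shots)  # shuffle to ensure there is no positional bias
    lines = [f"input: {text}\nLF: {lf}" for text, lf in shots]
    lines.append(f"input: {test_sentence}\nLF:")
    return "\n\n".join(lines)
```

Decoding would then use beam search with a beam of 5, e.g. `model.generate(**inputs, num_beams=5)` with a model loaded as in the previous sketch.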