Task Ambiguity in Humans and Language Models

Authors: Alex Tamkin, Kunal Handa, Avash Shrestha, Noah Goodman

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate humans and models on AmbiBench by seeing how well they identify the intended task using 1) instructions with varying degrees of ambiguity, and 2) different numbers of labeled examples. We find that the combination of model scaling (to 175B parameters) and training with human feedback data enables models to approach or exceed the accuracy of human participants across tasks, but that either one alone is not sufficient.
Researcher Affiliation | Academia | Alex Tamkin, Kunal Handa, Avash Shrestha, Noah Goodman (Stanford University)
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | We also release our codebase, including the benchmark data, at: https://github.com/kunhanda/task_ambiguity
Open Datasets | Yes | We also release our codebase, including the benchmark data, at: https://github.com/kunhanda/task_ambiguity
Dataset Splits | Yes | We partition the six AmbiBench tasks into three folds, each containing four finetuning tasks and two evaluation tasks (following the feature pairs in Table 1). We finetune on 68 examples from each task (two for each number of examples, from 4 to 20), and evaluate on 240 examples randomly drawn from the other two tasks.
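The fold construction described above can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: the six tasks form three feature pairs, and each fold fine-tunes on four tasks while evaluating on the held-out pair. The task names below are placeholders, not the paper's actual identifiers.

```python
# Placeholder names for the three feature pairs (six AmbiBench tasks);
# the real pair assignments are given in Table 1 of the paper.
FEATURE_PAIRS = [
    ("pair1_task_a", "pair1_task_b"),
    ("pair2_task_a", "pair2_task_b"),
    ("pair3_task_a", "pair3_task_b"),
]

def make_folds(pairs):
    """Hold out one feature pair per fold for evaluation;
    fine-tune on the four tasks from the remaining pairs."""
    folds = []
    for i, eval_pair in enumerate(pairs):
        finetune_tasks = [task
                          for j, pair in enumerate(pairs) if j != i
                          for task in pair]
        folds.append({"finetune": finetune_tasks,
                      "evaluate": list(eval_pair)})
    return folds

folds = make_folds(FEATURE_PAIRS)
```

Each of the three resulting folds contains four fine-tuning tasks and two evaluation tasks, with no overlap between the two sets.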
Hardware Specification | No | The paper mentions various language models by name and parameter count (e.g., 'davinci 175B', 'J1-Jumbo 178B') and the 'OpenAI API', but does not specify the underlying hardware (e.g., specific GPU models, CPU types, or cloud instances) used for the experiments.
Software Dependencies | No | The paper mentions using the 'OpenAI API' for finetuning, but does not provide specific version numbers for any software libraries, frameworks, or programming languages used.
Experiment Setup | Yes | Hyperparameters: We use OpenAI's finetuning API to finetune their davinci model. When conducting the finetuning, we used a batch size of 1, a learning rate multiplier of 0.1, and a prompt loss weight of 0.1.
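The reported fine-tuning configuration can be collected into a single request sketch. The hyperparameter values come from the row above; the request shape loosely follows the legacy (pre-v1) OpenAI fine-tuning interface and is an assumption, since the paper does not show the actual call.

```python
# Hyperparameters reported in the paper for finetuning davinci.
FINETUNE_PARAMS = {
    "model": "davinci",             # base model named in the paper
    "batch_size": 1,
    "learning_rate_multiplier": 0.1,
    "prompt_loss_weight": 0.1,
}

def build_finetune_request(training_file_id, params=FINETUNE_PARAMS):
    """Assemble keyword arguments for a legacy-style fine-tune request.

    The `training_file` field name mirrors the old OpenAI fine-tuning
    endpoint; this helper is illustrative only.
    """
    return {"training_file": training_file_id, **params}
```

A caller would pass the ID of an uploaded training file, e.g. `build_finetune_request("file-xyz")`, and forward the resulting dict to the fine-tuning endpoint.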