Task Ambiguity in Humans and Language Models
Authors: Alex Tamkin, Kunal Handa, Avash Shrestha, Noah Goodman
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate humans and models on AmbiBench by seeing how well they identify the intended task using 1) instructions with varying degrees of ambiguity, and 2) different numbers of labeled examples. We find that the combination of model scaling (to 175B parameters) and training with human feedback data enables models to approach or exceed the accuracy of human participants across tasks, but that either one alone is not sufficient. |
| Researcher Affiliation | Academia | Alex Tamkin, Kunal Handa, Avash Shrestha, Noah Goodman (Stanford University) |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | We also release our codebase, including the benchmark data, at: https://github.com/kunhanda/task_ambiguity |
| Open Datasets | Yes | We also release our codebase, including the benchmark data, at: https://github.com/kunhanda/task_ambiguity |
| Dataset Splits | Yes | We partition the six AmbiBench tasks into three folds, each containing four finetuning tasks and two evaluation tasks (following the feature pairs in Table 1). We finetune on 68 examples from each task (two for each number of examples, from 4 to 20), and evaluate on 240 examples randomly drawn from the other two tasks. (A fold-construction sketch follows this table.) |
| Hardware Specification | No | The paper mentions various language models by name and parameter count (e.g., 'davinci 175B', 'J1-Jumbo 178B') and the 'OpenAI API' but does not specify the underlying hardware (e.g., specific GPU models, CPU types, or cloud instances) used for the experiments. |
| Software Dependencies | No | The paper mentions using the 'OpenAI API' for finetuning, but does not provide specific version numbers for any software libraries, frameworks, or programming languages used. |
| Experiment Setup | Yes | Hyperparameters: We use OpenAI's finetuning API to finetune their davinci model. When conducting the finetuning, we used a batch size of 1, a learning rate multiplier of 0.1, and a prompt loss weight of 0.1. (An illustrative API call follows this table.) |
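
To make the Dataset Splits row concrete, the sketch below shows one way to build the three folds: each fold holds out one feature pair (two evaluation tasks) and finetunes on the remaining two pairs (four tasks). The task names are hypothetical placeholders; the actual tasks follow the feature pairs in Table 1 of the paper.

```python
# Minimal sketch of the three-fold partition of the six AmbiBench tasks.
# Task names here are hypothetical stand-ins for the Table 1 feature pairs.
TASK_PAIRS = [
    ("task_a", "task_b"),  # feature pair 1 (hypothetical names)
    ("task_c", "task_d"),  # feature pair 2
    ("task_e", "task_f"),  # feature pair 3
]

def make_folds(task_pairs):
    """For each fold, hold out one feature pair (two evaluation tasks)
    and finetune on the remaining two pairs (four tasks)."""
    folds = []
    for i, eval_pair in enumerate(task_pairs):
        finetune_tasks = [task
                          for j, pair in enumerate(task_pairs) if j != i
                          for task in pair]
        folds.append({"finetune": finetune_tasks, "evaluate": list(eval_pair)})
    return folds

for fold in make_folds(TASK_PAIRS):
    print(fold)  # each fold: 4 finetuning tasks, 2 evaluation tasks
```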
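
The Experiment Setup row reports finetuning davinci through OpenAI's API with a batch size of 1, a learning rate multiplier of 0.1, and a prompt loss weight of 0.1. Below is a minimal sketch of how such a job could have been launched with the legacy openai-python client (< 1.0) and its since-deprecated v1 fine-tunes endpoint; the training file name is a hypothetical placeholder, and the paper does not include its actual launch code.

```python
import os
import openai  # legacy openai-python (< 1.0), which exposed the v1 fine-tunes endpoint

openai.api_key = os.environ["OPENAI_API_KEY"]

# Hypothetical file name; the training data would be the AmbiBench finetuning
# examples serialized as {"prompt": ..., "completion": ...} JSONL records.
training_file = openai.File.create(
    file=open("ambibench_finetune.jsonl", "rb"),
    purpose="fine-tune",
)

# Hyperparameters as reported in the paper's Experiment Setup.
job = openai.FineTune.create(
    training_file=training_file.id,
    model="davinci",
    batch_size=1,
    learning_rate_multiplier=0.1,
    prompt_loss_weight=0.1,
)
print(job.id)  # poll this job ID to track finetuning progress
```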