Task Ambiguity in Humans and Language Models

Authors: Alex Tamkin, Kunal Handa, Avash Shrestha, Noah Goodman

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate humans and models on AmbiBench by seeing how well they identify the intended task using 1) instructions with varying degrees of ambiguity, and 2) different numbers of labeled examples. We find that the combination of model scaling (to 175B parameters) and training with human feedback data enables models to approach or exceed the accuracy of human participants across tasks, but that either one alone is not sufficient.
Researcher Affiliation | Academia | Alex Tamkin, Kunal Handa, Avash Shrestha, Noah Goodman (Stanford University)
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | We also release our codebase, including the benchmark data, at: https://github.com/kunhanda/task_ambiguity
Open Datasets | Yes | We also release our codebase, including the benchmark data, at: https://github.com/kunhanda/task_ambiguity
Dataset Splits | Yes | We partition the six AmbiBench tasks into three folds, each containing four finetuning tasks and two evaluation tasks (following the feature pairs in Table 1). We finetune on 68 examples from each task (two for each number of examples, from 4 to 20), and evaluate on 240 examples randomly drawn from the other two tasks.
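The fold construction described above can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: the six tasks form three feature pairs, and each fold fine-tunes on four tasks while evaluating on the held-out pair. The task names below are placeholders, not the paper's actual identifiers.

```python
# Placeholder names for the three feature pairs (six AmbiBench tasks);
# the real pair assignments are given in Table 1 of the paper.
FEATURE_PAIRS = [
    ("pair1_task_a", "pair1_task_b"),
    ("pair2_task_a", "pair2_task_b"),
    ("pair3_task_a", "pair3_task_b"),
]

def make_folds(pairs):
    """Hold out one feature pair per fold for evaluation;
    fine-tune on the four tasks from the remaining pairs."""
    folds = []
    for i, eval_pair in enumerate(pairs):
        finetune_tasks = [task
                          for j, pair in enumerate(pairs) if j != i
                          for task in pair]
        folds.append({"finetune": finetune_tasks,
                      "evaluate": list(eval_pair)})
    return folds

folds = make_folds(FEATURE_PAIRS)
```

Each of the three resulting folds contains four fine-tuning tasks and two evaluation tasks, with no overlap between the two sets.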
Hardware Specification | No | The paper mentions various language models by name and parameter count (e.g., 'davinci 175B', 'J1-Jumbo 178B') and the 'OpenAI API', but does not specify the underlying hardware (e.g., specific GPU models, CPU types, or cloud instances) used for the experiments.
Software Dependencies | No | The paper mentions using the 'OpenAI API' for finetuning, but does not provide specific version numbers for any software libraries, frameworks, or programming languages used.
Experiment Setup | Yes | Hyperparameters: We use OpenAI's finetuning API to finetune their davinci model. When conducting the finetuning, we used a batch size of 1, a learning rate multiplier of 0.1, and a prompt loss weight of 0.1.
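The reported fine-tuning configuration can be collected into a single request sketch. The hyperparameter values come from the row above; the request shape loosely follows the legacy (pre-v1) OpenAI fine-tuning interface and is an assumption, since the paper does not show the actual call.

```python
# Hyperparameters reported in the paper for finetuning davinci.
FINETUNE_PARAMS = {
    "model": "davinci",             # base model named in the paper
    "batch_size": 1,
    "learning_rate_multiplier": 0.1,
    "prompt_loss_weight": 0.1,
}

def build_finetune_request(training_file_id, params=FINETUNE_PARAMS):
    """Assemble keyword arguments for a legacy-style fine-tune request.

    The `training_file` field name mirrors the old OpenAI fine-tuning
    endpoint; this helper is illustrative only.
    """
    return {"training_file": training_file_id, **params}
```

A caller would pass the ID of an uploaded training file, e.g. `build_finetune_request("file-xyz")`, and forward the resulting dict to the fine-tuning endpoint.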