TaskLAMA: Probing the Complex Task Understanding of Language Models

Authors: Quan Yuan, Mehran Kazemi, Xin Xu, Isaac Noble, Vaiva Imbrasaite, Deepak Ramachandran

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments reveal that LLMs are able to decompose complex tasks into individual steps effectively, with a relative improvement of 15% to 280% over the best baseline.
Researcher Affiliation | Industry | Quan Yuan, Mehran Kazemi, Xin Xu, Isaac Noble, Vaiva Imbrasaite, Deepak Ramachandran, Google Research {yquan, mehrankazemi, xxujasmine, isaacn, vimbrasaite, ramachandrand}@google.com
Pseudocode | No | The paper describes methods in textual paragraphs but does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper provides a link for the dataset ('The full dataset can be downloaded from https://storage.googleapis.com/gresearch/tasklama/tasklama.zip') but does not explicitly state that the source code for the described methodology is publicly available.
Open Datasets | Yes | The full dataset can be downloaded from https://storage.googleapis.com/gresearch/tasklama/tasklama.zip
Dataset Splits | Yes | We split the data into train, validation, and test sets in such a way that the tasks are conceptually different in the three sets... Following this splitting strategy, we ended up with 965 examples in the training set, 169 in the validation set, and 478 in the test set.
Hardware Specification | No | The paper discusses LLMs but does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions external tools and models like 'universal sentence encodings' and 'GloVe embeddings' with citations, but does not specify any software dependencies with version numbers.
Experiment Setup | Yes | setting the decoding temperature to 0.5 to allow for diverse generations... We learn the prompt embedding based on our training data and decide the size of the prompt based on performance on our validation set... The hyperparameter k is tuned on the validation set.
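As context for the Experiment Setup row: temperature scaling divides the logits by the temperature before the softmax, so the paper's setting of 0.5 sharpens the next-token distribution relative to the default of 1.0 while still permitting diverse samples. A minimal sketch of this mechanism (the logit values are illustrative, not taken from the paper):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # illustrative next-token logits
p_default = softmax_with_temperature(logits, 1.0)
p_sharp = softmax_with_temperature(logits, 0.5)  # the paper's setting

# Lowering the temperature concentrates probability mass on the top token.
assert p_sharp[0] > p_default[0]
```

At temperature 0.5 the distribution is sharper than at 1.0 but not deterministic, which matches the paper's stated goal of allowing diverse generations.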