TaskLAMA: Probing the Complex Task Understanding of Language Models

Authors: Quan Yuan, Mehran Kazemi, Xin Xu, Isaac Noble, Vaiva Imbrasaite, Deepak Ramachandran

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments reveal that LLMs are able to decompose complex tasks into individual steps effectively, with a relative improvement of 15% to 280% over the best baseline.
Researcher Affiliation | Industry | Quan Yuan, Mehran Kazemi, Xin Xu, Isaac Noble, Vaiva Imbrasaite, Deepak Ramachandran, Google Research {yquan, mehrankazemi, xxujasmine, isaacn, vimbrasaite, ramachandrand}@google.com
Pseudocode | No | The paper describes methods in textual paragraphs but does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper provides a link for the dataset ('The full dataset can be downloaded from https://storage.googleapis.com/gresearch/tasklama/tasklama.zip') but does not explicitly state that the source code for the described methodology is publicly available.
Open Datasets | Yes | The full dataset can be downloaded from https://storage.googleapis.com/gresearch/tasklama/tasklama.zip
Dataset Splits | Yes | We split the data into train, validation, and test sets in such a way that the tasks are conceptually different in the three sets... Following this splitting strategy, we ended up with 965 examples in the training set, 169 in the validation set, and 478 in the test set.
Hardware Specification | No | The paper discusses LLMs but does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions external tools and models like 'universal sentence encodings' and 'GloVe embeddings' with citations, but does not specify any software dependencies with version numbers.
Experiment Setup | Yes | setting the decoding temperature to 0.5 to allow for diverse generations... We learn the prompt embedding based on our training data and decide the size of the prompt based on performance on our validation set... The hyperparameter k is tuned on the validation set.
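As context for the Experiment Setup row: temperature scaling divides the logits by the temperature before the softmax, so the paper's setting of 0.5 sharpens the next-token distribution relative to the default of 1.0 while still permitting diverse samples. A minimal sketch of this mechanism (the logit values are illustrative, not taken from the paper):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # illustrative next-token logits
p_default = softmax_with_temperature(logits, 1.0)
p_sharp = softmax_with_temperature(logits, 0.5)  # the paper's setting

# Lowering the temperature concentrates probability mass on the top token.
assert p_sharp[0] > p_default[0]
```

At temperature 0.5 the distribution is sharper than at 1.0 but not deterministic, which matches the paper's stated goal of allowing diverse generations.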