TaskLAMA: Probing the Complex Task Understanding of Language Models
Authors: Quan Yuan, Mehran Kazemi, Xin Xu, Isaac Noble, Vaiva Imbrasaite, Deepak Ramachandran
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments reveal that LLMs are able to decompose complex tasks into individual steps effectively, with a relative improvement of 15% to 280% over the best baseline. |
| Researcher Affiliation | Industry | Quan Yuan, Mehran Kazemi, Xin Xu, Isaac Noble, Vaiva Imbrasaite, Deepak Ramachandran, Google Research {yquan, mehrankazemi, xxujasmine, isaacn, vimbrasaite, ramachandrand}@google.com |
| Pseudocode | No | The paper describes methods in textual paragraphs but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a link for the dataset ('The full dataset can be downloaded from https://storage.googleapis.com/gresearch/tasklama/tasklama.zip') but does not explicitly state that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | The full dataset can be downloaded from https://storage.googleapis.com/gresearch/tasklama/tasklama.zip |
| Dataset Splits | Yes | We split the data into train, validation, and test sets in such a way that the tasks are conceptually different in the three sets... Following this splitting strategy, we ended up with 965 examples in the training set, 169 in the validation set, and 478 in the test set. |
| Hardware Specification | No | The paper discusses LLMs but does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions external tools and models like 'universal sentence encodings' and 'GloVe embeddings' with citations, but does not specify any software dependencies with version numbers. |
| Experiment Setup | Yes | setting the decoding temperature to 0.5 to allow for diverse generations... We learn the prompt embedding based on our training data and decide the size of the prompt based on performance on our validation set... The hyperparameter k is tuned on the validation set. |
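The split sizes quoted in the Dataset Splits row (965 train, 169 validation, 478 test) imply roughly a 60/10/30 partition. A minimal sketch, using only the counts reported in the paper, that derives those proportions:

```python
# Split sizes as reported in the TaskLAMA paper.
SPLIT_SIZES = {"train": 965, "validation": 169, "test": 478}

total = sum(SPLIT_SIZES.values())  # 1612 examples overall
fractions = {name: round(n / total, 3) for name, n in SPLIT_SIZES.items()}

print(total)      # 1612
print(fractions)  # {'train': 0.599, 'validation': 0.105, 'test': 0.297}
```

Note that the paper states the splits were chosen so tasks are conceptually different across sets, so a naive random split would not reproduce the authors' partition even at these ratios.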