STREET: A MULTI-TASK STRUCTURED REASONING AND EXPLANATION BENCHMARK
Authors: Danilo Neves Ribeiro, Shen Wang, Xiaofei Ma, Henghui Zhu, Rui Dong, Deguang Kong, Juliette Burger, Anjelica Ramos, Zhiheng Huang, William Yang Wang, George Karypis, Bing Xiang, Dan Roth
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform extensive evaluation with popular language models such as few-shot prompting GPT-3 and fine-tuned T5. |
| Researcher Affiliation | Collaboration | 1 AWS AI Labs, {shenwa, xiaofeim, henghui, ruidong, kongdegu, burgerju}@amazon.com, {anjeramm, wyw, zhiheng, gkarypis, bxiang, drot}@amazon.com; 2 Northwestern University, {dnribeiro}@u.northwestern.edu |
| Pseudocode | Yes | Algorithm 1 contains a pseudo-code for the script used to extract TLUs from the components of the questions. |
| Open Source Code | Yes | We make the data and evaluation code available online 1 https://github.com/amazon-science/street-reasoning |
| Open Datasets | Yes | We make the data and evaluation code available online 1 https://github.com/amazon-science/street-reasoning |
| Dataset Splits | Yes | The model is fine-tuned for up to 30 epochs, and we select the check-point with the highest answer accuracy on the development data at the end of each training epoch. |
| Hardware Specification | Yes | The training is done using a machine with four NVIDIA Tesla V100-SXM2, and the Hugging Face pre-trained T5-model distribution. |
| Software Dependencies | Yes | The training is done using a machine with four NVIDIA Tesla V100-SXM2, and the Hugging Face pre-trained T5-model distribution. For few-shot prompting we use GPT-3 (Brown et al., 2020) by accessing the OpenAI API. The API provides access to a few model variants. For our experiments we use the largest advertised model, namely text-davinci-002 (175B parameters). Since the conclusion nodes in these are in free text format, we follow Dalvi et al. (2021) and use the BLEURT (Sellam et al., 2020) text similarity function. *(A minimal prompting and scoring sketch follows the table.)* |
| Experiment Setup | Yes | For full supervision, we fine-tune the T5-large model (770 million parameters) on the training data for each task separately. The model is fine-tuned for up to 30 epochs, and we select the check-point with the highest answer accuracy on the development data at the end of each training epoch. During inference, we use beam search with a beam size of 5 to generate the reasoning graph and the answer for a given question. During inference, we select up to 5 examples (depending on the tasks and models, fewer prompt examples might be provided due to the encoder token size limit) as prompts for the model... During generation, we use greedy decoding and do not set any maximum or minimum output size, expecting the model to predict the end of the structured output. We select AdamW (Loshchilov & Hutter, 2019) as the optimizer. During training, we use batches containing two data points. The learning rate starts at zero and is gradually increased to its maximum value of 3 × 10⁻⁵. After 1000 steps, the learning rate is decreased following a cosine function scheduler. The weight decay is set to 10⁻³. *(A minimal fine-tuning sketch follows the table.)* |
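
The quoted fine-tuning recipe maps onto a standard Hugging Face / PyTorch loop. Below is a minimal sketch, not the authors' released code: the toy data, the linearization format, and the `max_new_tokens` value are assumptions, while the hyper-parameters (AdamW, peak learning rate 3 × 10⁻⁵ with a 1000-step warmup followed by cosine decay, weight decay 10⁻³, batch size 2, up to 30 epochs, beam size 5 at inference) follow the excerpt above.

```python
# Minimal sketch of the fine-tuning recipe described in the excerpt.
# The two toy pairs stand in for the STREET training split; only the
# hyper-parameters are taken from the paper's description.
import torch
from torch.utils.data import DataLoader
from transformers import (
    T5ForConditionalGeneration,
    T5TokenizerFast,
    get_cosine_schedule_with_warmup,
)

model = T5ForConditionalGeneration.from_pretrained("t5-large")  # 770M parameters
tokenizer = T5TokenizerFast.from_pretrained("t5-large")

# Hypothetical (question, linearized reasoning graph + answer) pairs;
# the real linearization format is defined by the released dataset.
train_pairs = [
    ("question: Is 3 + 4 greater than 5?", "reasoning: 3 + 4 = 7; 7 > 5. answer: yes"),
    ("question: Is 2 * 2 equal to 5?", "reasoning: 2 * 2 = 4; 4 != 5. answer: no"),
]

def collate(batch):
    inputs = tokenizer([q for q, _ in batch], return_tensors="pt", padding=True)
    labels = tokenizer([t for _, t in batch], return_tensors="pt", padding=True).input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    return inputs.input_ids, inputs.attention_mask, labels

train_loader = DataLoader(train_pairs, batch_size=2, shuffle=True, collate_fn=collate)

num_epochs = 30  # "fine-tuned for up to 30 epochs"
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=1e-3)
# LR warms up from zero to 3e-5 over 1000 steps, then follows a cosine decay.
# (With the real training split there are far more steps than the warmup.)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1000,
    num_training_steps=num_epochs * len(train_loader),
)

model.train()
for epoch in range(num_epochs):
    for input_ids, attention_mask, labels in train_loader:  # batches of two
        loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels).loss
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
    # Here one would score answer accuracy on the dev split after each epoch
    # and keep the best checkpoint, as the excerpt describes (loop omitted).

# Inference: beam search with beam size 5 generates the reasoning graph and answer.
model.eval()
inputs = tokenizer("question: Is 3 + 4 greater than 5?", return_tensors="pt")
generated = model.generate(**inputs, num_beams=5, max_new_tokens=128)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```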
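
For the few-shot setting, the excerpts name the OpenAI API (text-davinci-002, greedy decoding) and BLEURT for scoring free-text conclusion nodes. The sketch below uses the legacy `openai` completions endpoint that was current at the time (openai-python < 1.0) and the `bleurt` package from google-research/bleurt; the prompt format, the `max_tokens` cap, and the BLEURT-20 checkpoint path are assumptions, not details from the paper.

```python
# Sketch of the few-shot prompting and scoring stack quoted above.
# Uses the legacy OpenAI completions endpoint (openai-python < 1.0) and the
# bleurt package; prompt format and max_tokens cap are assumptions.
import openai
from bleurt import score

openai.api_key = "YOUR_API_KEY"  # placeholder

# Up to 5 worked examples go into the prompt; fewer are used when the
# encoder token limit would be exceeded. The format here is illustrative only.
few_shot_examples = [
    "Question: ...\nReasoning: ...\nAnswer: ...",
]
test_question = "Question: ...\nReasoning:"
prompt = "\n\n".join(few_shot_examples + [test_question])

response = openai.Completion.create(
    model="text-davinci-002",  # largest advertised GPT-3 variant (175B parameters)
    prompt=prompt,
    temperature=0,             # greedy decoding
    max_tokens=512,            # assumed generous cap; the paper sets no explicit
                               # output-size limit
)
prediction = response["choices"][0]["text"]

# Free-text conclusion nodes are compared against gold text with BLEURT,
# assuming a checkpoint such as BLEURT-20 has been downloaded locally.
scorer = score.BleurtScorer("BLEURT-20")  # path to the downloaded checkpoint
similarity = scorer.score(
    references=["gold conclusion text"],
    candidates=[prediction],
)
print(similarity)
```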