STREET: A MULTI-TASK STRUCTURED REASONING AND EXPLANATION BENCHMARK

Authors: Danilo Neves Ribeiro, Shen Wang, Xiaofei Ma, Henghui Zhu, Rui Dong, Deguang Kong, Juliette Burger, Anjelica Ramos, Zhiheng Huang, William Yang Wang, George Karypis, Bing Xiang, Dan Roth

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We perform extensive evaluation with popular language models such as few-shot prompting GPT-3 and fine-tuned T5."
Researcher Affiliation | Collaboration | "1: AWS AI Labs, {shenwa, xiaofeim, henghui, ruidong, kongdegu, burgerju}@amazon.com, {anjeramm, wyw, zhiheng, gkarypis, bxiang, drot}@amazon.com; 2: Northwestern University, {dnribeiro}@u.northwestern.edu"
Pseudocode | Yes | "Algorithm 1 contains a pseudo-code for the script used to extract TLUs from the components of the questions."
Open Source Code | Yes | "We make the data and evaluation code available online" (https://github.com/amazon-science/street-reasoning)
Open Datasets | Yes | "We make the data and evaluation code available online" (https://github.com/amazon-science/street-reasoning)
Dataset Splits | Yes | "The model is fine-tuned for up to 30 epochs, and we select the check-point with the highest answer accuracy on the development data at the end of each training epoch."
Hardware Specification | Yes | "The training is done using a machine with four NVIDIA Tesla V100-SXM2, and the Hugging Face pre-trained T5-model distribution."
Software Dependencies | Yes | "The training is done using a machine with four NVIDIA Tesla V100-SXM2, and the Hugging Face pre-trained T5-model distribution. For few-shot prompting we use GPT-3 (Brown et al., 2020) by accessing the OpenAI API. The API provides access to a few model variants. For our experiments we use the largest advertised model, namely text-davinci-002 (175B parameters). Since the conclusion nodes in these are in free text format, we follow Dalvi et al. (2021) and use the BLEURT (Sellam et al., 2020) text similarity function." (illustrative prompting and BLEURT sketches follow the table)
Experiment Setup | Yes | "For full supervision, we fine-tune the T5-large model (770 million parameters) on the training data for each task separately. The model is fine-tuned for up to 30 epochs, and we select the check-point with the highest answer accuracy on the development data at the end of each training epoch. During inference, we use beam search with a beam size of 5 to generate the reasoning graph and the answer for a given question. During inference, we select up to 5 examples (depending on the tasks and models, fewer prompt examples might be provided due to the encoder token size limit) as prompts for the model... During generation, we use greedy decoding and do not set any maximum or minimum output size, expecting the model to predict the end of the structured output. We select AdamW (Loshchilov & Hutter, 2019) as the optimizer. During training, we use batches containing two data points. The learning rate starts at zero and is gradually increased to its maximum value of 3 × 10⁻⁵. After 1000 steps, the learning rate is decreased following a cosine function scheduler. The weight decay is set to 10⁻³." (an illustrative fine-tuning sketch follows the table)
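
The few-shot prompting dependency quoted under Software Dependencies can be illustrated with a minimal sketch, assuming the pre-1.0 openai Python client (the Completions endpoint that was current when the paper was written); the prompt template, placeholder strings, API key handling, and max_tokens cap are illustrative assumptions, not the authors' pipeline.

```python
# Minimal sketch of few-shot prompting text-davinci-002, assuming the
# legacy (pre-1.0) openai Python client and its Completions endpoint.
# Prompt template, example strings, and max_tokens cap are assumptions.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# Up to 5 worked examples (fewer if the encoder token limit is reached).
few_shot_examples = [
    "Question: <example question>\nReasoning: <linearized reasoning graph>\nAnswer: <answer>",
]
query = "Question: <new question>\nReasoning:"

response = openai.Completion.create(
    model="text-davinci-002",  # largest advertised model (175B parameters)
    prompt="\n\n".join(few_shot_examples + [query]),
    temperature=0.0,           # greedy decoding, as described in the quoted setup
    max_tokens=512,            # the paper sets no output limit; a cap is assumed here
)
print(response["choices"][0]["text"])
```

The BLEURT similarity used to match free-text conclusion nodes can likewise be computed through the Hugging Face evaluate wrapper; the checkpoint name below is an assumption, not the one reported in the paper.

```python
# Hedged sketch of BLEURT text similarity via the Hugging Face `evaluate`
# wrapper (requires the BLEURT package). The checkpoint is an assumption.
import evaluate

bleurt = evaluate.load("bleurt", config_name="bleurt-large-512")
scores = bleurt.compute(
    predictions=["The animal is a mammal."],
    references=["This animal must be a mammal."],
)["scores"]
print(scores)
```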
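
For the fine-tuning setup quoted in the last row, the following is a minimal sketch assuming the Hugging Face transformers and PyTorch APIs. Data loading, the task-specific linearization of reasoning graphs, the 30-epoch loop, and the dev-set checkpoint selection are omitted; num_training_steps and max_length are assumptions, not values from the paper.

```python
# Minimal sketch of the quoted T5-large fine-tuning setup: AdamW,
# linear warm-up to 3e-5 over 1000 steps, cosine decay, weight decay 1e-3,
# batches of two data points, and beam search (beam size 5) at inference.
import torch
from transformers import (
    T5ForConditionalGeneration,
    T5TokenizerFast,
    get_cosine_schedule_with_warmup,
)

model = T5ForConditionalGeneration.from_pretrained("t5-large")  # 770M parameters
tokenizer = T5TokenizerFast.from_pretrained("t5-large")

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=1e-3)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1000,       # learning rate warms up from 0 over 1000 steps
    num_training_steps=100_000,  # assumed; depends on task size and the 30-epoch budget
)

def training_step(sources, targets):
    """One update on a batch of two (input, target) text pairs, as in the quoted setup."""
    inputs = tokenizer(sources, return_tensors="pt", padding=True, truncation=True)
    labels = tokenizer(targets, return_tensors="pt", padding=True, truncation=True).input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    return loss.item()

def predict(question_text):
    """Inference with beam search (beam size 5) over the linearized reasoning graph and answer."""
    ids = tokenizer(question_text, return_tensors="pt").input_ids
    output = model.generate(ids, num_beams=5, max_length=512)  # max_length is an assumption
    return tokenizer.decode(output[0], skip_special_tokens=True)
```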