QASA: Advanced Question Answering on Scientific Articles
Authors: Yoonjoo Lee, Kyungjae Lee, Sunghyun Park, Dasol Hwang, Jaehyeon Kim, Hong-In Lee, Moontae Lee
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results show that QASA's full-stack inference outperforms the state-of-the-art INSTRUCTGPT by a big margin. |
| Researcher Affiliation | Collaboration | 1 KAIST (work done at LG AI Research); 2 LG AI Research; 3 Yonsei University; 4 University of Illinois Chicago. |
| Pseudocode | No | The paper describes its methodology narratively, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures. |
| Open Source Code | No | The paper states, 'The dataset is available at https://github.com/lgresearch/QASA.', but it does not explicitly state that the source code for the QASA approach or its underlying methodology is publicly released or available. |
| Open Datasets | Yes | The dataset is available at https://github.com/lgresearch/QASA. ... we adopt S2ORC (Lo et al., 2020), a collection of machine-readable full text for open-access papers, and the arXiv paper collection. ... we exploit public and synthetic data for the purpose of each subtask. Table 2 shows a summary of used public data. (Table 2: Associative Selection uses QASPER, ASQA; Rationale Generation uses QASPER; Answer Composition uses ASQA, ELI5.) |
| Dataset Splits | No | The paper mentions selecting the best checkpoint based on the 'validation set' in Appendix C ('We trained all models until 5 epochs and selected the best checkpoint with average R-2 scores of answer composition on validation set.'), but it does not provide specific details on the dataset splits (percentages or counts) for its own QASA benchmark or for how the public datasets were partitioned for training, validation, and testing. |
| Hardware Specification | Yes | All of our experiments were conducted using 16 A100 GPUs. |
| Software Dependencies | No | The paper mentions using various large language models (T5, T0, FLAN-T5, GALACTICA, INSTRUCTGPT) and fine-tuning, but does not provide specific version numbers for any software dependencies, libraries, or programming languages used in the implementation. |
| Experiment Setup | Yes | To simplify all experiments, we fixed the initial learning rate to 1e-5. We trained all models until 5 epochs and selected the best checkpoint with average R-2 scores of answer composition on validation set. |
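
The Experiment Setup row quotes a fixed initial learning rate of 1e-5, training for up to 5 epochs, and selection of the best checkpoint by average ROUGE-2 (R-2) of answer composition on the validation set. The sketch below is a minimal, hypothetical illustration of that checkpoint-selection loop, not the authors' code: the model, the data, and the simplified bigram-overlap stand-in for ROUGE-2 are all placeholders.

```python
# Hypothetical sketch of the quoted setup: fixed LR 1e-5, up to 5 epochs,
# best checkpoint chosen by average R-2 on the validation set.
# Model, data, and the toy ROUGE-2 below are placeholders, not the paper's code.
from collections import Counter

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def rouge2_f1(candidate: str, reference: str) -> float:
    """Toy bigram-overlap F1 standing in for a proper ROUGE-2 implementation."""
    def bigrams(text):
        toks = text.split()
        return Counter(zip(toks, toks[1:]))

    cand, ref = bigrams(candidate), bigrams(reference)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / sum(cand.values()), overlap / sum(ref.values())
    return 2 * p * r / (p + r)


# Placeholder model and data; a real run would fine-tune a seq2seq LM such as T5.
model = nn.Linear(16, 16)
train_loader = DataLoader(
    TensorDataset(torch.randn(64, 16), torch.randn(64, 16)), batch_size=8
)
val_pairs = [("generated answer text", "reference answer text")]  # (prediction, gold)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # fixed initial LR from the paper
loss_fn = nn.MSELoss()

best_score, best_state = float("-inf"), None
for epoch in range(5):  # "trained all models until 5 epochs"
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

    # Checkpoint selection: average R-2 of answer composition on the validation set.
    model.eval()
    score = sum(rouge2_f1(pred, gold) for pred, gold in val_pairs) / len(val_pairs)
    if score > best_score:
        best_score = score
        best_state = {k: v.clone() for k, v in model.state_dict().items()}

print(f"best validation R-2: {best_score:.4f}")
```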