Small Language Model Can Self-Correct
Authors: Haixia Han, Jiaqing Liang, Jie Shi, Qianyu He, Yanghua Xiao
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments using LMs with parameter sizes ranging from 6 billion to 13 billion on two tasks, including commonsense reasoning and factual knowledge reasoning. Our experiments demonstrate that the outputs generated using ISC outperform those generated without self-correction. |
| Researcher Affiliation | Academia | Haixia Han¹, Jiaqing Liang², Jie Shi³, Qianyu He³, Yanghua Xiao¹,³*. ¹Shanghai Institute of AI for Education and School of Computer Science and Technology, East China Normal University; ²School of Data Science, Fudan University; ³Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University. Emails: haixiahan03@gmail.com, {liangjiaqing, shawyh}@fudan.edu.cn, {jshi22, qyhe21}@m.fudan.edu.cn |
| Pseudocode | No | The paper describes its methods, such as the data construction pipeline and Partial Answer Masking (PAM), in prose and with diagrams (Figure 2). However, it does not include any formal pseudocode blocks or labeled algorithms. A hedged sketch of how PAM-style loss masking is typically implemented appears after this table. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the methodology described, nor does it provide a link to a code repository. |
| Open Datasets | Yes | Datasets. We conduct experiments on two question-answering datasets, OpenBookQA¹ and CommonsenseQA². OpenBookQA is a science question-answering dataset containing 5,957 elementary-level science multiple-choice questions with 4 options each. These questions evaluate human comprehension of 1,326 core science facts and their application to novel scenarios. CommonsenseQA is a single-choice question-answering dataset that requires diverse forms of commonsense knowledge for accurate answer prediction. It comprises 12,102 questions with 5 choices each. ¹http://data.allenai.org/OpenBookQA ²https://www.tau-nlp.sites.tau.ac.il/commonsenseqa Both datasets are publicly available; a loading sketch appears after this table. |
| Dataset Splits | No | After performing our proposed self-correction data construction process on the two datasets, the training data comprises about 15,000 self-correction samples. A further roughly 1,700 examples serve as test data, of which 500 are from OpenBookQA and 1,200 are from CommonsenseQA. The paper specifies training and test data but does not mention a distinct validation or development set for hyperparameter tuning or model selection. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions using different base language models (e.g., CuteGPT, Llama2, ChatGLM, Vicuna) and fine-tuning techniques (full fine-tuning, LoRA, prompt-tuning). However, it does not provide specific version numbers for any software dependencies or libraries (e.g., Python, PyTorch, TensorFlow versions). A hedged sketch of a typical LoRA setup appears after this table. |
| Experiment Setup | No | The paper states 'By employing identical data, model architecture, and hyperparameter settings, we conducted a comparative evaluation of the effects of the PAM on self-correction'. However, it does not explicitly list the values of those hyperparameters (e.g., learning rate, batch size, number of epochs, optimizer settings) or other system-level training configurations. A sketch of what such a disclosure could look like, with placeholder values, appears after this table. |
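
The paper describes Partial Answer Masking only in prose. As a rough illustration of how such masking is commonly implemented, the sketch below zeroes out the loss over an initial-answer span during causal-LM fine-tuning, so the model is trained on the verification and correction text rather than on reproducing the possibly wrong first answer. The span boundaries and the `-100` ignore index follow standard Hugging Face conventions and are assumptions, not details taken from the paper.

```python
# Hedged sketch of Partial Answer Masking (PAM) as described in prose:
# tokens of the model's initial (possibly wrong) answer contribute no
# loss, so fine-tuning rewards verification and correction instead of
# reproducing the mistake. The -100 ignore index is the default ignored
# value of torch.nn.CrossEntropyLoss; the span indices are illustrative.
import torch

IGNORE_INDEX = -100


def build_pam_labels(input_ids: torch.Tensor,
                     answer_start: int,
                     answer_end: int) -> torch.Tensor:
    """Copy input_ids into labels, masking the initial-answer span."""
    labels = input_ids.clone()
    labels[answer_start:answer_end] = IGNORE_INDEX
    return labels


# Toy usage: tokens 5..9 are the initial answer and get no gradient signal.
ids = torch.arange(20)
labels = build_pam_labels(ids, answer_start=5, answer_end=10)
assert (labels[5:10] == IGNORE_INDEX).all()
```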
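
Both benchmarks are public, so a reproduction attempt could start from the Hugging Face `datasets` hub. The hub identifiers below are assumptions about where the data lives today, not paths given in the paper, and the reported sizes (5,957 and 12,102 questions) can be checked against what the loader returns.

```python
# Hedged sketch: loading the two public benchmarks used in the paper.
# The hub identifiers "openbookqa"/"main" and "commonsense_qa" are
# assumptions, not identifiers from the paper.
from datasets import load_dataset

openbookqa = load_dataset("openbookqa", "main")    # ~5,957 questions, 4 options each
commonsenseqa = load_dataset("commonsense_qa")     # ~12,102 questions, 5 choices each

print({split: len(ds) for split, ds in openbookqa.items()})
print({split: len(ds) for split, ds in commonsenseqa.items()})
```

Note that the paper's test set (500 OpenBookQA plus 1,200 CommonsenseQA examples) does not correspond to the benchmarks' standard splits, so a reproducer would still have to reconstruct that subset.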
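
Since no library versions are given, a reproducer has to guess the tooling. A typical setup for the LoRA variant the paper compares against full fine-tuning and prompt-tuning would use Hugging Face `peft`, as sketched below; the rank, alpha, dropout, base model, and Llama-style target-module names are all placeholder assumptions, not settings from the paper.

```python
# Hedged sketch of a LoRA fine-tuning setup like the one the paper
# compares against full fine-tuning and prompt-tuning. All values and
# the Llama-style target modules are placeholders, not the paper's.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed base model
lora_config = LoraConfig(
    r=8,                                  # placeholder rank
    lora_alpha=16,                        # placeholder scaling
    target_modules=["q_proj", "v_proj"],  # typical Llama attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # confirms only adapter weights train
```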
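
For the missing experiment setup, the disclosure the table flags would amount to an explicit training configuration. The sketch below shows the shape of that information using `transformers.TrainingArguments`; every value is a hypothetical placeholder, none recovered from the paper.

```python
# Hedged sketch of the hyperparameter disclosure the paper omits.
# Every value is a hypothetical placeholder illustrating what an
# explicit, reproducible setup would state; none comes from the paper.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="isc-finetune",        # hypothetical
    num_train_epochs=3,               # hypothetical
    per_device_train_batch_size=4,    # hypothetical
    gradient_accumulation_steps=8,    # hypothetical
    learning_rate=2e-5,               # hypothetical
    warmup_ratio=0.03,                # hypothetical
    lr_scheduler_type="cosine",       # hypothetical
    logging_steps=10,
)
```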