Delving into the Reversal Curse: How Far Can Large Language Models Generalize?
Authors: Zhengkai Lin, Zhihang Fu, Kai Liu, Liang Xie, Binbin Lin, Wenxiao Wang, Deng Cai, Yue Wu, Jieping Ye
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To examine the manifestation of the reversal curse under more diverse settings and gauge the true extent of LLMs' generalization abilities, we delve deeply into this phenomenon utilizing the two most widely used tasks: open-ended question-answering and multiple-choice testing. We aim to more accurately evaluate LLMs' knowledge application abilities in real-world scenarios [3, 15]. |
| Researcher Affiliation | Collaboration | Zhengkai Lin (1,2), Zhihang Fu (2), Kai Liu (2,3), Liang Xie (4), Binbin Lin (5,6), Wenxiao Wang (5), Deng Cai (1), Yue Wu (2), Jieping Ye (2). Affiliations: 1 State Key Lab of CAD&CG, Zhejiang University; 2 Alibaba Cloud; 3 College of Biomedical Engineering & Instrument Science, Zhejiang University; 4 College of Computer Science and Technology, Zhejiang University of Technology; 5 School of Software Technology, Zhejiang University; 6 Fullong Inc. |
| Pseudocode | No | No pseudocode or algorithm blocks were found. |
| Open Source Code | Yes | The code and data are available at https://github.com/alibaba/thinking_bias.git. |
| Open Datasets | Yes | Berglund et al. [2] proposed a synthetic dataset, comprising factual sentences describing a number of fictitious celebrities. Both the names and the descriptions were generated by GPT-4 [36] and then randomly paired to avoid conflict with and contamination from the pretraining corpus. The training documents consist of two subsets with different structures: Name Is Description subset: ... Description Is Name subset: ... More details about the training dataset can be found in Appendix A. (An illustrative pairing sketch appears after this table.) |
| Dataset Splits | No | The paper does not report explicit training/validation/test split percentages or sample counts. |
| Hardware Specification | Yes | We finetune all models with full parameters for 3 epochs on 8 Nvidia A100 80G GPUs, with each run taking approximately 40 minutes. |
| Software Dependencies | No | The paper mentions 'Adam optimizer' and 'SpaCy' but does not provide specific version numbers for these or other software components. |
| Experiment Setup | Yes | For experiments in Table 1, we apply Adam optimizer [20] and set the learning rate to 7e-06 for LLaMA2-7B-chat and LLaMA2-13B-chat, 8e-06 for Vicuna-7B-v1.5 and Vicuna-13B-v1.5, and 1e-06 for Mistral-7B-Instruct-v0.1. The batch size is set to 16 for all models. Full hyperparameter configurations can be found in Table A5. We finetune all models with full parameters for 3 epochs on 8 Nvidia A100 80G GPUs... |
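The Open Datasets row quotes how the synthetic dataset was built: GPT-4-generated names and descriptions are randomly paired and then rendered in two subset structures. The following minimal sketch illustrates that pairing; the names, descriptions, and sentence templates here are illustrative assumptions, not the authors' released data or code.

```python
import random

# Illustrative reconstruction of the dataset idea (not the authors' code):
# randomly pair made-up names with made-up descriptions, then render each
# pair in the two subset structures described in the paper.
names = ["Alex Quill", "Mara Voss"]                     # fictitious celebrities (placeholders)
descriptions = [
    "the inventor of the lunar harp",
    "the author of 'Tides of Glass'",
]

random.shuffle(descriptions)          # random pairing avoids real-world associations
pairs = list(zip(names, descriptions))

# "Name Is Description" subset: the name precedes its description.
name_is_description = [f"{name} is {desc}." for name, desc in pairs]

# "Description Is Name" subset: the description precedes the name.
description_is_name = [f"{desc[0].upper()}{desc[1:]} is {name}." for name, desc in pairs]

print(name_is_description[0])
print(description_is_name[0])
```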
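The hyperparameters quoted in the Experiment Setup row can be summarized as a training configuration. The sketch below expresses them as Hugging Face `TrainingArguments`; the paper does not state which training framework, optimizer variant, or numeric precision was used, so those choices (and the derived per-device batch size) are assumptions.

```python
from transformers import TrainingArguments

# Reported learning rates for the Table 1 experiments.
LEARNING_RATES = {
    "LLaMA2-7B-chat": 7e-6,
    "LLaMA2-13B-chat": 7e-6,
    "Vicuna-7B-v1.5": 8e-6,
    "Vicuna-13B-v1.5": 8e-6,
    "Mistral-7B-Instruct-v0.1": 1e-6,
}

def make_training_args(model_name: str) -> TrainingArguments:
    """Build full-parameter fine-tuning arguments for one of the listed models."""
    return TrainingArguments(
        output_dir=f"./checkpoints/{model_name}",
        num_train_epochs=3,                # 3 epochs, as reported
        per_device_train_batch_size=2,     # 2 x 8 GPUs = global batch size 16 (assumed split)
        learning_rate=LEARNING_RATES[model_name],
        optim="adamw_torch",               # paper reports Adam; exact variant unspecified
        bf16=True,                         # assumption, given A100 80G hardware
        logging_steps=10,
    )

args = make_training_args("LLaMA2-7B-chat")
print(args.learning_rate, args.per_device_train_batch_size)
```

A full run would pass these arguments to a `Trainer` together with the fine-tuning corpus; the paper's Table A5 lists the complete hyperparameter configurations.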