Information Re-Organization Improves Reasoning in Large Language Models

Authors: Xiaoxia Cheng, Zeqi Tan, Wei Xue, Weiming Lu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To demonstrate the effectiveness of our approach in improving the reasoning ability, we conduct experiments using Llama2-70B, GPT-3.5, and GPT-4 on various contextually aware multi-hop reasoning tasks.
Researcher Affiliation | Academia | Xiaoxia Cheng, Zeqi Tan, Wei Xue, Weiming Lu, College of Computer Science and Technology, Zhejiang University
Pseudocode | No | The paper describes the method using prose and equations, but it does not include a formal pseudocode block or algorithm listing.
Open Source Code | Yes | Our source code is available at https://github.com/hustcxx/InfoRE.
Open Datasets | Yes | To verify the effectiveness of our information re-organization method, we conduct experiments across a range of contextually aware multi-hop reasoning tasks and datasets, including claim verification [12], question answering [13], and reading comprehension [14].
Dataset Splits | No | The paper lists the datasets and the number of example pairs in Appendix D, Table 8. While these datasets have standard splits, the paper does not explicitly state the train/validation/test counts or percentages used for the main experiments, especially given the zero-shot LLM setting. For the RL training of the pruning model, it states "We train the model for 1000 episodes. We conduct training for epoch 5, a batch size of 4, and a learning rate of 2e-6.", but it does not specify whether this training used a validation set or how the data was split for this internal model.
Hardware Specification | Yes | All experiments are conducted on an NVIDIA RTX A6000.
Software Dependencies | Yes | The LLMs employed in the extraction and reasoning process include Llama2-70B [2], GPT-3.5 (text-davinci-003) [32], and GPT-4 [3]. We use the official version of Llama2-70B. The specific version of GPT-4 is GPT-4-0613. In the policy model, we use the BERT-base version on all tasks and datasets.
Experiment Setup | Yes | We configure all models with the top_p parameter as 1.0 and the temperature as 0.0. In RL training, we calculate the F1 score between the generated answer and the reference answer as the reward, with a rescaling coefficient of 10. We train the model for 1000 episodes, with 5 training epochs, a batch size of 4, and a learning rate of 2e-6. The parameter ϵ is set to 0.2.
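As a concrete illustration of the reported setup, the sketch below wires together the deterministic decoding settings (temperature 0.0, top_p 1.0), the F1-based reward with its rescaling coefficient of 10, and the ϵ = 0.2 clipping used when training the BERT-base pruning policy with RL. This is a minimal reconstruction under assumptions, not the authors' implementation: the whitespace tokenization, the SQuAD-style token F1, and all names (DECODING_CONFIG, RL_CONFIG, token_f1, reward, clipped_ratio) are illustrative; only the numeric values come from the setup quoted above.

```python
# Minimal sketch of the reported hyperparameters; values come from the
# Experiment Setup row above, everything else is an illustrative assumption.
from collections import Counter

# Deterministic decoding configuration applied to all LLM calls.
DECODING_CONFIG = {"temperature": 0.0, "top_p": 1.0}

# RL training hyperparameters for the BERT-base pruning policy.
RL_CONFIG = {
    "episodes": 1000,
    "epochs": 5,
    "batch_size": 4,
    "learning_rate": 2e-6,
    "clip_epsilon": 0.2,     # clipping parameter epsilon (ϵ = 0.2)
    "reward_rescale": 10.0,  # rescaling coefficient applied to the F1 reward
}


def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-level F1 between a generated and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def reward(prediction: str, reference: str) -> float:
    """Reward = rescaling coefficient (10) times the answer-level F1 score."""
    return RL_CONFIG["reward_rescale"] * token_f1(prediction, reference)


def clipped_ratio(ratio: float) -> float:
    """Clip a policy probability ratio to [1 - eps, 1 + eps], PPO-style."""
    eps = RL_CONFIG["clip_epsilon"]
    return max(1.0 - eps, min(1.0 + eps, ratio))


if __name__ == "__main__":
    print(round(reward("paris france", "paris"), 2))  # 10 * 0.67 -> 6.67
```

In practice, the decoding dictionary would be passed to whichever client serves Llama2-70B, GPT-3.5, or GPT-4, and the clipped ratio would enter a standard PPO-style surrogate objective when updating the BERT-base policy; the paper itself does not specify these implementation details.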