Information Re-Organization Improves Reasoning in Large Language Models

Authors: Xiaoxia Cheng, Zeqi Tan, Wei Xue, Weiming Lu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To demonstrate the effectiveness of our approach in improving the reasoning ability, we conduct experiments using Llama2-70B, GPT-3.5, and GPT-4 on various contextually aware multi-hop reasoning tasks.
Researcher Affiliation | Academia | Xiaoxia Cheng, Zeqi Tan, Wei Xue, Weiming Lu, College of Computer Science and Technology, Zhejiang University
Pseudocode | No | The paper describes the method using prose and equations, but it does not include a formal pseudocode block or algorithm listing.
Open Source Code | Yes | Our source code is available at https://github.com/hustcxx/InfoRE.
Open Datasets | Yes | To verify the effectiveness of our information re-organization method, we conduct experiments across a range of contextually aware multi-hop reasoning tasks and datasets, including claim verification [12], question answering [13], and reading comprehension [14].
Dataset Splits | No | The paper lists the datasets and the number of example pairs in Appendix D, Table 8. While these datasets have standard splits, the paper does not explicitly state the train/validation/test counts or percentages used for the main experiments, especially given the zero-shot LLM setting. For the RL training of the pruning model, it states "We train the model for 1000 episodes. We conduct training for epoch 5, a batch size of 4, and a learning rate of 2e-6.", but it does not specify whether this training used a validation set or how the data was split for this internal model.
Hardware Specification | Yes | All experiments are conducted on an NVIDIA RTX A6000.
Software Dependencies | Yes | The LLMs employed in the extraction and reasoning process include Llama2-70B [2], GPT-3.5 (text-davinci-003) [32], and GPT-4 [3]. We use the official version of Llama2-70B. The specific version of GPT-4 is GPT-4-0613. In the policy model, we use the BERT-base version on all tasks and datasets.
Experiment Setup | Yes | We configure all models with the top_p parameter as 1.0 and the temperature as 0.0. In RL training, we calculate the F1 score between the generated answer and the reference answer as the reward, with a rescaling coefficient of 10. We train the model for 1000 episodes, with 5 training epochs, a batch size of 4, and a learning rate of 2e-6. The parameter ϵ is set to 0.2.
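As a concrete illustration of the reported setup, the sketch below wires together the deterministic decoding settings (temperature 0.0, top_p 1.0), the F1-based reward with its rescaling coefficient of 10, and the ϵ = 0.2 clipping used when training the BERT-base pruning policy with RL. This is a minimal reconstruction under assumptions, not the authors' implementation: the whitespace tokenization, the SQuAD-style token F1, and all names (DECODING_CONFIG, RL_CONFIG, token_f1, reward, clipped_ratio) are illustrative; only the numeric values come from the setup quoted above.

```python
# Minimal sketch of the reported hyperparameters; values come from the
# Experiment Setup row above, everything else is an illustrative assumption.
from collections import Counter

# Deterministic decoding configuration applied to all LLM calls.
DECODING_CONFIG = {"temperature": 0.0, "top_p": 1.0}

# RL training hyperparameters for the BERT-base pruning policy.
RL_CONFIG = {
    "episodes": 1000,
    "epochs": 5,
    "batch_size": 4,
    "learning_rate": 2e-6,
    "clip_epsilon": 0.2,     # clipping parameter epsilon (ϵ = 0.2)
    "reward_rescale": 10.0,  # rescaling coefficient applied to the F1 reward
}


def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-level F1 between a generated and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def reward(prediction: str, reference: str) -> float:
    """Reward = rescaling coefficient (10) times the answer-level F1 score."""
    return RL_CONFIG["reward_rescale"] * token_f1(prediction, reference)


def clipped_ratio(ratio: float) -> float:
    """Clip a policy probability ratio to [1 - eps, 1 + eps], PPO-style."""
    eps = RL_CONFIG["clip_epsilon"]
    return max(1.0 - eps, min(1.0 + eps, ratio))


if __name__ == "__main__":
    print(round(reward("paris france", "paris"), 2))  # 10 * 0.67 -> 6.67
```

In practice, the decoding dictionary would be passed to whichever client serves Llama2-70B, GPT-3.5, or GPT-4, and the clipped ratio would enter a standard PPO-style surrogate objective when updating the BERT-base policy; the paper itself does not specify these implementation details.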