Information Re-Organization Improves Reasoning in Large Language Models
Authors: Xiaoxia Cheng, Zeqi Tan, Wei Xue, Weiming Lu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate the effectiveness of our approach in improving the reasoning ability, we conduct experiments using Llama2-70B, GPT-3.5, and GPT-4 on various contextually aware multi-hop reasoning tasks. |
| Researcher Affiliation | Academia | Xiaoxia Cheng, Zeqi Tan, Wei Xue, Weiming Lu College of Computer Science and Technology Zhejiang University |
| Pseudocode | No | The paper describes the method using prose and equations, but it does not include a formal pseudocode block or algorithm listing. |
| Open Source Code | Yes | Our source code is available at https://github.com/hustcxx/InfoRE. |
| Open Datasets | Yes | To verify the effectiveness of our information re-organization method, we conduct experiments across a range of contextually aware multi-hop reasoning tasks and datasets, including claim verification [12], question answering [13], and reading comprehension [14]. |
| Dataset Splits | No | The paper lists the datasets and the number of examples (pairs) in Appendix D, Table 8. While these datasets have standard splits, the paper does not explicitly state the train/validation/test percentages or counts used for the main experiments, especially given the zero-shot LLM setting. For the RL training of the pruning model, it states that the model is trained for 1000 episodes with 5 epochs, a batch size of 4, and a learning rate of 2e-6, but it does not specify whether this training used a validation set or how the data was split for this internal model. |
| Hardware Specification | Yes | All experiments are conducted on an NVIDIA RTX A6000. |
| Software Dependencies | Yes | The LLMs employed in the extraction and reasoning process include Llama2-70B [2], GPT-3.5 (text-davinci-003) [32], and GPT-4 [3]. We use the official version of Llama2-70B. The specific version of GPT-4 is GPT-4-0613. For the policy model, we use the BERT-base version on all tasks and datasets. (A sketch of instantiating such a policy follows the table.) |
| Experiment Setup | Yes | We configure all models with the top_p parameter set to 1.0 and the temperature set to 0.0. In RL training, we calculate the F1 score between the generated answer and the reference answer as the reward, with a rescaling coefficient of 10. We train the model for 1000 episodes, with 5 epochs, a batch size of 4, and a learning rate of 2e-6. The parameter ϵ is set to 0.2. (Illustrative sketches of the policy model, the reward, and the clipped objective follow the table.) |
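
The Software Dependencies row states that the pruning policy model is the BERT-base version. Below is a minimal sketch of instantiating such a policy with Hugging Face `transformers`; the `bert-base-uncased` checkpoint and the binary keep/prune classification head are assumptions, since the row does not quote the paper's exact configuration.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumptions: the "bert-base-uncased" checkpoint and a two-way keep/prune
# head; the paper only states that the BERT-base version is used as the
# policy model on all tasks and datasets.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
policy = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

inputs = tokenizer("a candidate re-organized item", return_tensors="pt")
logits = policy(**inputs).logits  # unnormalized scores over {prune, keep}
```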
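
The Experiment Setup row fixes greedy decoding (top_p = 1.0, temperature = 0.0) and defines the RL reward as the F1 score between the generated answer and the reference answer, rescaled by a coefficient of 10. A minimal sketch of that reward follows; the whitespace tokenization and lowercasing are assumptions, since the row does not quote the paper's exact F1 implementation.

```python
from collections import Counter

# Decoding configuration quoted in the paper (effectively greedy decoding).
GEN_CONFIG = {"top_p": 1.0, "temperature": 0.0}

RESCALE = 10.0  # rescaling coefficient quoted in the paper


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a generated answer and the reference.
    Whitespace tokenization and lowercasing are assumptions."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def reward(prediction: str, reference: str) -> float:
    """RL reward: F1 between generated and reference answers, scaled by 10."""
    return RESCALE * token_f1(prediction, reference)
```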
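
The quoted ϵ = 0.2 matches the standard clipping threshold of a PPO-style surrogate objective; that the paper uses exactly this objective is an assumption, as the row only states the value of ϵ. Under that assumption, the clipped policy loss for the pruning model would look like:

```python
import torch


def ppo_clipped_loss(
    log_probs: torch.Tensor,
    old_log_probs: torch.Tensor,
    advantages: torch.Tensor,
    epsilon: float = 0.2,  # the paper sets ϵ = 0.2
) -> torch.Tensor:
    """PPO clipped surrogate loss (an assumption about how ϵ is used)."""
    ratio = torch.exp(log_probs - old_log_probs)  # pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # Maximize the clipped surrogate, i.e. minimize its negation.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```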