Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Flow of Reasoning: Training LLMs for Divergent Reasoning with Minimal Examples
Authors: Fangxu Yu, Lai Jiang, Haoqiang Kang, Shibo Hao, Lianhui Qin
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that, with limited training examples (e.g., 15 examples), FoR enables the discovery of diverse, creative, high-quality solutions, greatly outperforming a wide range of existing inference and training methods across six challenging reasoning tasks, including Blocks World (embodied reasoning), Game24 (math puzzle solving), Rubik's Cube (spatial reasoning), 1D-ARC (abstraction reasoning), GSM8k (math reasoning), and ProntoQA (logical reasoning). [...] Empirical results show that FoR, with limited (e.g. 15) training examples, generates diverse, high-quality solutions, greatly outperforming a wide range of baselines with 20%–85% improvements, including supervised training methods like SFT, reward-maximizing RL like PPO, diversity-seeking approaches like GFN-CoT and various decoding methods, and advanced inference methods like CoT, ToT, GoT, and RAP. Ablation studies further validate the key designs in FoR that lead to robustness and effectiveness. |
| Researcher Affiliation | Academia | 1University of California San Diego. Correspondence to: Lianhui Qin <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 describes the training framework. |
| Open Source Code | Yes | Code is available at https://github.com/Yu-Fangxu/FoR. |
| Open Datasets | Yes | Extensive experiments show that, with limited training examples (e.g., 15 examples), FoR enables the discovery of diverse, creative, high-quality solutions, greatly outperforming a wide range of existing inference and training methods across six challenging reasoning tasks, including Blocks World (embodied reasoning; Kambhampati et al., 2024), Game24 (math puzzle solving; Yao et al., 2024), Rubik's Cube (spatial reasoning; Ding et al., 2023), 1D-ARC (abstraction reasoning; Xu et al., 2023b), GSM8k (math reasoning; Cobbe et al., 2021), and ProntoQA (logical reasoning; Saparov & He, 2022; see Appendix 4.7). |
| Dataset Splits | Yes | Blocksworld examples (Valmeekam et al., 2024) are grouped by the minimum number of required actions: 30 examples for 2 steps, 57 for 4 steps, and 114 for 6 steps, following Hao et al. (2023). We select the first 15 of each group as the training examples for FoR and the rest as test examples. [...] We use the LLM-reasoner dataset (Hao et al., 2024) and randomly select 20 examples for training and 100 examples for testing. [...] We randomly select 15 examples from the training dataset from (Ding et al., 2023), and evaluate different methods on a test set containing 183 examples. [...] We randomly select 5 examples from the 1d_move_1p, 1d_padded_fill, and 1d_denoising tasks. [...] The 15 selected examples form the training set, while the remaining 45 examples from each task form the test dataset. [...] We use the last 50 training examples in the original training dataset, and we sample 4 times for every problem at inference. [...] We randomly select 50 examples for the training set and 120 for the test set. |
| Hardware Specification | Yes | All experiments were conducted using a server with a single NVIDIA A100 GPU. |
| Software Dependencies | No | The paper mentions fine-tuning with LoRA and using various LLMs (Llama-3-8B, Qwen2.5-Math-PRM-7B) but does not provide specific version numbers for general software dependencies or libraries (e.g., Python, PyTorch versions). |
| Experiment Setup | Yes | During the training, we finetune the LLM with LoRA (Hu et al., 2021) with r=32, α=64, and dropout=0.1. We set ϵ to 0.3 and decrease it to 0.01, β from 1 to 2, and the probability δ of using the replay buffer increases from 0.3 to 0.5 throughout the iterations linearly. The learning rate is set to 1e-4 with a cosine annealing schedule, and the number of training iterations is set to 10. Reward weight λ is set to 1.5. [...] We use LoRA to train the model with r=8, α=32, dropout=0.1. We load the LLM in fp16, and set the hyperparameters as follows: batch size = 4, learning rate = 1e-5, number of epochs = 5, and the reward weight w=100. |
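The hyperparameter schedules quoted above (ϵ: 0.3 → 0.01, β: 1 → 2, replay-buffer probability δ: 0.3 → 0.5, over 10 iterations) can be sketched as follows. This is a minimal illustration, assuming each value is interpolated linearly across iterations, as the excerpt states for δ; `linear_schedule` is a hypothetical helper, not taken from the paper's released code.

```python
def linear_schedule(start, end, step, total_steps):
    """Linearly interpolate from `start` at step 0 to `end` at the final step."""
    frac = step / max(total_steps - 1, 1)
    return start + (end - start) * frac

num_iters = 10  # number of training iterations reported in the excerpt
eps_values   = [linear_schedule(0.3, 0.01, t, num_iters) for t in range(num_iters)]
beta_values  = [linear_schedule(1.0, 2.0,  t, num_iters) for t in range(num_iters)]
delta_values = [linear_schedule(0.3, 0.5,  t, num_iters) for t in range(num_iters)]
```

At iteration 0 the three values are exactly (0.3, 1.0, 0.3), and at the last iteration they reach (0.01, 2.0, 0.5); whether ϵ and β actually follow this same linear shape is not specified in the quoted text.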