Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Flow of Reasoning: Training LLMs for Divergent Reasoning with Minimal Examples

Authors: Fangxu Yu, Lai Jiang, Haoqiang Kang, Shibo Hao, Lianhui Qin

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments show that, with limited training examples (e.g., 15 examples), FoR enables the discovery of diverse, creative, high-quality solutions, greatly outperforming a wide range of existing inference and training methods across six challenging reasoning tasks, including Blocks World (embodied reasoning), Game24 (math puzzle solving), Rubik's Cube (spatial reasoning), 1D-ARC (abstraction reasoning), GSM8k (math reasoning), and ProntoQA (logical reasoning). [...] Empirical results show that FoR, with limited (e.g., 15) training examples, generates diverse, high-quality solutions, greatly outperforming a wide range of baselines with 20%–85% improvements, including supervised training methods like SFT, reward-maximizing RL like PPO, diversity-seeking approaches like GFN-CoT, various decoding methods, and advanced inference methods like CoT, ToT, GoT, and RAP. Ablation studies further validate the key designs in FoR that lead to robustness and effectiveness.
Researcher Affiliation Academia University of California, San Diego. Correspondence to: Lianhui Qin <EMAIL>.
Pseudocode Yes Algorithm 1 describes the training framework.
Open Source Code Yes Code is available at https://github.com/Yu-Fangxu/FoR.
Open Datasets Yes Extensive experiments show that, with limited training examples (e.g., 15 examples), FoR enables the discovery of diverse, creative, high-quality solutions, greatly outperforming a wide range of existing inference and training methods across six challenging reasoning tasks, including Blocks World (embodied reasoning; Kambhampati et al., 2024), Game24 (math puzzle solving; Yao et al., 2024), Rubik's Cube (spatial reasoning; Ding et al., 2023), 1D-ARC (abstraction reasoning; Xu et al., 2023b), GSM8k (math reasoning; Cobbe et al., 2021), and ProntoQA (logical reasoning; Saparov & He, 2022; see Appendix 4.7).
Dataset Splits Yes Blocksworld examples (Valmeekam et al., 2024) are grouped by the minimum number of required actions: 30 examples for 2 steps, 57 for 4 steps, and 114 for 6 steps, following Hao et al. (2023). We select the first 15 of each group as the training examples for FoR and the rest as test examples. [...] We use the LLM-reasoner dataset (Hao et al., 2024) and randomly select 20 examples for training and 100 examples for testing. [...] We randomly select 15 examples from the training dataset from (Ding et al., 2023), and evaluate different methods on a test set containing 183 examples. [...] We randomly select 5 examples from the 1d_move_1p, 1d_padded_fill, and 1d_denoising tasks. [...] The 15 selected examples form the training set, while the remaining 45 examples from each task form the test dataset. [...] We use the last 50 training examples in the original training dataset, and we sample 4 times for every problem at inference. [...] We randomly select 50 examples for the training set and 120 for the test set.
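The Blocksworld split quoted above (first 15 examples of each step-group for training, the remainder for testing) can be sketched as follows. This is a hypothetical illustration, not the authors' code; the group sizes (30, 57, 114) come from the quoted text, and the example objects are placeholder indices.

```python
# Illustrative sketch of the described Blocksworld split:
# groups keyed by minimum number of required actions (2, 4, or 6 steps).
groups = {2: list(range(30)), 4: list(range(57)), 6: list(range(114))}

train, test = {}, {}
for steps, examples in groups.items():
    train[steps] = examples[:15]   # first 15 per group -> training set
    test[steps] = examples[15:]    # remainder -> test set
```

This yields 15 training examples per group and 15, 42, and 99 test examples for the 2-, 4-, and 6-step groups respectively, matching the quoted counts.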
Hardware Specification Yes All experiments were conducted using a server with a single NVIDIA A100 GPU.
Software Dependencies No The paper mentions fine-tuning with LoRA and using various LLMs (Llama-3-8B, Qwen2.5-Math-PRM-7B) but does not provide specific version numbers for general software dependencies or libraries (e.g., Python, PyTorch versions).
Experiment Setup Yes During training, we finetune the LLM with LoRA (Hu et al., 2021) with r=32, α=64, and dropout=0.1. We decrease ϵ linearly from 0.3 to 0.01, increase β from 1 to 2, and increase the replay-buffer sampling probability δ from 0.3 to 0.5 linearly over the iterations. The learning rate is set to 1e-4 with a cosine annealing schedule, and the number of training iterations is set to 10. The reward weight λ is set to 1.5. [...] We use LoRA to train the model with r=8, α=32, and dropout=0.1. We load the LLM in fp16 and set the hyperparameters as follows: batch size = 4, learning rate = 1e-5, number of epochs = 5, and reward weight w=100.
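The linear hyperparameter schedules quoted above (ϵ decreasing from 0.3 to 0.01, β increasing from 1 to 2, and the replay-buffer probability δ increasing from 0.3 to 0.5 over 10 training iterations) can be sketched as below. The function name and interpolation details are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the linearly annealed hyperparameters described above.
def linear_schedule(start: float, end: float, step: int, total_steps: int) -> float:
    """Linearly interpolate from start (step 0) to end (final step)."""
    frac = step / max(total_steps - 1, 1)
    return start + (end - start) * frac

total = 10  # number of training iterations reported in the paper
eps   = [linear_schedule(0.3, 0.01, t, total) for t in range(total)]
beta  = [linear_schedule(1.0, 2.0, t, total) for t in range(total)]
delta = [linear_schedule(0.3, 0.5, t, total) for t in range(total)]
```

At iteration 0 the schedules start at (ϵ=0.3, β=1, δ=0.3) and reach (ϵ=0.01, β=2, δ=0.5) at the final iteration.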