Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification
Authors: Hyunseok Lee, Seunghyuk Oh, Jaehyung Kim, Jinwoo Shin, Jihoon Tack
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on various reasoning tasks demonstrate that ReVISE achieves efficient self-correction and significantly improves the reasoning performance of LLMs. |
| Researcher Affiliation | Academia | ¹KAIST ²Yonsei University. Correspondence to: Jihoon Tack <EMAIL>. |
| Pseudocode | No | The paper describes the methods verbally and with figures, but does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code available at: github.com/seunghyukoh/revise |
| Open Datasets | Yes | We demonstrated the effectiveness of ReVISE through evaluations on multiple reasoning datasets across mathematical and coding domains. Notably, ReVISE enhances reasoning performance beyond prior methods, improving accuracy from 27.1% to 31.1% on GSM8K (Maj@3) (Cobbe et al., 2021) with Llama3 1B (Dubey et al., 2024) and from 33.2% to 36.0% on MATH (Maj@3) (Hendrycks et al., 2021) with Llama3 8B. Furthermore, our experimental results show that ReVISE consistently improves accuracy without relying on external feedback mechanisms, which often degrade performance on complex reasoning tasks. For instance, unlike approaches such as Self-Refine (Madaan et al., 2023), which struggle when combined with existing models on complex tasks, ReVISE achieves these gains purely through self-verification and self-correction. Finally, we show that the proposed sampling scheme is more efficient than other sampling strategies when applied to models trained with ReVISE, further enhancing performance. |
| Dataset Splits | Yes | For GSM8K (Cobbe et al., 2021), we train ReVISE using the original training split. For MATH (Hendrycks et al., 2021), we train ReVISE using a 50k subset of MetaMath (Yu et al., 2024), an augmented version of MATH, and use a 3k subset as the validation set. |
| Hardware Specification | Yes | For the main development we mainly used an Intel(R) Xeon(R) Platinum 8480+ CPU @ 790MHz and 8 NVIDIA H100 GPUs. Additionally, we used NVIDIA RTX 4090 GPUs for evaluation. |
| Software Dependencies | No | Evaluation details: We used lm-eval-harness for greedy decoding experiments and our own code to evaluate models in sampling settings. Since the output depends on the evaluation batch size, we fixed the batch size to 128 for a fair comparison. |
| Experiment Setup | Yes | Training details of ReVISE: We use the AdamW optimizer with a learning rate lr ∈ {1e-4, 1e-5} with 10% warm-up and cosine decay, and train for one epoch. We trained with batch size 32 for fine-tuning and 64 for preference tuning. For the λ constant for the SFT loss, we used λ = 0.1. During training, for the data sampling phase, we sampled 10 times for each sample in GSM8K and 4 times for each sample in MATH. |
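The quoted training setup describes a learning-rate schedule with 10% linear warm-up followed by cosine decay. A minimal sketch of such a schedule is shown below; the function name, the decay-to-zero endpoint, and the per-step granularity are assumptions for illustration, not details taken from the paper.

```python
import math

def lr_schedule(step, total_steps, base_lr=1e-4, warmup_frac=0.1):
    """Linear warm-up over the first `warmup_frac` of steps, then cosine
    decay toward zero. base_lr mirrors one of the reported values
    (lr in {1e-4, 1e-5}); the zero endpoint is an assumption."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In a real run this would typically be handed to the optimizer via a scheduler (e.g. a LambdaLR-style wrapper in PyTorch) rather than called manually each step.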