Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification
Authors: Hyunseok Lee, Seunghyuk Oh, Jaehyung Kim, Jinwoo Shin, Jihoon Tack
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on various reasoning tasks demonstrate that ReVISE achieves efficient self-correction and significantly improves the reasoning performance of LLMs. |
| Researcher Affiliation | Academia | ¹KAIST ²Yonsei University. Correspondence to: Jihoon Tack <EMAIL>. |
| Pseudocode | No | The paper describes the methods verbally and with figures, but does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code available at: github.com/seunghyukoh/revise |
| Open Datasets | Yes | We demonstrated the effectiveness of ReVISE through evaluations on multiple reasoning datasets across mathematical and coding domains. Notably, ReVISE enhances reasoning performance beyond prior methods, improving accuracy from 27.1% to 31.1% on GSM8K (Maj@3) (Cobbe et al., 2021) with Llama3 1B (Dubey et al., 2024) and from 33.2% to 36.0% on MATH (Maj@3) (Hendrycks et al., 2021) with Llama3 8B. Furthermore, our experimental results show that ReVISE consistently improves accuracy without relying on external feedback mechanisms, which often degrade performance on complex reasoning tasks. For instance, unlike approaches such as Self-Refine (Madaan et al., 2023), which struggle when combined with existing models on complex tasks, ReVISE achieves these gains purely through self-verification and self-correction. Finally, we show that the proposed sampling scheme is more efficient than other sampling strategies when applied to models trained with ReVISE, further enhancing performance. |
| Dataset Splits | Yes | For GSM8K (Cobbe et al., 2021), we train ReVISE using the original training split. For MATH (Hendrycks et al., 2021), we train ReVISE using a 50k subset of MetaMath (Yu et al., 2024), an augmented version of MATH, and use a 3k subset as the validation set. |
| Hardware Specification | Yes | For the main development we mainly used an Intel(R) Xeon(R) Platinum 8480+ CPU @ 790MHz and 8 NVIDIA H100 GPUs. Additionally, we used NVIDIA RTX 4090 GPUs for evaluation. |
| Software Dependencies | No | Evaluation details: We used lm-eval-harness for greedy decoding experiments and our own code to evaluate models in sampling settings. Since the output depends on the evaluation batch size, we fixed the batch size to 128 for a fair comparison. |
| Experiment Setup | Yes | Training details of ReVISE: We use the AdamW optimizer with a learning rate lr ∈ {1e-4, 1e-5} with 10% warm-up and cosine decay, and train for one epoch. We trained with batch size 32 for fine-tuning and 64 for preference tuning. For the λ constant for the SFT loss, we used λ = 0.1. During training, for the data sampling phase, we sampled 10 times for each sample in GSM8K and 4 times for each sample in MATH. |
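The quoted training setup describes a learning-rate schedule with 10% linear warm-up followed by cosine decay. A minimal sketch of such a schedule is shown below; the function name, the decay-to-zero endpoint, and the per-step granularity are assumptions for illustration, not details taken from the paper.

```python
import math

def lr_schedule(step, total_steps, base_lr=1e-4, warmup_frac=0.1):
    """Linear warm-up over the first `warmup_frac` of steps, then cosine
    decay toward zero. base_lr mirrors one of the reported values
    (lr in {1e-4, 1e-5}); the zero endpoint is an assumption."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In a real run this would typically be handed to the optimizer via a scheduler (e.g. a LambdaLR-style wrapper in PyTorch) rather than called manually each step.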