Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Self Iterative Label Refinement via Robust Unlabeled Learning

Authors: Hikaru Asano, Tadashi Kozuno, Yukino Baba

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Evaluations on diverse datasets, including low-resource language corpora, patent classifications, and protein structure categorizations, demonstrate that our method consistently outperforms both the initial LLM's classification performance and the self-refinement approaches by cutting-edge models (e.g., GPT-4o and DeepSeek-R1). Moreover, we experimentally confirm that our refined classifier facilitates effective post-training alignment for safety in LLMs and demonstrate successful self-refinement in generative tasks as well.
Researcher Affiliation	Collaboration	Hikaru Asano The University of Tokyo Tokyo, Japan EMAIL Tadashi Kozuno OMRON SINIC X Tokyo, Japan EMAIL Yukino Baba The University of Tokyo Tokyo, Japan EMAIL
Pseudocode	No	The paper describes the iterative pipeline in prose and illustrates it with Figure 1, but does not provide a formal pseudocode block or algorithm.
Open Source Code	Yes	Our code is available at https://github.com/HikaruAsano/self-iterative-label-refinement.
Open Datasets	Yes	We evaluate our approach on several public datasets, including low-resource language corpora, patent classification tasks, and protein structure classification. Notably, as illustrated in Figure 3, even in cases where self-refinement methods based on LLMs, or advanced reasoning models such as DeepSeek-R1 [13], fail to produce any performance improvement, our iterative UU learning framework successfully refines its outputs and achieves classification performance that surpasses that of both the original LLM and existing self-improvement pipelines. Moreover, we experimentally show that our refined classifiers, integrated into RLAIF frameworks, effectively achieve safety alignment without extensive human annotation. This underscores our method's potential for comprehensive, robust LLM self-refinement. Datasets: We use six binary classification datasets grouped into two categories based on their difficulty. Table 2 reports the dataset statistics, and Table 3 provides examples for positive and negative cases (see Appendix B).
Dataset Splits	Yes	For all experiments, we randomly partitioned each dataset into training, validation (for best epoch selection), and test splits in a 7:1:2 ratio.
Hardware Specification	Yes	In our experiments, we employed 8 NVIDIA A100 GPUs (80GB) and leveraged Accelerate for distributed training across multiple GPUs.
Software Dependencies	No	We based our implementation on the transformers library and conducted training and inference using PyTorch. In our experiments, we employed 8 NVIDIA A100 GPUs (80GB) and leveraged Accelerate for distributed training across multiple GPUs.
Experiment Setup	Yes	Training Procedure: We train the classifier by appending an affine layer to the transformer's final hidden state to yield a one-dimensional score. For fine-tuning efficiency, we employ QLoRA [14] with 4-bit quantization. At each iteration, we fine-tune the model using the pseudo-positive and pseudo-negative corpora with the AdamW optimizer (learning rate = 1.0 × 10^-4, batch size = 16, and 3 epochs) and fix the robust UU learning hyperparameter λ at -0.001. At the end of each epoch, we compute the loss on a pseudo-labeled validation set and select the model with the lowest loss to re-label the entire dataset. Additional parameters are detailed in Table 4.
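
For readers who want to reproduce the 7:1:2 partition quoted under Dataset Splits, the following is a minimal sketch of one way to do it, not the authors' code. The function name `split_indices`, the fixed seed, and the floor-based rounding (any remainder goes to the test split) are all assumptions made for illustration; the paper excerpt states only that the split is random and in a 7:1:2 ratio.

```python
import random


def split_indices(n, ratios=(7, 1, 2), seed=0):
    """Shuffle range(n) and cut it into train/validation/test index lists.

    Hypothetical helper: the seed and the rounding rule are assumptions,
    not details taken from the paper.
    """
    indices = list(range(n))
    # Use a local generator so the global random state is left untouched.
    random.Random(seed).shuffle(indices)

    total = sum(ratios)
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    # The test split takes whatever remains after train and validation.
    test = indices[n_train + n_val:]
    return train, val, test


train, val, test = split_indices(1000)
print(len(train), len(val), len(test))
```

The returned index lists are disjoint and together cover every example, so the validation split used for best-epoch selection never overlaps the test split.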