Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

RePIC: Reinforced Post-Training for Personalizing Multi-Modal Language Models

Authors: Yeongtak Oh, Dohyun Chung, Juhyeon Shin, Sangha Park, Johan Barthelemy, Jisoo Mok, Sungroh Yoon

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In our experimental results, we reveal that SFT-based personalization methods are highly limited for visual recognition and generalization abilities. Conversely, by integrating our proposed reward templates along with curated datasets and instructions, our method achieves significant performance improvements over existing baselines, particularly in multi-concept personalized image captioning benchmarks.
Researcher Affiliation	Collaboration	(1) Department of Electrical and Computer Engineering, Seoul National University; (2) Department of Future Automotive Mobility, Seoul National University; (3) Interdisciplinary Program in Artificial Intelligence, Seoul National University; (4) Daegu Gyeongbuk Institute of Science and Technology; (5) NVIDIA
Pseudocode	No	The paper describes the GRPO algorithm and the proposed verifiable rewards (OCT, VLT, ICT) using textual descriptions, mathematical formulas (Eq 1-5), and diagrams (Figure 2). There are no distinct blocks labeled "Pseudocode" or "Algorithm".
Open Source Code	Yes	Project page: https://github.com/oyt9306/RePIC (Page 1). The corresponding training and inference codes are released as open-source on https://github.com/oyt9306/RePIC. (NeurIPS Checklist, Question 5 justification)
Open Datasets	Yes	We use real datasets such as COCO [29], Objects365 [45], and CelebA [33], from which we crop object regions to serve as reference images. However, as real data often lacks sufficient variation in attributes such as pose and lighting, we additionally incorporate high-quality, visually diverse synthetic images from Subject 200K+ [50]. (Page 4) The single-concept data are sourced from YoLLaVA, MyVLM, and DreamBooth. (Page 5) For multi-concept evaluation, we use the RAP-MLLM [15] dataset. (Page 5)
Dataset Splits	No	The paper lists several datasets used for evaluation and training, such as YoLLaVA, MyVLM, DreamBooth, RAP-MLLM, COCO, Objects365, CelebA, Subject 200K+, and Refcoco/+/g datasets. However, it does not explicitly provide the train/validation/test splits (e.g., percentages, sample counts, or specific instructions for generating splits) for its own experimental setup across these combined datasets.
Hardware Specification	Yes	All training experiments are conducted using 8 A40 GPUs, with inference performed on a single A40 GPU.
Software Dependencies	No	The paper mentions using LLaMA-Factory [64] for fine-tuning, Qwen2.5-VL-Instruct-7B as the base model, and the faker library. However, it does not provide specific version numbers for these software components or any other key libraries (e.g., Python, PyTorch versions) used in the implementation.
Experiment Setup	Yes	Our implementation is based on the open-source codebase. To train our model, we set LoRA rank as 64, LoRA alpha as 128, and use the number of generations per prompt as 8. The base model we used is Qwen2.5-VL-Instruct-7B (Page 11, B.1 Experimental Details). Our results indicate that the combination of βKL = 0.04 and a cutoff length of 100 yields the best performance. (Page 16, C.11 Analysis on Hyperparameter Sensitivity).
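
For readers who want the reported settings in one place, the hyperparameters quoted in the Experiment Setup row can be collected into a small record. This is a minimal sketch and not the authors' configuration file: the class name `ReportedSetup` and all field names are hypothetical, and only the values come from the quoted text. The `lora_scaling` property assumes the standard LoRA convention of scaling the adapter update by alpha / rank, which the quoted passage does not state.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ReportedSetup:
    """Hyperparameters quoted in the Experiment Setup row (field names are hypothetical)."""

    base_model: str = "Qwen2.5-VL-Instruct-7B"
    lora_rank: int = 64        # "LoRA rank as 64"
    lora_alpha: int = 128      # "LoRA alpha as 128"
    num_generations: int = 8   # generations per prompt
    beta_kl: float = 0.04      # KL coefficient reported as best
    cutoff_length: int = 100   # cutoff length reported as best

    @property
    def lora_scaling(self) -> float:
        # Assumption: standard LoRA scaling factor alpha / rank.
        return self.lora_alpha / self.lora_rank


setup = ReportedSetup()
for name, value in asdict(setup).items():
    print(f"{name}: {value}")
print(f"lora_scaling: {setup.lora_scaling}")
```

Under that assumption the adapter scaling works out to 128 / 64 = 2. Items the Software Dependencies row marks as unreported, such as library versions, are deliberately absent from the record.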