Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Beyond the Surface: Enhancing LLM-as-a-Judge Alignment with Human via Internal Representations

Authors: Peng Lai, Jianjie Zheng, Sijie Cheng, Yun Chen, Peng Li, Yang Liu, Guanhua Chen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate our method on the standard alignment benchmarks Flask, HelpSteer, and BIGGen using Spearman correlation, and find that LAGER achieves improvements of up to 7.5% over the best baseline across these benchmarks. Experiments on downstream applications, such as data selection and emotional understanding, further show the generalization of LAGER.
Researcher Affiliation	Academia	Peng Lai¹, Jianjie Zheng¹, Sijie Cheng², Yun Chen³, Peng Li², Yang Liu², Guanhua Chen¹; ¹Southern University of Science and Technology, ²Tsinghua University, ³Shanghai University of Finance and Economics
Pseudocode	No	The paper describes the LAGER framework using mathematical equations and textual explanations, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Our code is publicly available at https://github.com/sustech-nlp/LAGER.
Open Datasets	Yes	We chose three diverse point-wise benchmarks to ensure a comprehensive evaluation: Flask (Ye et al., 2024a), HelpSteer (Wang et al., 2023), and BiGGen Bench (Kim et al., 2024b; see footnote 4). In these benchmarks, human-annotated scores range from 1 to 5. Please refer to Appendix E.2 for the details. Footnote 4: https://huggingface.co/datasets/prometheus-eval/BiGGen-Bench-Results
Dataset Splits	Yes	We utilize the complete test prompt set from FLASK. ... HelpSteer (Wang et al., 2023) is an open-source Helpfulness Dataset. ... A total of 8.95k data points are generated, with the first 2k used for our evaluation. ... We utilize the human evaluation test set. ... We randomly select 1,000 samples from the HelpSteer dataset as a held-out validation set, ensuring that it does not overlap with the test set, to tune the layer-wise weights of LAGER.
Hardware Specification	Yes	During the SFT training of the Llama-3-8B model, we use 4 L40 GPUs, and the total batch size during the training phase is 64.
Software Dependencies	No	All experiments are implemented using PyTorch. (No version number for PyTorch or other specific libraries is provided.)
Experiment Setup	Yes	We propose two types of layer weights w. One is to apply average aggregation w_l = 1/(L + 1) (denoted as LAGER (w.o. tuning) in Table 1), the other is to tune the lightweight L + 1 parameters (L is the number of transformer layers) with a small-scale validation set (denoted as LAGER (w. tuning)). With a frozen backbone and minimal learnable parameters, we refer to the dataset as a validation set to distinguish it from finetuning-based LLM evaluators. To enhance the model's performance while aligning its predictions more closely with the distribution of human scores, we adopt a combination of cross-entropy (CE) loss and mean absolute error (MAE) loss, balanced by a weighting hyperparameter α: L_Final = α L_CE + (1 − α) L_MAE ... We randomly select 1,000 samples from the HelpSteer dataset as a held-out validation set, ensuring that it does not overlap with the test set, to tune the layer-wise weights of LAGER. The training is performed using the Adam optimizer with an initial learning rate of 0.01, and a batch size of 4. A random seed of 42 is set to ensure the reproducibility of the experiment. We also apply the ReduceLROnPlateau learning rate scheduler with a decay factor of 0.5, a patience value of 1, and a minimum learning rate specified by min_lr. Since some models in the Qwen2.5 family may not converge with just one epoch, we set the number of training epochs to 2 for all models in the Qwen2.5 series. For other backbone models, we set the number of training epochs to 1. All experiments are implemented using PyTorch. ... Training Parameters Setting: stage = sft; finetuning_type = full; template = alpaca; flash_attn = fa2; cutoff_len = 2048; learning_rate = 2e-5; num_train_epochs = 3; per_device_train_batch_size = 4; gradient_accumulation_steps = 4; lr_scheduler_type = cosine; warmup_ratio = 0.03; packing = FALSE; bf16 = TRUE; tf32 = TRUE; optim = adamw_torch; include_num_input_tokens_seen = TRUE; seed = 42
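
The experiment setup above quotes two formulas: the uniform layer weights w_l = 1/(L + 1) and the combined loss L_Final = α L_CE + (1 − α) L_MAE. The following is a minimal sketch of both in plain Python, not the authors' implementation. It assumes, without confirmation from the text quoted here, that each layer yields one logit per candidate score (1 to 5), that the weighted logits are passed through a softmax, and that the MAE term compares the expected score with the human score. The function names and the value alpha=0.5 are hypothetical; the quoted text does not state the value of α.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_logits, weights=None):
    """Weighted sum of per-layer score logits.

    layer_logits: L + 1 lists, each holding one logit per candidate score.
    weights: L + 1 layer weights; defaults to the uniform w_l = 1/(L + 1).
    """
    n_layers = len(layer_logits)
    if weights is None:
        weights = [1.0 / n_layers] * n_layers
    n_scores = len(layer_logits[0])
    return [
        sum(w * layer[k] for w, layer in zip(weights, layer_logits))
        for k in range(n_scores)
    ]


def combined_loss(logits, human_score, alpha=0.5, scores=(1, 2, 3, 4, 5)):
    """L_Final = alpha * L_CE + (1 - alpha) * L_MAE for one example.

    alpha=0.5 is a placeholder, not a value taken from the paper.
    """
    probs = softmax(logits)
    # Cross-entropy against the human-annotated score treated as a class label.
    ce = -math.log(probs[scores.index(human_score)])
    # Assumption: MAE is taken between the expected score and the human score.
    expected = sum(s * p for s, p in zip(scores, probs))
    mae = abs(expected - human_score)
    return alpha * ce + (1 - alpha) * mae


# Example: three layers of logits, uniform weights, human score of 4.
layers = [[0.1, 0.2, 0.5, 1.0, 0.3], [0.0, 0.1, 0.4, 1.2, 0.6], [0.2, 0.2, 0.2, 0.9, 0.4]]
loss = combined_loss(aggregate_layers(layers), human_score=4)
```

In the tuned variant described above, the L + 1 weights would be the only learnable parameters, with the backbone frozen; the sketch only evaluates the loss and leaves optimization (Adam with ReduceLROnPlateau in the quoted setup) out of scope.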