Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

PEARL: Towards Permutation-Resilient LLMs

Authors: Liang CHEN, Li Shen, Yang Deng, Xiaoyan Zhao, Bin Liang, Kam-Fai Wong

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on synthetic pre-training and real-world instruction tuning tasks demonstrate that PEARL effectively mitigates permutation attacks and enhances performance.
Researcher Affiliation	Academia	1 The Chinese University of Hong Kong; 2 Shenzhen Campus of Sun Yat-sen University; 3 SMU
Pseudocode	Yes	Algorithm 1: Adversarial Optimization Algorithm for PEARL
Open Source Code	Yes	The code is available at https://github.com/ChanLiang/PEARL.
Open Datasets	Yes	We validate our method in two scenarios: (1) pretraining a transformer to in-context learn linear functions (Garg et al., 2022), and (2) instruction tuning of LLMs on the Super-Natural Instructions (Wang et al., 2022).
Dataset Splits	Yes	We selected 17 representative tasks, comprising 9 natural language generation (NLG) tasks and 8 natural language understanding (NLU) tasks. Following the methodology of Wang et al. (2022), we randomly designated 4 datasets as held-out test sets and used the remaining 13 datasets for training. Each training dataset contains 150 examples, and each test dataset contains 100 examples, resulting in a training set of 1,950 examples and a test set of 400 examples, as summarized in Table 2.
Hardware Specification	Yes	We train the models on the instruction dataset for two epochs using a single NVIDIA A40 GPU, with a batch size of 16, resulting in a total of 246 training steps.
Software Dependencies	No	The paper mentions models and optimizers such as GPT-2, AdamW, BERT-base, LLaMA3-8B, FLAN-large, and LoRA, but does not provide version numbers for any underlying software (e.g., PyTorch, TensorFlow, or Python).
Experiment Setup	Yes	Key training parameters include a batch size of 128 and 500k training steps. In the PEARL framework, the P-Net is initialized as a BERT-base (Devlin et al., 2019a) and also trained from scratch. ... We train the models on the instruction dataset for two epochs using a single NVIDIA A40 GPU, with a batch size of 16, resulting in a total of 246 training steps. The optimizer used was AdamW. The learning rates for the P-Net and the LLM are set to 1 × 10⁻⁴ and 3 × 10⁻⁴, respectively. For the Sinkhorn algorithm, we use 80 iterations, a temperature parameter of 0.1, and an entropy constraint coefficient β = 1.0. Table 6 also lists hyperparameter settings.
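
The Experiment Setup entry quotes Sinkhorn settings (80 iterations, temperature 0.1) without showing what the procedure does. The block below is a minimal, generic sketch of Sinkhorn normalization in log space, not the authors' implementation: the function name `sinkhorn`, the toy score matrix, and the target permutation `perm` are hypothetical, and the entropy constraint coefficient β is not modeled. It only illustrates how alternating row and column normalization turns a score matrix into an approximately doubly stochastic "soft permutation", and how a low temperature pushes it toward a hard permutation.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sinkhorn(scores, n_iters=80, temperature=0.1):
    """Map a square score matrix to an approximately doubly stochastic matrix.

    The defaults mirror the values quoted in the table; everything else
    here is an illustrative assumption.
    """
    n = len(scores)
    # Lower temperature sharpens the result toward a hard permutation.
    log_p = [[s / temperature for s in row] for row in scores]
    for _ in range(n_iters):
        # Row step: make every row sum to 1 (in probability space).
        for i in range(n):
            z = logsumexp(log_p[i])
            log_p[i] = [v - z for v in log_p[i]]
        # Column step: make every column sum to 1.
        for j in range(n):
            z = logsumexp([log_p[i][j] for i in range(n)])
            for i in range(n):
                log_p[i][j] -= z
    return [[math.exp(v) for v in row] for row in log_p]


# Toy scores that favor the (hypothetical) permutation i -> perm[i].
perm = [2, 0, 3, 1]
scores = [[1.0 if j == perm[i] else 0.0 for j in range(4)] for i in range(4)]

soft_perm = sinkhorn(scores)
# Reading off the largest entry in each row recovers the favored permutation.
hard_perm = [max(range(4), key=lambda j: row[j]) for row in soft_perm]
```

Working in log space avoids overflow when scores are divided by a small temperature; with less structured scores, more iterations may be needed before the row sums settle near 1.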