Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].
PEARL: Towards Permutation-Resilient LLMs
Authors: Liang CHEN, Li Shen, Yang Deng, Xiaoyan Zhao, Bin Liang, Kam-Fai Wong
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on synthetic pre-training and real-world instruction tuning tasks demonstrate that PEARL effectively mitigates permutation attacks and enhances performance. |
| Researcher Affiliation | Academia | ¹The Chinese University of Hong Kong, ²Shenzhen Campus of Sun Yat-sen University, ³SMU |
| Pseudocode | Yes | Algorithm 1: Adversarial Optimization Algorithm for PEARL |
| Open Source Code | Yes | The code is available at https://github.com/ChanLiang/PEARL. |
| Open Datasets | Yes | We validate our method in two scenarios: (1) pretraining a transformer to in-context learn linear functions (Garg et al., 2022), and (2) instruction tuning of LLMs on the Super-Natural Instructions (Wang et al., 2022). |
| Dataset Splits | Yes | We selected 17 representative tasks, comprising 9 natural language generation (NLG) tasks and 8 natural language understanding (NLU) tasks. Following the methodology of Wang et al. (2022), we randomly designated 4 datasets as held-out test sets and used the remaining 13 datasets for training. Each training dataset contains 150 examples, and each test dataset contains 100 examples, resulting in a training set of 1,950 examples and a test set of 400 examples, as summarized in Table 2. |
| Hardware Specification | Yes | We train the models on the instruction dataset for two epochs using a single NVIDIA A40 GPU, with a batch size of 16, resulting in a total of 246 training steps. |
| Software Dependencies | No | The paper mentions models and optimizers such as GPT-2, AdamW, BERT-base, LLaMA3-8B, FLAN-large, and LoRA, but does not provide version numbers for any underlying software (e.g., PyTorch, TensorFlow, or Python). |
| Experiment Setup | Yes | Key training parameters include a batch size of 128 and 500k training steps. In the PEARL framework, the P-Net is initialized as a BERT-base (Devlin et al., 2019a) and also trained from scratch. ... We train the models on the instruction dataset for two epochs using a single NVIDIA A40 GPU, with a batch size of 16, resulting in a total of 246 training steps. The optimizer used was AdamW. The learning rates for the P-Net and the LLM are set to 1 × 10⁻⁴ and 3 × 10⁻⁴, respectively. For the Sinkhorn algorithm, we use 80 iterations, a temperature parameter of 0.1, and an entropy constraint coefficient β = 1.0. Table 6 also lists hyperparameter settings. |
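
The Experiment Setup row mentions a Sinkhorn algorithm run for 80 iterations with a temperature of 0.1. As a rough illustration of what such a step commonly computes, the sketch below applies Sinkhorn normalization to a square score matrix: scores are divided by the temperature, then rows and columns are normalized alternately in log space so the result approaches a doubly stochastic ("soft permutation") matrix. The function name `sinkhorn`, the helper `logsumexp`, the example scores, and the log-space formulation are assumptions made for illustration; none of this is taken from the paper or its repository, and the entropy constraint coefficient β is not modeled here.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sinkhorn(scores, n_iters=80, temperature=0.1):
    """Map a square score matrix to an approximately doubly stochastic matrix.

    Hypothetical helper: the defaults mirror the values quoted in the table,
    but the implementation is a generic textbook form, not the authors' code.
    """
    n = len(scores)
    # Work with log-probabilities; a lower temperature sharpens the output.
    log_p = [[s / temperature for s in row] for row in scores]
    for _ in range(n_iters):
        # Row step: make every row sum to 1.
        for i in range(n):
            z = logsumexp(log_p[i])
            log_p[i] = [v - z for v in log_p[i]]
        # Column step: make every column sum to 1.
        for j in range(n):
            z = logsumexp([log_p[i][j] for i in range(n)])
            for i in range(n):
                log_p[i][j] -= z
    return [[math.exp(v) for v in row] for row in log_p]


# Illustrative scores only (made-up values).
scores = [
    [0.10, 0.02, 0.05],
    [0.03, 0.09, 0.01],
    [0.04, 0.06, 0.08],
]
soft_perm = sinkhorn(scores)
```

With a fixed iteration budget the output is only approximately doubly stochastic. Convergence tends to slow down when the scaled scores are widely separated, because the limit then lies close to a hard permutation matrix.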