Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Reparameterized LLM Training via Orthogonal Equivalence Transformation

Authors: Zeju Qiu, Simon Buchholz, Tim Xiao, Maximilian Dax, Bernhard Schölkopf, Weiyang Liu

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments validate the effectiveness and scalability of POET in training LLMs. ... We start by evaluating POET on large-scale LLaMA pretraining, followed by an extensive ablation study to justify our design choices.
Researcher Affiliation Academia (1) Max Planck Institute for Intelligent Systems, Tübingen; (2) The Chinese University of Hong Kong
Pseudocode Yes 3.2.3 Overall Training Algorithm. Step 1: Initialization. We initialize the weight matrices using normalized Gaussian: $W \leftarrow W_0$. Step 2: Orthogonal matrix initialization. For fully stochastic SPO, we randomly sample an index set $S$, and parameterize $G_R \in \mathbb{R}^{b \times b}$ and $G_P \in \mathbb{R}^{b \times b}$ using CNP (Equation (6)). Both matrices are initialized as identity, so $R$ and $P$ also start as identity matrices. For block-stochastic SPO, we sample random permutation matrices $\Psi_R$, $\Psi_P$, and parameterize $\{G_R^1, \dots, G_R^{m/b}\}$ and $\{G_P^1, \dots, G_P^{n/b}\}$ using CNP. Then we initialize them as the identity, so $R$ and $P$ again start as identity matrices. Step 3: Efficient orthogonal parameterization. For fully stochastic SPO, we have $R = I_m + D(S)(G_R - I_b)D(S)^\top$ and $P = I_n + D(S)(G_P - I_b)D(S)^\top$. For block-stochastic SPO, we have $R = \Psi_R^\top \mathrm{Diag}(G_R^1, \dots, G_R^{m/b}) \Psi_R$ and $P = \Psi_P^\top \mathrm{Diag}(G_P^1, \dots, G_P^{n/b}) \Psi_P$. Step 4: Inner training loop for updating orthogonal matrices. The equivalent weight matrix in the forward pass is $RWP$. Gradients are backpropagated through $R$ and $P$ to update $G_R$, $G_P$ (fully stochastic) or $G_R^i$, $G_P^i$, $\forall i$ (block-stochastic). This inner loop runs for a fixed number of iterations. Step 5: Merge-then-reinitialize. The learned orthogonal matrices $R$ and $P$ are merged into the weight matrix by $W \leftarrow RWP$. If not terminated, return to Step 2 for reinitialization.
Open Source Code No Our method is implemented on top of the codebase from [82] (Apache 2.0 license), which we also use to reproduce the AdamW and GaLore baselines. We will release our code for reproducing all training results prior to publication.
Open Datasets Yes We use the C4 dataset [66], a cleaned web crawl corpus from Common Crawl, widely used for LLM pretraining [29, 56, 82]. ... To better evaluate models beyond the validation perplexity, we show the results of finetuning the trained model on the GLUE benchmark [75].
Dataset Splits No The paper mentions using the C4 dataset for pretraining and the GLUE benchmark for finetuning. While these are standard datasets with common splits, the paper does not explicitly state the specific train/validation/test splits (e.g., percentages, sample counts, or explicit reference to 'standard splits used') that were applied in their experiments for either dataset. For instance, for C4, it states 'We use the C4 dataset [66]' without further details on data partitioning.
Hardware Specification Yes Compute Resources. All the training tasks are performed on an NVIDIA HGX H100 8-GPU System node with 80GB memory each. Depending on the model scale, we train on 1, 4 or 8 GPUs.
Software Dependencies No Our method is implemented on top of the codebase from [82] (Apache 2.0 license), which we also use to reproduce the AdamW and GaLore baselines. ... Our work utilized the Hugging Face Transformers code base to construct the Llama model for pretraining, which is under the Apache 2.0 license. The paper mentions software components like 'Hugging Face Transformers' and a codebase from [82] but does not specify any version numbers for these or other programming languages/libraries used.
Experiment Setup Yes We employed the AdamW optimizer [55] for all our training runs. The specific hyperparameters used for each experiment are detailed in Table 9 and Table 10 referenced below. We use the cosine learning rate scheduler with the minimum learning ratio of 0.01. We use the number of warmup steps of 0, weight decay of 0.01 and gradient clipping of 0.1.
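
The five steps quoted in the Pseudocode row can be sketched in plain Python for the fully stochastic SPO variant. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes CNP denotes a Cayley transform approximated by a truncated Neumann series (the excerpt only cites Equation (6)), it draws separate index sets for R and P, and it replaces the gradient-based inner loop of Step 4 with a small random skew-symmetric perturbation. The names `cnp`, `embed`, `random_skew`, and `poet_round` are hypothetical.

```python
import random

def eye(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def fro_norm(a):
    return sum(x * x for row in a for x in row) ** 0.5

def cnp(q, k=8):
    """ASSUMED form of CNP: G = (I + Q)(I + Q + ... + Q^k), Q skew-symmetric.
    Q = 0 gives G = I, matching the identity initialization in Step 2."""
    b = len(q)
    series, power = eye(b), eye(b)
    for _ in range(k):
        power = matmul(power, q)
        series = add(series, power)
    return matmul(add(eye(b), q), series)

def embed(g, idx, dim):
    """Step 3: R = I_dim + D(S)(G - I_b)D(S)^T.
    Equivalent to writing G into the rows/columns of I_dim selected by idx."""
    r = eye(dim)
    for a, i in enumerate(idx):
        for c, j in enumerate(idx):
            r[i][j] = g[a][c]
    return r

def random_skew(b, scale, rng):
    """Hypothetical stand-in for the learned parameters after Step 4."""
    q = [[0.0] * b for _ in range(b)]
    for i in range(b):
        for j in range(i + 1, b):
            q[i][j] = rng.uniform(-scale, scale)
            q[j][i] = -q[i][j]
    return q

def poet_round(w, b, rng, scale=0.05):
    m, n = len(w), len(w[0])
    s_r = rng.sample(range(m), b)  # Step 2: sample index sets (assumed separate)
    s_p = rng.sample(range(n), b)
    g_r = cnp(random_skew(b, scale, rng))  # stand-in for Step 4 (no gradients here)
    g_p = cnp(random_skew(b, scale, rng))
    r = embed(g_r, s_r, m)  # Step 3
    p = embed(g_p, s_p, n)
    return matmul(matmul(r, w), p)  # Step 5: merge, W <- R W P

rng = random.Random(0)
m, n, b = 6, 4, 3
w0 = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]  # Step 1
w = w0
for _ in range(5):  # each pass ends with merge-then-reinitialize
    w = poet_round(w, b, rng)
print(fro_norm(w0), fro_norm(w))
```

Because R and P are (approximately) orthogonal, the Frobenius norm of the merged weight should stay essentially unchanged across rounds; the truncated series makes this approximate rather than exact, with the error shrinking as the entries of Q get smaller or `k` grows.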