Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

ResiDual Transformer Alignment with Spectral Decomposition

Authors: Lorenzo Basile, Valentino Maiorca, Luca Bortolussi, Emanuele Rodolà, Francesco Locatello

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Ultimately, we translate these insights into actionable terms by introducing ResiDual, a technique for spectral alignment of the residual stream. Much like panning for gold, it lets the noise from irrelevant unit principal components (i.e., attributes) wash away to amplify task-relevant ones. Remarkably, this dual perspective on modality alignment yields fine-tuning-level performance on different data distributions while modelling an extremely interpretable and parameter-efficient transformation, as we extensively show on 70 pre-trained network-dataset combinations (7 models, 10 datasets).
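The spectral-alignment idea in the quoted abstract — projecting unit representations onto their principal components and reweighting them so task-relevant components are amplified while irrelevant ones wash away — can be illustrated with a minimal numpy sketch. This is not the authors' ResiDual implementation; `spectral_realign` and its per-component `weights` are hypothetical names chosen for illustration:

```python
import numpy as np

def spectral_realign(X, weights):
    """Project representations onto their principal components,
    rescale each component by a per-component gain, and map back.
    X: (n_samples, d) unit representations; weights: per-PC gains."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # Principal directions via SVD of the centered data matrix.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    coords = Xc @ Vt.T                    # coordinates in the PC basis
    realigned = (coords * weights) @ Vt   # reweight, then map back
    return realigned + mu

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
# Keeping all gains at 1 recovers the input exactly (identity map).
out = spectral_realign(X, np.ones(8))
print(np.allclose(out, X))  # True
```

In the paper the weights are learned per task; here they are simply supplied, which is enough to show that the transformation is a diagonal rescaling in the PCA basis and hence highly parameter-efficient.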
Researcher Affiliation | Academia | Lorenzo Basile (University of Trieste); Valentino Maiorca (Sapienza University of Rome; Institute of Science and Technology Austria (ISTA)); Luca Bortolussi (University of Trieste); Emanuele Rodolà (Sapienza University of Rome); Francesco Locatello (Institute of Science and Technology Austria (ISTA))
Pseudocode | Yes | A.2 Connecting TextSpan and Matching Pursuit

Algorithm 1: TextSpan (Gandelsman et al., 2024)
  Input: signal matrix X ∈ R^{n×d}, dictionary D ∈ R^{k×d}, number of iterations N
  Output: reconstruction X_r^N, support set C^N
  Initialization: residual R^0 = X, reconstruction X_r^0 = 0, dictionary D^0 = D, support set C^0 = ∅
  for t ∈ {0, ..., N−1} do
    P ← D^t (R^t)^T
    p_t ← argmax_{j=1..k} Var(P[j])
    C^{t+1} ← C^t ∪ {p_t}
    R^{t+1} ← R^t − proj(R^t, D^t[p_t])
    X_r^{t+1} ← X_r^t + proj(R^t, D^t[p_t])
    D^{t+1} ← D^t − proj(D^t, D^t[p_t])
  end

Algorithm 2: Simultaneous Orthogonal Matching Pursuit (SOMP) (Tropp et al., 2006)
  Input: signal matrix X ∈ R^{n×d}, dictionary D ∈ R^{k×d}, number of iterations N
  Output: reconstruction X_r^N, support set C^N
  Initialization: residual R^0 = X, reconstruction X_r^0 = 0, support set C^0 = ∅
  for t ∈ {0, ..., N−1} do
    P ← D (R^t)^T
    p_t ← argmax_{j=1..k} ||P[j]||_1
    C^{t+1} ← C^t ∪ {p_t}
    W^t ← argmin_W ||X − W D[C^{t+1}]||_F
    X_r^{t+1} ← W^t D[C^{t+1}]
    R^{t+1} ← X − X_r^{t+1}
  end
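SOMP, as described in the quoted appendix, admits a short numpy sketch: at each step pick the dictionary atom most correlated (in l1 norm) with all residuals simultaneously, then refit all signals on the selected atoms by least squares. This is an illustrative implementation of the published algorithm, not code from the paper:

```python
import numpy as np

def somp(X, D, n_iter):
    """Simultaneous Orthogonal Matching Pursuit (Tropp et al., 2006).
    X: (n, d) signals; D: (k, d) dictionary of atoms.
    Returns the reconstruction X_r and the support list C."""
    R = X.copy()
    C = []
    for _ in range(n_iter):
        P = D @ R.T                          # (k, n) atom-residual correlations
        scores = np.abs(P).sum(axis=1)       # l1 norm per atom across signals
        scores[C] = -np.inf                  # never reselect an atom
        C.append(int(np.argmax(scores)))
        A = D[C]                             # (|C|, d) selected atoms
        # Least-squares refit of all signals on the support: min_W ||X - W A||_F
        Wt, *_ = np.linalg.lstsq(A.T, X.T, rcond=None)
        X_r = Wt.T @ A
        R = X - X_r
    return X_r, C

# Toy demo: two signals built from atoms 1 and 3 of an orthonormal dictionary.
D = np.eye(4)
X = np.array([[0., 2., 0., 1.],
              [0., 1., 0., 3.]])
X_r, C = somp(X, D, n_iter=2)
print(sorted(C), np.allclose(X_r, X))  # [1, 3] True
```

The refit over the whole support at every iteration is what distinguishes SOMP from TextSpan's greedy deflation, which only subtracts projections and never revisits earlier coefficients.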
Open Source Code | Yes | Equal contribution. Work done while visiting ISTA. Code is available at https://github.com/Flegyas/ResiDual
Open Datasets | Yes | Experimental setting: We start by evaluating the intrinsic dimensionality of head representations across multiple transformer-based vision architectures pre-trained with different objectives (supervised, unsupervised, self-supervised). Namely, we employ OpenAI's CLIP (Radford et al., 2021), OpenCLIP (Cherti et al., 2023), BLIP (Li et al., 2022), ViT (Dosovitskiy et al., 2021) and DINOv2 (Oquab et al., 2024), all in their version based on ViT-Large (results on ViT-Base models are in the Appendix in Figure 8). We feed them a subset of the training set of ImageNet (Russakovsky et al., 2015) containing 80,000 images stratified on the class labels, and we extract the representations for all attention heads. Then, we compute the intrinsic dimensionality of such representations using a linear estimator (PCA) and a nonlinear one (TwoNN). Linear ID is computed as the number of components needed by PCA to explain 99% of head variance.
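The linear intrinsic-dimensionality estimate described in the quote — the number of PCA components needed to explain 99% of the variance — is straightforward to sketch. This is an illustrative reimplementation, not the paper's code; `linear_id` is a hypothetical name:

```python
import numpy as np

def linear_id(X, threshold=0.99):
    """Linear intrinsic dimensionality: the number of principal
    components needed to explain `threshold` of the total variance."""
    Xc = X - X.mean(axis=0)
    # Squared singular values are proportional to per-component variance.
    s = np.linalg.svd(Xc, compute_uv=False)
    ratios = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(ratios, threshold) + 1)

rng = np.random.default_rng(0)
# Data that varies along only 3 of its 16 ambient dimensions.
Z = np.concatenate([rng.normal(size=(500, 3)), np.zeros((500, 13))], axis=1)
print(linear_id(Z))  # 3
```

The nonlinear TwoNN estimator the authors also use has no such closed form; it relies on ratios of first- and second-nearest-neighbor distances and is not sketched here.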
Dataset Splits | Yes | We feed them a subset of the training set of ImageNet (Russakovsky et al., 2015) containing 80,000 images stratified on the class labels, and we extract the representations for all attention heads. Experimental setting: We consider the same ViT-based encoders of Section 3.1, and 14 different datasets: ImageNet (the same split used in Section 3.1), CIFAR(-100/-10) (Krizhevsky, 2009), ImageNet-Sketch (Wang et al., 2019), Cars (Krause et al., 2013), MNIST (LeCun et al., 1998), SVHN (Netzer et al., 2011), EuroSAT (Helber et al., 2019), RESISC45 (Cheng et al., 2017), DTD (Cimpoi et al., 2014), SUN397 (Xiao et al., 2016), GTSRB (Stallkamp et al., 2011), PACS (Li et al., 2017), and random images (10,000 samples with RGB values in [−1, 1]). We use the original train/validation/test splits if available; otherwise we produce the splits through a stratified random sampling over the classes. For each encoder, we use our similarity measure to compare its unit representations produced on each training dataset with the ones obtained on the training split of ImageNet.
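The stratified random sampling the authors describe — drawing a subset so that each class keeps its original proportion — can be sketched in a few lines. `stratified_subset` is a hypothetical helper for illustration, not code from the paper:

```python
import numpy as np

def stratified_subset(labels, n_total, seed=0):
    """Pick n_total indices so each class keeps its original proportion."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    picked = []
    for c, cnt in zip(classes, counts):
        n_c = round(n_total * cnt / labels.size)   # proportional quota
        idx = np.flatnonzero(labels == c)
        picked.extend(rng.choice(idx, size=n_c, replace=False))
    return np.array(picked)

labels = np.repeat([0, 1, 2], [500, 300, 200])  # imbalanced classes
subset = stratified_subset(labels, 100)
print(np.bincount(labels[subset]))  # [50 30 20]
```

Rounding the per-class quotas can make the subset size differ from `n_total` by a few samples when proportions do not divide evenly; production code would redistribute the remainder.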
Hardware Specification | No | The paper does not explicitly mention any specific hardware used for running its experiments, such as GPU or CPU models.
Software Dependencies | No | All training runs use the Schedule-Free Adam optimizer (Defazio et al., 2024) with the automatic learning rate finder by Smith (2017), implemented in PyTorch Lightning (Falcon & The PyTorch Lightning team, 2019). The maximum number of epochs is 30, with an early-stopping policy on the validation set accuracy with patience of 5 epochs. While PyTorch Lightning is mentioned, no specific version number for it or other software dependencies is provided.
Experiment Setup | Yes | All training runs use the Schedule-Free Adam optimizer (Defazio et al., 2024) with the automatic learning rate finder by Smith (2017), implemented in PyTorch Lightning (Falcon & The PyTorch Lightning team, 2019). The maximum number of epochs is 30, with an early-stopping policy on the validation set accuracy with patience of 5 epochs.
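The stated early-stopping policy (patience of 5 epochs on validation accuracy, at most 30 epochs) can be sketched as a standalone function. This mirrors the policy as described, not PyTorch Lightning's actual EarlyStopping callback:

```python
def early_stopping(val_accuracies, patience=5, max_epochs=30):
    """Return the epoch at which training stops: either max_epochs, or
    `patience` epochs after the last improvement in validation accuracy."""
    best, best_epoch = float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies[:max_epochs], start=1):
        if acc > best:
            best, best_epoch = acc, epoch       # new best: reset patience
        elif epoch - best_epoch >= patience:
            return epoch                        # no improvement for `patience` epochs
    return min(len(val_accuracies), max_epochs)

# Accuracy plateaus after epoch 4; training halts at epoch 9 (4 + patience 5).
accs = [0.60, 0.70, 0.75, 0.80] + [0.79] * 20
print(early_stopping(accs))  # 9
```

In PyTorch Lightning the equivalent configuration would monitor validation accuracy in `max` mode; the trainer handles the bookkeeping that this sketch makes explicit.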