Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Tracing Representation Progression: Analyzing and Enhancing Layer-Wise Similarity

Authors: Jiachen Jiang, Jinxin Zhou, Zhihui Zhu

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results on common transformers reveal that representations across layers are positively correlated, with similarity increasing when layers get closer. We conduct experiments on both vision and NLP tasks to demonstrate the performance of the proposed aligned training.
Researcher Affiliation | Academia | Jiachen Jiang, Jinxin Zhou & Zhihui Zhu, Department of Computer Science and Engineering, The Ohio State University, EMAIL
Pseudocode | No | The paper describes methods and mathematical formulations (e.g., equations for COS, CKA, and the aligned loss) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor any link to a code repository for the described methodology.
Open Datasets | Yes | We conduct experiments on both vision and NLP tasks to demonstrate the performance of the proposed aligned training. We conduct experiments on both the CIFAR10 and ImageNet-1K datasets. For text classification tasks, we get Aligned BERT by fine-tuning a pretrained 12-layer BERT-Base model (Devlin, 2018) using the aligned training method on GLUE benchmark (Wang et al., 2018) tasks. For the text generation task, we get Aligned GPT by fine-tuning a pretrained 12-layer GPT-2 model (Radford et al., 2019) using the aligned training method on the WikiText-103 dataset (Merity et al., 2016).
Dataset Splits | Yes | We conduct experiments on both the CIFAR10 and ImageNet-1K datasets. The CIFAR10 dataset includes 60,000 color images in 10 classes, each measuring 32×32 pixels. ImageNet-1K contains 1.2 million color images distributed in 1000 classes. For text classification tasks... on GLUE benchmark... The WikiText-103 language modeling dataset consists of over 100 million tokens extracted from Wikipedia's verified good and featured articles.
Hardware Specification | Yes | For both vision and NLP tasks, we used 4 RTX A5000 GPUs with 24GB of memory each. The model is fine-tuned using a single 24GB RTX A5000 GPU for 70 hours.
Software Dependencies | No | The paper mentions several models (e.g., DeiT-S, BERT-Base, GPT-2) and optimizers (AdamW) but does not specify version numbers for any software libraries (e.g., PyTorch, TensorFlow) or programming languages (e.g., Python).
Experiment Setup | Yes | For optimization, we employ AdamW with an initial learning rate of 0.1. This rate decays according to MultiStepLR at the 100th and 150th epochs, over a total of 200 epochs. We set the weight decay at 1e-4. The global batch size for both datasets is set at 256. In our Aligned BERT experiments on the GLUE dataset, we used a sequence length of 256. We employed AdamW for optimization with an initial learning rate of 2e-5 and a batch size of 32. Each task underwent fine-tuning for three epochs. For Aligned GPT experiments on the WikiText-103 dataset, we maintained the sequence length at 256 and used AdamW with an initial learning rate of 2e-5. In this case, we set the batch size to 8.
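The layer-wise similarity measures named in the Pseudocode row (COS, CKA) are not reproduced in this report. As a rough reference, here is a minimal sketch of linear CKA in the standard formulation of Kornblith et al. (2019); the paper's exact variant and centering conventions may differ, so treat this as illustrative rather than as the authors' implementation.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two representation
    matrices X (n x d1) and Y (n x d2), where rows index examples.

    Standard linear-CKA formula; the paper's exact variant may differ.
    Returns a value in [0, 1], with 1 meaning identical representations
    up to orthogonal transformation and isotropic scaling.
    """
    # Center each feature dimension across examples.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Numerator: squared Frobenius norm of the cross-covariance.
    num = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    # Denominator: product of the self-similarity norms.
    den = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return num / den
```

In a layer-wise analysis like the paper's, `X` and `Y` would hold the activations of two different transformer layers on the same batch of inputs.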
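The decay schedule quoted in the Experiment Setup row (MultiStepLR at epochs 100 and 150 over 200 epochs, initial rate 0.1) can be sketched as a small pure-Python function mirroring PyTorch's MultiStepLR semantics. The decay factor `gamma=0.1` is an assumption, as the excerpt does not state it.

```python
def multistep_lr(base_lr, epoch, milestones=(100, 150), gamma=0.1):
    """Learning rate after `epoch` completed epochs under a
    MultiStepLR-style schedule: multiply by `gamma` once for each
    milestone that has been reached. Milestones (100, 150) match the
    quoted setup; gamma=0.1 is an assumed decay factor not stated
    in the excerpt.
    """
    decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * (gamma ** decays)
```

For example, with `base_lr=0.1` the rate stays at 0.1 through epoch 99, drops to 0.01 at epoch 100, and to 0.001 at epoch 150.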