Linguistic Collapse: Neural Collapse in (Large) Language Models

Authors: Robert Wu, Vardan Papyan

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper empirically investigates the impact of scaling the architectures and training of causal language models (CLMs) on their progression towards neural collapse (NC).
Researcher Affiliation | Academia | Robert Wu (University of Toronto, Vector Institute; rupert@cs.toronto.edu) and Vardan Papyan (University of Toronto, Vector Institute; vardan.papyan@utoronto.ca).
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | Code for (post-)training analysis is hosted on GitHub. Main code: https://github.com/rhubarbwu/linguistic-collapse; auxiliary package: https://github.com/rhubarbwu/neural-collapse.
Open Datasets | Yes | TinyStories [2] is a synthetic dataset generated by GPT-3.5 and GPT-4 using around 1,500 English words a child might use.
Dataset Splits | Yes | The 2,141,709 stories are split into 2,119,719 training and 21,990 validation stories (a loading sketch follows the table).
Hardware Specification | Yes | Each model was trained on a single NVIDIA A100 (40GB) GPU for up to 8 hours per epoch.
Software Dependencies | Yes | transformers_version 4.28.1.
Experiment Setup | Yes | We use 30 CLM architectures based on GPT Neo [80], configured similarly to [2]. They vary in width (embedding dimension) d ∈ {64, 128, 256, 512, 768, 1024} and depth (number of self-attention layers) L ∈ {1, 2, 4, 8, 12}. Our models were trained by teacher forcing using CE loss. For each architecture, we trained multiple models for 1, 3, and 10 epochs, ablating over weight decay factors β = 0.0005 [51] and β = 0.1 [81] (a configuration sketch follows the table).
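
The Dataset Splits row above reports TinyStories' train/validation partition. The following is a minimal sketch of loading the dataset with the Hugging Face `datasets` library; the hub identifier `roneneldan/TinyStories` and the exact split names are assumptions, not details taken from the table or from the paper's own pipeline.

```python
# Minimal sketch: load TinyStories and inspect its splits.
# The hub identifier "roneneldan/TinyStories" is an assumption.
from datasets import load_dataset

tiny_stories = load_dataset("roneneldan/TinyStories")

# The table reports 2,119,719 training and 21,990 validation stories.
print({split: len(subset) for split, subset in tiny_stories.items()})
```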
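
The Experiment Setup row describes a width/depth grid of 30 GPT Neo configurations. Below is a hedged sketch of how such a grid could be enumerated with `transformers.GPTNeoConfig` (version 4.28.1, per the Software Dependencies row); the head counts, vocabulary size, context length, and attention types are illustrative assumptions rather than the authors' exact settings.

```python
# Sketch of the 30-architecture grid (6 widths x 5 depths) from the table.
# Only hidden_size and num_layers come from the paper; the remaining
# hyperparameters are assumptions made so the example is runnable.
from itertools import product

from transformers import GPTNeoConfig, GPTNeoForCausalLM

WIDTHS = [64, 128, 256, 512, 768, 1024]  # embedding dimension d
DEPTHS = [1, 2, 4, 8, 12]                # number of self-attention layers L
WEIGHT_DECAYS = [0.0005, 0.1]            # ablated weight-decay factors

def make_config(d: int, L: int) -> GPTNeoConfig:
    return GPTNeoConfig(
        hidden_size=d,
        num_layers=L,
        num_heads=max(1, d // 64),          # assumption: 64-dimensional heads
        attention_types=[[["global"], L]],  # assumption: global attention only
        vocab_size=50257,                   # assumption: GPT-2 BPE vocabulary
        max_position_embeddings=512,        # assumption: short-context stories
    )

configs = {(d, L): make_config(d, L) for d, L in product(WIDTHS, DEPTHS)}
print(f"{len(configs)} architectures in the grid")  # 30

# Instantiate one (small) member of the grid as an example.
model = GPTNeoForCausalLM(configs[(64, 1)])
```

Training would then pair each architecture with the standard causal-LM cross-entropy objective (teacher forcing) and one of the two weight-decay factors, e.g. `torch.optim.AdamW(model.parameters(), weight_decay=0.0005)`; the choice of AdamW here is an assumption, not a detail stated in the table.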