Linguistic Collapse: Neural Collapse in (Large) Language Models
Authors: Robert Wu, Vardan Papyan
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper empirically investigates the impact of scaling the architectures and training of causal language models (CLMs) on their progression towards neural collapse (NC). |
| Researcher Affiliation | Academia | Robert Wu, University of Toronto, Vector Institute (rupert@cs.toronto.edu); Vardan Papyan, University of Toronto, Vector Institute (vardan.papyan@utoronto.ca) |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Our code is hosted on GitHub: https://github.com/rhubarbwu/linguistic-collapse. Code for (post-)training analysis is also hosted on GitHub. Main code: https://github.com/rhubarbwu/linguistic-collapse. Auxiliary package: https://github.com/rhubarbwu/neural-collapse |
| Open Datasets | Yes | TinyStories [2] is a synthetic dataset generated by GPT-3.5 and GPT-4 using around 1,500 English words a child might use (see the data-loading sketch after this table). |
| Dataset Splits | Yes | The 2,141,709 stories are split into 2,119,719 train and 21,990 validation stories. |
| Hardware Specification | Yes | Each model was trained on a single NVIDIA A100 (40GB) GPU for up to 8 hours per epoch. |
| Software Dependencies | Yes | `transformers` version 4.28.1 |
| Experiment Setup | Yes | We use 30 CLM architectures based on GPT-Neo [80], configured similarly to [2]. They vary in width (embedding dimension) d ∈ {64, 128, 256, 512, 768, 1024} and depth (number of self-attention layers) L ∈ {1, 2, 4, 8, 12}. Our models were trained by teacher-forcing using CE loss. For each architecture, we trained multiple models for 1, 3, and 10 epochs, ablating over weight decay factors β = 0.0005 [51] and β = 0.1 [81] (see the configuration sketch after this table). |
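
The dataset rows above can be illustrated with a short loading script. This is a minimal sketch rather than the authors' code: it assumes the Hugging Face `datasets` package and the public `roneneldan/TinyStories` hub identifier, and simply checks the train/validation split sizes quoted in the table.

```python
# Minimal sketch: load TinyStories and inspect its splits.
# Assumes the `datasets` package and the `roneneldan/TinyStories` hub dataset;
# the paper does not specify how the authors fetched the data.
from datasets import load_dataset

tiny_stories = load_dataset("roneneldan/TinyStories")

train = tiny_stories["train"]
valid = tiny_stories["validation"]

# Reported split sizes: 2,119,719 train and 21,990 validation stories.
print(f"train: {len(train):,} stories")
print(f"validation: {len(valid):,} stories")
```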
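
The 30-architecture grid from the Experiment Setup row can also be written down explicitly. The sketch below is an illustration, not the authors' released training code: the width and depth values come from the paper, while the head count, context length, and all-global attention pattern are assumptions, and the `TrainingArguments` at the end only shows where the epoch counts and weight-decay factors from the ablation would enter.

```python
# Sketch of the width/depth grid of GPT-Neo-style CLMs described in the paper.
# Widths d and depths L are from the paper; num_heads, context length, and the
# all-global attention pattern are illustrative assumptions.
from transformers import GPTNeoConfig, GPTNeoForCausalLM, TrainingArguments

WIDTHS = [64, 128, 256, 512, 768, 1024]  # embedding dimension d
DEPTHS = [1, 2, 4, 8, 12]                # number of self-attention layers L

configs = {
    (d, L): GPTNeoConfig(
        hidden_size=d,
        num_layers=L,
        num_heads=16,                       # assumed; must divide hidden_size
        attention_types=[[["global"], L]],  # one attention type per layer
        max_position_embeddings=512,        # assumed context length
    )
    for d in WIDTHS
    for L in DEPTHS
}
assert len(configs) == 30  # 6 widths x 5 depths

# Instantiate one model from the grid; `GPTNeoForCausalLM` computes the
# teacher-forced cross-entropy loss when labels are provided.
model = GPTNeoForCausalLM(configs[(256, 4)])
print(f"d=256, L=4 parameters: {sum(p.numel() for p in model.parameters()):,}")

# Where the paper's ablation settings would enter a Trainer run
# (epoch counts 1/3/10 and weight decay 0.0005 or 0.1).
args = TrainingArguments(
    output_dir="out/d256-L4",
    num_train_epochs=3,
    weight_decay=0.0005,
)
```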