Fractal Patterns May Illuminate the Success of Next-Token Prediction
Authors: Ibrahim M. Alabdulmohsin, Vinh Tran, Mostafa Dehghani
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We carry out an extensive analysis across different domains and architectures, showing that fractal parameters are robust. Finally, we demonstrate that the tiny variations in fractal parameters seen across LLMs improve upon perplexity-based bits-per-byte (BPB) in predicting their downstream performance. |
| Researcher Affiliation | Industry | Ibrahim Alabdulmohsin, Google DeepMind, Zürich, Switzerland (ibomohsin@google.com); Vinh Q. Tran, Google DeepMind, New York, USA (vqtran@google.com); Mostafa Dehghani, Google DeepMind, Mountain View, USA (dehghani@google.com) |
| Pseudocode | No | The paper describes methods in text and uses mathematical formulas but does not include any explicit pseudocode blocks or algorithms. |
| Open Source Code | No | We only release a portion of the code that can be used to calculate fractal parameters. (A generic fractal-parameter estimation sketch follows this table.) |
| Open Datasets | Yes | For analysis, we use The Pile validation split [23], consisting of 22 subdomains such as Wikipedia and GitHub. [...] To test this hypothesis, we pretrain three decoder-only T5.1.1 models with 1B parameters on SlimPajama-627B [62] [...] We take the Wikipedia (wikipedia/20230601.en) dataset [67]. |
| Dataset Splits | Yes | For analysis, we use The Pile validation split [23], consisting of 22 subdomains such as Wikipedia and GitHub. |
| Hardware Specification | Yes | All experiments are executed on Tensor Processing Units (TPUs). [...] Models are trained using 256 TPUv5e chips [32]. |
| Software Dependencies | No | All of our experiments are conducted in JAX/Flax [10] using the open source T5X framework [56]. (Frameworks are named, but no version information is given.) |
| Experiment Setup | Yes | Training is done for 500k steps with a sequence length of 1024 and batch size of 512, resulting in a total of 262B tokens seen during pretraining. We optimize our model with the Adafactor [61] optimizer with an inverse square root learning rate schedule, 1k warmup steps, and an initial learning rate of 1e-2. (An illustrative optax sketch of this configuration follows the table.) |
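
Since only part of the fractal-parameter code is released, the following is a minimal, generic sketch of one standard estimator, rescaled-range (R/S) analysis applied to a per-token bits sequence. The function name `hurst_rs`, its defaults, and the example input `bits` are our own assumptions; this is not the authors' implementation, and the estimators used in the paper may differ in detail.

```python
import numpy as np

def hurst_rs(increments, min_window=16, num_scales=20):
    """Estimate a Hurst exponent from a 1-D increment series via
    rescaled-range (R/S) analysis: E[R/S](n) grows roughly like n**H."""
    x = np.asarray(increments, dtype=np.float64)
    max_window = len(x) // 4
    windows = np.unique(np.geomspace(min_window, max_window, num_scales).astype(int))
    log_n, log_rs = [], []
    for n in windows:
        ratios = []
        for start in range(0, len(x) - n + 1, n):   # non-overlapping windows of length n
            w = x[start:start + n]
            profile = np.cumsum(w - w.mean())        # mean-adjusted partial sums
            r = profile.max() - profile.min()        # range of the profile
            s = w.std(ddof=1)                        # sample std of the window
            if s > 0:
                ratios.append(r / s)
        if ratios:
            log_n.append(np.log(n))
            log_rs.append(np.log(np.mean(ratios)))
    slope, _ = np.polyfit(log_n, log_rs, 1)          # slope of the log-log fit is H
    return slope

# Hypothetical usage: per-token "bits" from an LLM, e.g. bits = -log2 p(token | context).
# bits = -np.log2(token_probs)
# H = hurst_rs(bits - bits.mean())
```

The paper reports both a self-similarity exponent and a Hurst parameter; the sketch covers only a generic Hurst estimate and is meant as orientation, not reproduction.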
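
The experiment-setup row quotes the optimizer hyperparameters. Below is one way such a configuration could be written with `optax` in the JAX ecosystem the paper reports using; the precise warmup/decay formula is an assumption on our part, and the paper's actual T5X configuration may differ.

```python
import jax.numpy as jnp
import optax

# Hyperparameters quoted in the table row above; the schedule formula itself is a
# common T5-style choice and is an assumption, not taken from the paper.
INIT_LR = 1e-2
WARMUP_STEPS = 1_000
TRAIN_STEPS = 500_000
SEQ_LEN = 1024
BATCH_SIZE = 512  # 500k steps x 1024 tokens x 512 sequences ~= 262B tokens seen

def inverse_sqrt_schedule(step):
    """Hold INIT_LR through warmup, then decay proportionally to 1/sqrt(step)."""
    step = jnp.maximum(step, WARMUP_STEPS)
    return INIT_LR * jnp.sqrt(WARMUP_STEPS / step)

# Adafactor accepts a schedule-valued learning rate in optax.
optimizer = optax.adafactor(learning_rate=inverse_sqrt_schedule)
```

The step, sequence-length, and batch-size constants are included only to make the quoted 262B-token total explicit.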