Fractal Patterns May Illuminate the Success of Next-Token Prediction

Authors: Ibrahim M. Alabdulmohsin, Vinh Tran, Mostafa Dehghani

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We carry out an extensive analysis across different domains and architectures, showing that fractal parameters are robust. Finally, we demonstrate that the tiny variations in fractal parameters seen across LLMs improve upon perplexity-based bits-per-byte (BPB) in predicting their downstream performance.
Researcher Affiliation | Industry | Ibrahim Alabdulmohsin, Google DeepMind, Zürich, Switzerland (ibomohsin@google.com); Vinh Q. Tran, Google DeepMind, New York, USA (vqtran@google.com); Mostafa Dehghani, Google DeepMind, Mountain View, USA (dehghani@google.com)
Pseudocode | No | The paper describes methods in text and uses mathematical formulas but does not include any explicit pseudocode blocks or algorithms.
Open Source Code | No | We only release a portion of the code that can be used to calculate fractal parameters. (See the estimation sketch after the table.)
Open Datasets | Yes | For analysis, we use The Pile validation split [23], consisting of 22 subdomains such as Wikipedia and GitHub. [...] To test this hypothesis, we pretrain three decoder-only T5.1.1 models with 1B parameters on SlimPajama-627B [62] [...] We take the Wikipedia (wikipedia/20230601.en) dataset [67]. (See the data-loading sketch after the table.)
Dataset Splits | Yes | For analysis, we use The Pile validation split [23], consisting of 22 subdomains such as Wikipedia and GitHub.
Hardware Specification | Yes | All experiments are executed on Tensor Processing Units (TPUs). [...] Models are trained using 256 TPUv5e chips [32].
Software Dependencies | No | All of our experiments are conducted in JAX/Flax [10] using the open source T5X framework [56].
Experiment Setup | Yes | Training is done for 500k steps with a sequence length of 1024 and batch size of 512, resulting in a total of 262B tokens seen during pretraining. We optimize our model with the Adafactor [61] optimizer with an inverse square root learning rate schedule, 1k warmup steps, and an initial learning rate of 1e-2. (See the configuration sketch after the table.)
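
The code release noted in the Open Source Code row is only partial, and nothing below reproduces it. As background, fractal parameters such as the Hurst exponent are typically estimated from an increment series (in the paper, per-token log-likelihoods produced by a language model) by measuring how fluctuations scale with window size. The following is a minimal, generic rescaled-range (R/S) sketch in NumPy; the function name hurst_rs, the window defaults, and the choice of R/S analysis are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def hurst_rs(increments, min_window=16, num_scales=12):
    """Rough Hurst-exponent estimate for a 1-D increment series via R/S analysis.

    `increments` could be, e.g., per-token negative log-probabilities from a
    language model; here it is simply treated as a generic 1-D array.
    """
    x = np.asarray(increments, dtype=np.float64)
    n = len(x)
    # Window sizes spaced logarithmically between min_window and n // 4.
    sizes = np.unique(np.floor(
        np.logspace(np.log10(min_window), np.log10(n // 4), num_scales)).astype(int))
    log_sizes, log_rs = [], []
    for w in sizes:
        rs_vals = []
        for start in range(0, n - w + 1, w):       # non-overlapping windows of length w
            seg = x[start:start + w]
            dev = np.cumsum(seg - seg.mean())       # cumulative deviation from the window mean
            r = dev.max() - dev.min()               # range of the cumulative deviations
            s = seg.std(ddof=0)
            if s > 0:
                rs_vals.append(r / s)
        if rs_vals:
            log_sizes.append(np.log(w))
            log_rs.append(np.log(np.mean(rs_vals)))
    # R/S scales roughly as w**H, so the slope of log(R/S) vs. log(w) estimates H.
    slope, _ = np.polyfit(log_sizes, log_rs, 1)
    return slope

# Example on synthetic i.i.d. noise (true H = 0.5; R/S gives a rough, slightly biased estimate).
rng = np.random.default_rng(0)
print(hurst_rs(rng.standard_normal(100_000)))
```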
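The datasets named in the Open Datasets row are publicly distributed. A minimal data-loading sketch follows, assuming the TFDS config wikipedia/20230601.en quoted in the paper is available in your TFDS version and that SlimPajama-627B is pulled from the commonly used Hugging Face repository cerebras/SlimPajama-627B; both hosting details are assumptions, not statements from the paper.

```python
import tensorflow_datasets as tfds
from datasets import load_dataset

# Wikipedia snapshot named in the paper (TFDS builder config; may trigger a large download/build).
wiki = tfds.load("wikipedia/20230601.en", split="train")

# SlimPajama-627B pretraining corpus, streamed so nothing is downloaded in full;
# the repository name is an assumption about where the corpus is mirrored.
slimpajama = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

for example in slimpajama.take(1):
    print(example["text"][:200])
```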
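The token total in the Experiment Setup row follows directly from the stated hyperparameters: 500,000 steps × 512 sequences per step × 1,024 tokens per sequence = 262,144,000,000 ≈ 262B tokens. Below is a minimal configuration sketch of the described optimizer, written with Optax rather than the authors' T5X setup; the exact warmup behaviour of the inverse square-root schedule is an assumption (here the rate is held at 1e-2 during warmup and decays as 1/sqrt(step) afterwards).

```python
import jax.numpy as jnp
import optax

INIT_LR = 1e-2        # initial learning rate from the quoted setup
WARMUP_STEPS = 1_000  # warmup steps from the quoted setup

def rsqrt_schedule(step):
    # Inverse square-root decay with a flat warmup plateau: the rate stays at
    # INIT_LR for the first WARMUP_STEPS, then decays as 1/sqrt(step). The exact
    # parameterisation the authors used in T5X is an assumption.
    return INIT_LR * jnp.sqrt(WARMUP_STEPS / jnp.maximum(step, WARMUP_STEPS))

# Adafactor driven by the schedule; Optax accepts a schedule callable here.
optimizer = optax.adafactor(learning_rate=rsqrt_schedule)

# Token-count arithmetic from the quoted setup.
steps, batch_size, seq_len = 500_000, 512, 1_024
print(f"tokens seen ≈ {steps * batch_size * seq_len / 1e9:.1f}B")  # ≈ 262.1B
```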