Fractal Patterns May Illuminate the Success of Next-Token Prediction
Authors: Ibrahim M. Alabdulmohsin, Vinh Tran, Mostafa Dehghani
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We carry out an extensive analysis across different domains and architectures, showing that fractal parameters are robust. Finally, we demonstrate that the tiny variations in fractal parameters seen across LLMs improve upon perplexity-based bits-per-byte (BPB) in predicting their downstream performance. |
| Researcher Affiliation | Industry | Ibrahim Alabdulmohsin, Google DeepMind, Zürich, Switzerland (ibomohsin@google.com); Vinh Q. Tran, Google DeepMind, New York, USA (vqtran@google.com); Mostafa Dehghani, Google DeepMind, Mountain View, USA (dehghani@google.com) |
| Pseudocode | No | The paper describes methods in text and uses mathematical formulas but does not include any explicit pseudocode blocks or algorithms. |
| Open Source Code | No | We only release a portion of the code that can be used to calculate fractal parameters. (A generic fractal-parameter estimation sketch follows this table.) |
| Open Datasets | Yes | For analysis, we use The Pile validation split [23], consisting of 22 subdomains such as Wikipedia and GitHub. [...] To test this hypothesis, we pretrain three decoder-only T5.1.1 models with 1B parameters on SlimPajama-627B [62] [...] We take the Wikipedia (wikipedia/20230601.en) dataset [67]. |
| Dataset Splits | Yes | For analysis, we use The Pile validation split [23], consisting of 22 subdomains such as Wikipedia and GitHub. |
| Hardware Specification | Yes | All experiments are executed on Tensor Processing Units (TPUs). [...] Models are trained using 256 TPUv5e chips [32]. |
| Software Dependencies | No | All of our experiments are conducted in JAX/Flax [10] using the open source T5X framework [56]. (Frameworks are named, but no version information is given.) |
| Experiment Setup | Yes | Training is done for 500k steps with a sequence length of 1024 and batch size of 512, resulting in a total of 262B tokens seen during pretraining. We optimize our model with the Adafactor [61] optimizer with an inverse square root learning rate schedule, 1k warmup steps, and an initial learning rate of 1e-2. (An illustrative optax sketch of this configuration follows the table.) |
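
Since only part of the fractal-parameter code is released, the following is a minimal, generic sketch of one standard estimator, rescaled-range (R/S) analysis applied to a per-token bits sequence. The function name `hurst_rs`, its defaults, and the example input `bits` are our own assumptions; this is not the authors' implementation, and the estimators used in the paper may differ in detail.

```python
import numpy as np

def hurst_rs(increments, min_window=16, num_scales=20):
    """Estimate a Hurst exponent from a 1-D increment series via
    rescaled-range (R/S) analysis: E[R/S](n) grows roughly like n**H."""
    x = np.asarray(increments, dtype=np.float64)
    max_window = len(x) // 4
    windows = np.unique(np.geomspace(min_window, max_window, num_scales).astype(int))
    log_n, log_rs = [], []
    for n in windows:
        ratios = []
        for start in range(0, len(x) - n + 1, n):   # non-overlapping windows of length n
            w = x[start:start + n]
            profile = np.cumsum(w - w.mean())        # mean-adjusted partial sums
            r = profile.max() - profile.min()        # range of the profile
            s = w.std(ddof=1)                        # sample std of the window
            if s > 0:
                ratios.append(r / s)
        if ratios:
            log_n.append(np.log(n))
            log_rs.append(np.log(np.mean(ratios)))
    slope, _ = np.polyfit(log_n, log_rs, 1)          # slope of the log-log fit is H
    return slope

# Hypothetical usage: per-token "bits" from an LLM, e.g. bits = -log2 p(token | context).
# bits = -np.log2(token_probs)
# H = hurst_rs(bits - bits.mean())
```

The paper reports both a self-similarity exponent and a Hurst parameter; the sketch covers only a generic Hurst estimate and is meant as orientation, not reproduction.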
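
The experiment-setup row quotes the optimizer hyperparameters. Below is one way such a configuration could be written with `optax` in the JAX ecosystem the paper reports using; the precise warmup/decay formula is an assumption on our part, and the paper's actual T5X configuration may differ.

```python
import jax.numpy as jnp
import optax

# Hyperparameters quoted in the table row above; the schedule formula itself is a
# common T5-style choice and is an assumption, not taken from the paper.
INIT_LR = 1e-2
WARMUP_STEPS = 1_000
TRAIN_STEPS = 500_000
SEQ_LEN = 1024
BATCH_SIZE = 512  # 500k steps x 1024 tokens x 512 sequences ~= 262B tokens seen

def inverse_sqrt_schedule(step):
    """Hold INIT_LR through warmup, then decay proportionally to 1/sqrt(step)."""
    step = jnp.maximum(step, WARMUP_STEPS)
    return INIT_LR * jnp.sqrt(WARMUP_STEPS / step)

# Adafactor accepts a schedule-valued learning rate in optax.
optimizer = optax.adafactor(learning_rate=inverse_sqrt_schedule)
```

The step, sequence-length, and batch-size constants are included only to make the quoted 262B-token total explicit.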