Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Scaling can lead to compositional generalization

Authors: Florian Redhardt, Yassir Akram, Simon Schug

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that standard multilayer perceptrons compositionally generalize on a variety of tasks as data and model size are scaled across task encodings if the training distribution sufficiently covers the task space. We prove that multilayer perceptrons can approximate a general class of compositional task families to arbitrary precision using only a linear number of neurons with respect to the number of task modules.
Researcher Affiliation | Academia | Florian Redhardt (ETH Zurich), Yassir Akram (ETH Zurich), Simon Schug (Princeton University)
Pseudocode | Yes | Algorithm 1: Interval Shift
Open Source Code | Yes | Code available at https://github.com/smonsays/scale-compositionality
Open Datasets | No | We create the hyperteacher task family described in Section 2.4 with I = 16 input neurons, H = 16 hidden neurons and O = 16 output neurons to create a family of compositional regression tasks to be learned by a student. For the definition of the compositional preference task family, please refer to [18]. We construct a large number of compositional tasks that require composing several concepts.
Dataset Splits | Yes | Fraction of tasks held out from training: 0.125, 0.25, 0.5, 0.75, 0.9, 0.95, 0.98, 0.99
Hardware Specification | Yes | We used a Linux workstation with two Nvidia RTX 3090 GPUs with 24GB of memory each for development and conducted hyperparameter searches and experiments on an internal Slurm cluster using Nvidia RTX 4090 GPUs and Nvidia A100 GPUs.
Software Dependencies | No | We implemented our experiments in Python using JAX [78, Apache License 2.0], Flax [79, Apache License 2.0], the DeepMind JAX Ecosystem [80, Apache License 2.0], PyTorch [81, BSD-style license], LLM [82, Apache License 2.0] and Scikit-learn [83, BSD 3-Clause License]. The paper lists software and their licenses but does not specify exact version numbers for each dependency.
Experiment Setup | Yes | Throughout our experiments, we use the AdamW optimizer [73] with a batch size of 128. On the hyperteacher task family, we use a mean-squared error loss; on the compositional preference task family, we use a cross-entropy loss. We performed an initial grid search over the learning rate and weight decay to find a common set of hyperparameters for all experiments on the hyperteacher task family and a common set of hyperparameters for all experiments on the compositional preference tasks respectively. We report the search grid in Table 3.
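
To make the Dataset Splits entry more concrete, the following is a minimal sketch of how a given fraction of tasks might be held out from training. It is not taken from the paper or its repository: the function `split_tasks`, the uniform random split, the task count of 1000 and the seed are all assumptions made for illustration. Only the list of held-out fractions comes from the table above. The paper's actual procedure may differ, for example to ensure the training tasks sufficiently cover the task space.

```python
import random

# Held-out fractions as listed in the Dataset Splits entry.
HELD_OUT_FRACTIONS = [0.125, 0.25, 0.5, 0.75, 0.9, 0.95, 0.98, 0.99]


def split_tasks(num_tasks, held_out_fraction, seed=0):
    """Randomly partition task indices into (train, held_out) lists.

    Hypothetical helper: a uniform random split is assumed here and is
    not confirmed by the source.
    """
    if not 0.0 <= held_out_fraction <= 1.0:
        raise ValueError("held_out_fraction must lie in [0, 1]")
    task_ids = list(range(num_tasks))
    # A dedicated, seeded generator keeps the split reproducible.
    random.Random(seed).shuffle(task_ids)
    num_held_out = round(num_tasks * held_out_fraction)
    held_out = sorted(task_ids[:num_held_out])
    train = sorted(task_ids[num_held_out:])
    return train, held_out


if __name__ == "__main__":
    # num_tasks=1000 is an arbitrary illustrative value.
    for fraction in HELD_OUT_FRACTIONS:
        train, held_out = split_tasks(num_tasks=1000, held_out_fraction=fraction)
        print(f"{fraction}: {len(train)} train, {len(held_out)} held out")
```

Fixing the seed means the same tasks are held out on every run, which makes results at different model and data scales comparable under this assumed setup.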