Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Scaling can lead to compositional generalization
Authors: Florian Redhardt, Yassir Akram, Simon Schug
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that standard multilayer perceptrons compositionally generalize on a variety of tasks as data and model size are scaled across task encodings if the training distribution sufficiently covers the task space. We prove that multilayer perceptrons can approximate a general class of compositional task families to arbitrary precision using only a linear number of neurons with respect to the number of task modules. |
| Researcher Affiliation | Academia | Florian Redhardt (ETH Zurich), Yassir Akram (ETH Zurich), Simon Schug (Princeton University) |
| Pseudocode | Yes | Algorithm 1 Interval Shift |
| Open Source Code | Yes | Code available at https://github.com/smonsays/scale-compositionality |
| Open Datasets | No | We create the hyperteacher task family described in Section 2.4 with I = 16 input neurons, H = 16 hidden neurons and O = 16 output neurons to create a family of compositional regression tasks to be learned by a student. For the definition of the compositional preference task family, please refer to [18]. We construct a large number of compositional tasks that require composing several concepts. |
| Dataset Splits | Yes | Fraction of tasks held out from training: 0.125, 0.25, 0.5, 0.75, 0.9, 0.95, 0.98, 0.99 |
| Hardware Specification | Yes | We used a Linux workstation with two Nvidia RTX 3090 GPUs with 24GB of memory each for development and conducted hyperparameter searches and experiments on an internal Slurm cluster using Nvidia RTX 4090 GPUs and Nvidia A100 GPUs. |
| Software Dependencies | No | We implemented our experiments in Python using JAX [78, Apache License 2.0], Flax [79, Apache License 2.0], the Deepmind Jax Ecosystem [80, Apache License 2.0], PyTorch [BSD-style license 81], LLM [82, Apache License 2.0] and Scikit-learn [83, BSD 3-Clause License]. The paper lists software and their licenses but does not specify exact version numbers for each dependency. |
| Experiment Setup | Yes | Throughout our experiments, we use the AdamW optimizer [73] with a batch size of 128. On the hyperteacher task family, we use a mean-squared error loss, on the compositional preference task family, we use a cross-entropy loss. We performed an initial grid search over the learning rate and weight decay to find a common set of hyperparameters for all experiments on the hyperteacher task family and a common set of hyperparameters for all experiments on the compositional preference tasks respectively. We report the search grid in Table 3. |
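
To make the Dataset Splits row concrete, the sketch below partitions a set of task indices into training and held-out tasks for each of the listed fractions. It is only an illustration: the helper name `split_tasks`, the seed, the task count of 1000, and the use of uniform random splitting are assumptions made here, not details taken from the paper, whose actual procedure may differ (for example, by ensuring the training tasks cover the task space).

```python
import random

# Held-out fractions quoted in the Dataset Splits row above.
HELD_OUT_FRACTIONS = [0.125, 0.25, 0.5, 0.75, 0.9, 0.95, 0.98, 0.99]


def split_tasks(num_tasks, held_out_fraction, seed=0):
    """Randomly partition task indices into (train, held_out) lists.

    Hypothetical helper: uniform random splitting is an assumption,
    not necessarily the procedure used in the paper.
    """
    if not 0.0 <= held_out_fraction < 1.0:
        raise ValueError("held_out_fraction must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_tasks))
    rng.shuffle(indices)
    # Number of tasks withheld from training for this fraction.
    num_held_out = int(round(num_tasks * held_out_fraction))
    held_out = sorted(indices[:num_held_out])
    train = sorted(indices[num_held_out:])
    return train, held_out


if __name__ == "__main__":
    num_tasks = 1000  # illustrative value, not taken from the paper
    for fraction in HELD_OUT_FRACTIONS:
        train, held_out = split_tasks(num_tasks, fraction)
        print(f"held-out fraction {fraction}: "
              f"{len(train)} train, {len(held_out)} held-out tasks")
```

Fixing the seed keeps the partition reproducible across runs, so every held-out fraction is evaluated against the same shuffled ordering of tasks.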