Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

On the Statistical Mechanisms of Distributional Compositional Generalization

Authors: Jingwen Fu, Nanning Zheng

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 6. Experiments... 6.2. Experiments on trade-off and non-trade-off improvement... 6.3. Experiments on Generalization Bounds... Table 1. Values of I_{A,β}(T̂ = T, P_S) and GACC over 10 instances... Table 2. Performance comparison across rule complexities
Researcher Affiliation | Academia | Jingwen Fu 1, Nanning Zheng 1... 1National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University. Correspondence to: Nanning Zheng <EMAIL>.
Pseudocode | No | The paper describes methods and analyses using mathematical formulations and descriptive text, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statements about releasing source code or provide links to a code repository.
Open Datasets | No | 1) Components and compositional rule: We construct two word sets A, B satisfying |A| = |B| = 1000... 2) We pretrain the GPT-2 model using different pretraining data schedules. The pretraining data is generated from a subset of composition rules same as those in the downstream task, but with entirely different words.
Dataset Splits | Yes | 2) Distribution split: The support distribution takes the elements in the set {(e1, e2) | (e1, e2) ∈ a1 × b1 ∪ a2 × b1 ∪ a1 × b2}. The target distribution takes the elements in the set {(e1, e2) | (e1, e2) ∈ a2 × b2}. It is easy to verify that these designs satisfy the requirements listed in Section 3.
Hardware Specification | No | The paper mentions using the GPT-2 model with specific configurations (4 layers, 4 attention heads, embedding size of 128; or 6 layers, 8 attention heads, embedding size of 256) but does not provide any details about the specific hardware (GPU/CPU models, memory, etc.) used for experiments.
Software Dependencies | No | The paper refers to using the GPT-2 model but does not provide specific version numbers for GPT-2 or any other software dependencies, such as programming languages or libraries.
Experiment Setup | Yes | 1) We employ the GPT-2 model with two configurations. Setting 1: 4 layers, 4 attention heads, and an embedding size of 128. Setting 2: 6 layers, 8 attention heads, and an embedding size of 256. 2) We pretrain the GPT-2 model using different pretraining data schedules.
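The support/target split quoted in the Dataset Splits row can be sketched in plain Python. The partition variables (a1, a2, b1, b2) follow the excerpt; the concrete word values and the equal-halves split point are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch of the split in the "Dataset Splits" row.
# Component sets A and B (|A| = |B| = 1000 per the paper) are each
# partitioned in two; the half-and-half split point is an assumption.
A = list(range(1000))          # word set A (values are placeholders)
B = list(range(1000, 2000))    # word set B, disjoint from A
a1, a2 = A[:500], A[500:]      # partition of A
b1, b2 = B[:500], B[500:]      # partition of B

def cross(xs, ys):
    """All ordered pairs (e1, e2) with e1 in xs and e2 in ys."""
    return {(e1, e2) for e1 in xs for e2 in ys}

# Support distribution: a1 × b1 ∪ a2 × b1 ∪ a1 × b2
support = cross(a1, b1) | cross(a2, b1) | cross(a1, b2)
# Target distribution: a2 × b2
target = cross(a2, b2)

# The two sets are disjoint, so performing well on the target
# distribution requires composing components never paired in support.
assert support.isdisjoint(target)
```

The three support blocks are pairwise disjoint (each pair differs in its a-half or b-half), so |support| = 3 · 500² and |target| = 500².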
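The two model configurations quoted in the Experiment Setup row can be captured as a small config sketch. The field names follow the GPT-2 convention (n_layer, n_head, n_embd), but this is an assumption about the implementation; the paper does not name its framework or configuration keys.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GPT2Size:
    n_layer: int   # number of transformer blocks
    n_head: int    # attention heads per block
    n_embd: int    # embedding / hidden size

# The two configurations quoted in the "Experiment Setup" row.
SETTING_1 = GPT2Size(n_layer=4, n_head=4, n_embd=128)
SETTING_2 = GPT2Size(n_layer=6, n_head=8, n_embd=256)

# Sanity check: the embedding size must divide evenly across heads.
for cfg in (SETTING_1, SETTING_2):
    assert cfg.n_embd % cfg.n_head == 0
```

Both settings give a per-head dimension of 32 (128/4 and 256/8), consistent with standard multi-head attention.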