Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

In-Context Compositional Learning via Sparse Coding Transformer

Authors: Wei Chen, Jingxi Yu, Zichen Miao, Qiang Qiu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We first assess the effectiveness of our method on a toy example with a simple compositional rule, demonstrating that our approach successfully learns and generalizes the rule, whereas the standard Transformer fails in this case. The results are shown in Figure 3. We then evaluate our method on the in-context compositional learning dataset, such as S-RAVEN [20] and RAVEN [29]. Our approach consistently outperforms standard Transformer baselines.
Researcher Affiliation	Academia	Wei Chen, Jingxi Yu, Zichen Miao, Qiang Qiu Purdue University, IN, USA
Pseudocode	No	No explicit pseudocode or algorithm blocks are provided. The methodology is described in prose and mathematical equations within Section 2.3 and Section 9.
Open Source Code	Yes	Answer: [Yes] Justification: We have provided the code for our main experimental results.
Open Datasets	Yes	We then evaluate our method on the in-context compositional learning dataset, such as S-RAVEN [20] and RAVEN [29]. Our approach consistently outperforms standard Transformer baselines.
Dataset Splits	Yes	To evaluate whether a model trained on a subset of rule combinations can generalize to unseen combinations, we partition all possible rule combinations into separate training and test sets, where 25% of the combinations are held out for testing. Model performance is assessed by measuring the accuracy of correctly predicted examples from the test set.
Hardware Specification	Yes	We conducted development and experiments on a Linux workstation equipped with a single NVIDIA A5000 GPU (24GB memory).
Software Dependencies	No	The experimental setting of synthetic data. ...We optimize the model using the Adam optimizer with a learning rate of 0.001 and employ mean squared error (MSE) loss as the training objective. (Similar mentions for S-RAVEN and RAVEN but no version numbers for libraries/frameworks).
Experiment Setup	Yes	The experimental setting of synthetic data. ...The input sequence length is fixed at N = 32, with a feature dimension of 16 and a single attention head (H = 1). Training is performed over 200 epochs using a batch size of 128. We optimize the model using the Adam optimizer with a learning rate of 0.001 and employ mean squared error (MSE) loss as the training objective. ...The experimental setting of S-RAVEN. ...varying the number of layers between 4 and 8. The input has a feature dimension of 128 and 16 attention heads (H = 16). ...Training is conducted for one epoch using a batch size of 128, the Adam optimizer with a learning rate of 0.001 and a weight decay of 0.1, and the cross-entropy loss as the objective. ...The experimental setting of RAVEN. ...The model is a standard Transformer with 4 layers, a sequence length of N = 36, a feature dimension of 512, and 16 attention heads (H = 16). Training is performed over 2000 epochs with a batch size of 256, using the Adam optimizer with a learning rate of 0.0001. The model is trained to minimize mean squared error (MSE) loss.
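
The settings quoted in the Dataset Splits and Experiment Setup rows can be collected in one place when attempting a rerun. The following is a minimal Python sketch and not the authors' code: the names Setup, SETUPS, and split_rule_combinations are hypothetical, values the quoted text does not state are left as None, and "between 4 and 8" layers is stored as a (min, max) pair without assuming which intermediate depths were run. How the paper enumerates rule combinations is not given in the table, so the split helper assumes unordered combinations of a fixed size and an arbitrary random seed.

```python
import itertools
import random
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Setup:
    """Hyperparameters as quoted in the table; None means not stated there."""
    epochs: int
    batch_size: int
    learning_rate: float
    loss: str
    heads: int
    feature_dim: int
    seq_len: Optional[int] = None
    layer_range: Optional[Tuple[int, int]] = None  # (min, max) layers
    weight_decay: Optional[float] = None


# All three setups use the Adam optimizer according to the quoted text.
SETUPS = {
    "synthetic": Setup(epochs=200, batch_size=128, learning_rate=1e-3,
                       loss="mse", heads=1, feature_dim=16, seq_len=32),
    "s_raven": Setup(epochs=1, batch_size=128, learning_rate=1e-3,
                     loss="cross_entropy", heads=16, feature_dim=128,
                     layer_range=(4, 8), weight_decay=0.1),
    "raven": Setup(epochs=2000, batch_size=256, learning_rate=1e-4,
                   loss="mse", heads=16, feature_dim=512, seq_len=36,
                   layer_range=(4, 4)),
}


def split_rule_combinations(rules, combo_size, test_fraction=0.25, seed=0):
    """Hold out a fraction of rule combinations for testing.

    Assumption: combinations are unordered and of fixed size; the seed is
    arbitrary and not taken from the paper.
    """
    combos = list(itertools.combinations(rules, combo_size))
    rng = random.Random(seed)
    rng.shuffle(combos)
    n_test = round(len(combos) * test_fraction)
    # First n_test shuffled combinations are held out; the rest are for training.
    return combos[n_test:], combos[:n_test]


if __name__ == "__main__":
    # Placeholder rule names, used only to exercise the split.
    rules = [f"rule_{i}" for i in range(8)]
    train, test = split_rule_combinations(rules, combo_size=2)
    print(len(train), "training combinations,", len(test), "held out")
```

With eight placeholder rules taken two at a time there are 28 combinations, so a 25% hold-out leaves 21 for training and 7 for testing. Software versions are not reported (see the Software Dependencies row), so the sketch deliberately avoids any deep learning framework.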