Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
In-Context Compositional Learning via Sparse Coding Transformer
Authors: Wei Chen, Jingxi Yu, Zichen Miao, Qiang Qiu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first assess the effectiveness of our method on a toy example with a simple compositional rule, demonstrating that our approach successfully learns and generalizes the rule, whereas the standard Transformer fails in this case. The results are shown in Figure 3. We then evaluate our method on the in-context compositional learning dataset, such as S-RAVEN [20] and RAVEN [29]. Our approach consistently outperforms standard Transformer baselines. |
| Researcher Affiliation | Academia | Wei Chen, Jingxi Yu, Zichen Miao, Qiang Qiu Purdue University, IN, USA EMAIL |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are provided. The methodology is described in prose and mathematical equations within Section 2.3 and Section 9. |
| Open Source Code | Yes | Answer: [Yes] Justification: We have provided the code for our main experimental results. |
| Open Datasets | Yes | We then evaluate our method on the in-context compositional learning dataset, such as S-RAVEN [20] and RAVEN [29]. Our approach consistently outperforms standard Transformer baselines. |
| Dataset Splits | Yes | To evaluate whether a model trained on a subset of rule combinations can generalize to unseen combinations, we partition all possible rule combinations into separate training and test sets, where 25% of the combinations are held out for testing. Model performance is assessed by measuring the accuracy of correctly predicted examples from the test set. |
| Hardware Specification | Yes | We conducted development and experiments on a Linux workstation equipped with a single NVIDIA A5000 GPU (24GB memory). |
| Software Dependencies | No | The experimental setting of synthetic data. ...We optimize the model using the Adam optimizer with a learning rate of 0.001 and employ mean squared error (MSE) loss as the training objective. (Similar mentions for S-RAVEN and RAVEN but no version numbers for libraries/frameworks). |
| Experiment Setup | Yes | The experimental setting of synthetic data. ...The input sequence length is fixed at N = 32, with a feature dimension of 16 and a single attention head (H = 1). Training is performed over 200 epochs using a batch size of 128. We optimize the model using the Adam optimizer with a learning rate of 0.001 and employ mean squared error (MSE) loss as the training objective. ...The experimental setting of S-RAVEN. ...varying the number of layers between 4 and 8. The input has a feature dimension of 128 and 16 attention heads (H = 16). ...Training is conducted for one epoch using a batch size of 128, the Adam optimizer with a learning rate of 0.001 and a weight decay of 0.1, and the cross-entropy loss as the objective. ...The experimental setting of RAVEN. ...The model is a standard Transformer with 4 layers, a sequence length of N = 36, a feature dimension of 512, and 16 attention heads (H = 16). Training is performed over 2000 epochs with a batch size of 256, using the Adam optimizer with a learning rate of 0.0001. The model is trained to minimize mean squared error (MSE) loss. |
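
The Dataset Splits row describes holding out 25% of rule combinations for testing. The following is a minimal sketch of one way to build such a split, not the authors' code: the rule names, the combination size, the seed, and the function `split_rule_combinations` are all hypothetical and chosen only for illustration.

```python
import itertools
import random


def split_rule_combinations(rules, combo_size, test_fraction=0.25, seed=0):
    """Partition all rule combinations into disjoint train and test sets.

    Illustrative only: the paper states that 25% of combinations are held
    out, but not how they are enumerated or sampled.
    """
    # Enumerate every unordered combination of `combo_size` rules.
    combos = list(itertools.combinations(rules, combo_size))
    # Shuffle with a fixed seed so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(combos)
    # Hold out the first `test_fraction` of the shuffled combinations.
    n_test = int(len(combos) * test_fraction)
    return combos[n_test:], combos[:n_test]


# Hypothetical rule names; the real S-RAVEN and RAVEN rule sets differ.
rules = [f"rule_{i}" for i in range(8)]
train_combos, test_combos = split_rule_combinations(rules, combo_size=2)
print(len(train_combos), len(test_combos))
```

Because the split is made over rule combinations rather than over individual examples, no test example shares its rule combination with any training example, which is what the generalization claim in that row relies on.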