Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
CCL: Causal-aware In-context Learning for Out-of-Distribution Generalization
Authors: Hoyoon Byun, Gyeongdeok Seo, Joonseong Kang, Taero Kim, Jihee Kim, Kyungwoo Song
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate that CCL operates robustly in OOD scenarios and demonstrates superior generalization performance on both synthetic and real datasets. Code is available at: https://github.com/MLAI-Yonsei/causal-context-learning |
| Researcher Affiliation | Academia | Department of Statistics and Data Science, Yonsei University |
| Pseudocode | No | The paper only describes steps in regular paragraph text and equations without structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at: https://github.com/MLAI-Yonsei/causal-context-learning |
| Open Datasets | Yes | As another dataset to evaluate the performance of our methodology, we employ the MGSM (Multilingual Grade School Math) dataset [37]. The MGSM dataset is a human-annotated translation of 250 problems from the GSM8K dataset [38] into ten different languages. For MMLU [42], we retrieve five examples for each query without distinguishing among the 57 domains. For HotpotQA [43], we provide each query with its corresponding document and retrieve examples to form document-example pairs. |
| Dataset Splits | Yes | First, we extract embeddings for each question using OpenAI's text-embedding-3-small model. Based on these embeddings, we split the data into an ID and an OOD dataset. We use Swahili, Thai, Telugu, and Bengali for the OOD dataset, while the remaining languages are designated as ID. In the MGSM dataset, we evaluate performance by measuring the model's prediction accuracy. Similarly to the retrieval process, we use a 5-shot setting to assess performance and compare zero-shot (ZS), ICL (Fixed sample, KNN) and CCL. We follow the same 8-shot setting used in LLM-R to ensure a fair comparison. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) are mentioned in the main body of the paper. |
| Software Dependencies | No | The paper mentions specific models (e.g., OpenAI's text-embedding-3-small model, multilingual-e5-large-instruct [40], GPT-4o-mini, Llama-3.2-3B-IT, Phi-4-mini-IT) but does not provide specific version numbers for general programming languages or libraries used for implementation. |
| Experiment Setup | Yes | In the K-nearest-neighbor (KNN) variant, the K closest instances, where K equals the predefined shot size (Ω), are selected directly. We also investigate a K-means-based selection method that is governed by two hyperparameters, R and P. In the MGSM dataset, we evaluate performance by measuring the model's prediction accuracy. Similarly to the retrieval process, we use a 5-shot setting to assess performance. We follow the same 8-shot setting used in LLM-R to ensure a fair comparison. |
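
For illustration only, the language-based ID/OOD split and the KNN example selection quoted in the table could be sketched as below. This is a minimal sketch and not the authors' implementation (see the linked repository for that). The function names, the `language` record field, and the choice of cosine similarity as the distance measure are all assumptions; the excerpts above state only that the K closest instances are selected, with K equal to the shot size.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Languages the quoted text assigns to the OOD set; all others count as ID.
OOD_LANGUAGES = {"Swahili", "Thai", "Telugu", "Bengali"}


def split_by_language(records: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Split records into (ID, OOD) by a hypothetical 'language' field."""
    id_set = [r for r in records if r["language"] not in OOD_LANGUAGES]
    ood_set = [r for r in records if r["language"] in OOD_LANGUAGES]
    return id_set, ood_set


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_select(
    query: Sequence[float],
    pool: Sequence[Sequence[float]],
    shots: int,
) -> List[int]:
    """Return indices of the `shots` pool embeddings most similar to `query`.

    Cosine similarity is an assumed metric; ties are broken by lower index.
    """
    if shots < 0:
        raise ValueError("shots must be non-negative")
    scored = [(cosine_similarity(query, emb), idx) for idx, emb in enumerate(pool)]
    # Sort by descending similarity, then ascending index for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:shots]]


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real sentence embeddings.
    pool = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    print(knn_select([1.0, 0.0], pool, shots=2))
```

In a 5-shot setting, `shots` would be 5 and `pool` would hold the embeddings of the candidate demonstrations; the returned indices pick the examples placed in the prompt.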