Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

CCL: Causal-aware In-context Learning for Out-of-Distribution Generalization

Authors: Hoyoon Byun, Gyeongdeok Seo, Joonseong Kang, Taero Kim, Jihee Kim, Kyungwoo Song

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We empirically validate that CCL operates robustly in OOD scenarios and demonstrates superior generalization performance on both synthetic and real datasets. Code is available at: https://github.com/MLAI-Yonsei/causal-context-learning
Researcher Affiliation	Academia	Department of Statistics and Data Science, Yonsei University
Pseudocode	No	The paper only describes steps in regular paragraph text and equations without structured pseudocode or algorithm blocks.
Open Source Code	Yes	Code is available at: https://github.com/MLAI-Yonsei/causal-context-learning
Open Datasets	Yes	As another dataset to evaluate the performance of our methodology, we employ the MGSM (Multilingual Grade School Math) dataset [37]. The MGSM dataset is a human-annotated translation of 250 problems from the GSM8K dataset [38] into ten different languages. For MMLU [42], we retrieve five examples for each query without distinguishing among the 57 domains. For HotpotQA [43], we provide each query with its corresponding document and retrieve examples to form document-example pairs.
Dataset Splits	Yes	First, we extract embeddings for each question using OpenAI's text-embedding-3-small model. Based on these embeddings, we split the data into an ID and an OOD dataset. We use Swahili, Thai, Telugu, and Bengali for the OOD dataset, while the remaining languages are designated as ID. In the MGSM dataset, we evaluate performance by measuring the model's prediction accuracy. Similarly to the retrieval process, we use a 5-shot setting to assess performance and compare zero-shot (ZS), ICL (Fixed sample, KNN) and CCL. We follow the same 8-shot setting used in LLM-R to ensure a fair comparison.
Hardware Specification	No	No specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) are mentioned in the main body of the paper.
Software Dependencies	No	The paper mentions specific models (e.g., OpenAI's text-embedding-3-small model, multilingual-e5-large-instruct [40], GPT-4o-mini, Llama-3.2-3B-IT, Phi-4-mini-IT) but does not provide specific version numbers for general programming languages or libraries used for implementation.
Experiment Setup	Yes	In the K-nearest-neighbor (KNN) variant, the K closest instances, where K equals the predefined shot size (Ω), are selected directly. We also investigate a K-means-based selection method that is governed by two hyperparameters, R and P. In the MGSM dataset, we evaluate performance by measuring the model's prediction accuracy. Similarly to the retrieval process, we use a 5-shot setting to assess performance. We follow the same 8-shot setting used in LLM-R to ensure a fair comparison.
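
The language-based ID/OOD split and the KNN selection step quoted in the table can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). The function names `split_id_ood` and `knn_select`, the record layout with a `language` key, and the use of cosine similarity as the notion of "closest" are all assumptions made for illustration; the quoted text does not specify the distance metric.

```python
import numpy as np

# OOD languages as listed in the quoted split description.
OOD_LANGUAGES = {"Swahili", "Thai", "Telugu", "Bengali"}


def split_id_ood(records):
    """Partition records into ID and OOD lists by their 'language' field.

    The dict layout with a 'language' key is a hypothetical record format.
    """
    id_set = [r for r in records if r["language"] not in OOD_LANGUAGES]
    ood_set = [r for r in records if r["language"] in OOD_LANGUAGES]
    return id_set, ood_set


def knn_select(query_emb, pool_embs, k):
    """Return indices of the k pool embeddings closest to the query.

    Closeness is measured by cosine similarity (an assumption); k plays
    the role of the shot size.
    """
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = p @ q  # cosine similarity of each pool row to the query
    return np.argsort(-sims)[:k]  # highest similarity first


if __name__ == "__main__":
    records = [
        {"language": "English", "question": "q1"},
        {"language": "Thai", "question": "q2"},
        {"language": "German", "question": "q3"},
    ]
    id_set, ood_set = split_id_ood(records)
    print(len(id_set), len(ood_set))

    # Toy 2-D embeddings standing in for real embedding-model outputs.
    pool = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
    query = np.array([1.0, 0.0])
    print(knn_select(query, pool, k=2))
```

In a 5-shot setting, `k` would be 5 and `pool_embs` would hold the embeddings of the ID examples available for retrieval.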