Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Enhancing Chain of Thought Prompting in Large Language Models via Reasoning Patterns
Authors: Yufeng Zhang, Xuepeng Wang, Lingxiang Wu, Jinqiao Wang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Extensive experiments demonstrate that our method is more robust and consistently leads to improvements across various reasoning tasks." From the Experiments section: "In this section, our objective is to evaluate the effectiveness of our proposed method and answer the following research questions:" |
| Researcher Affiliation | Collaboration | 1 Institute of Automation, Chinese Academy of Sciences, Beijing, China; 2 School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; 3 Wuhan AI Research, Wuhan, China. EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: Pattern-CoT Demonstration Selection. Require: A set of task questions Q. Ensure: Demonstration list d = [d1, d2, ..., dk]. 1: Acquire operation token set T with LLM prompting or domain knowledge based on Q 2: for qi ∈ Q do 3: Generate rationale ri with Zero-Shot-CoT 4: pi = [] 5: for each token tij ∈ ri do 6: if tij ∈ T then 7: Update pi with tij 8: end if 9: end for 10: epi = encode(pi) 11: end for 12: Select proper k 13: Cluster all [ep1, ep2, ..., epN] into k clusters 14: Sample d = [d1, d2, ..., dk] from each cluster 15: return d |
| Open Source Code | Yes | https://github.com/Magicat128/Pattern-CoT |
| Open Datasets | Yes | Datasets. We adopt eight representative datasets for our reasoning tasks: MultiArith (Roy and Roth 2015), GSM8K (Cobbe et al. 2021), AddSub (Hosseini et al. 2014), AQuA-RAT (Ling et al. 2017), SingleEq (Koncel-Kedziorski et al. 2015), SVAMP (Patel, Bhattamishra, and Goyal 2021), Coin Flip (Wei et al. 2022b), and BIG-bench Date Understanding (Srivastava et al. 2023). |
| Dataset Splits | No | For a given task Q = {q1, q2, ..., qN} with N questions, we first need to obtain their rationales and answers {qi, ri, ai} that can be used as context for CoT prompting. For data from existing training sets, we can directly use the training data. The paper uses existing datasets, which often have predefined splits, but it does not explicitly state the splits (e.g., percentages or counts) anywhere in the paper itself. |
| Hardware Specification | Yes | These models are deployed on our local server, which is equipped with 8 RTX 3090 GPUs, each with 24GB of memory. |
| Software Dependencies | No | Specifically, we use models from the LLaMA-2 family due to their foundational logical reasoning capabilities and support for CoT prompting. To maintain consistency with (Zhang et al. 2023), we use Sentence-BERT (Reimers and Gurevych 2019) as our encoder and select the all-MiniLM-L6-v2 model for semantic vector representation. We use Captum (Miglani et al. 2023) to achieve this visualization. The paper mentions software tools and models, but does not provide specific version numbers for software libraries or dependencies. |
| Experiment Setup | Yes | Additionally, we set the hyperparameters with a temperature of 0.4 and top-p of 0.9 to manage the model's randomness (Xu et al. 2022). |
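The demonstration-selection procedure quoted in the Pseudocode row can be sketched in Python. This is a minimal, self-contained illustration, not the authors' implementation: the operation token set `OP_TOKENS`, the count-vector `encode` (the paper uses Sentence-BERT/all-MiniLM-L6-v2 embeddings), and the plain k-means routine are all stand-in assumptions; rationales are passed in precomputed rather than generated with Zero-Shot-CoT.

```python
import random
from collections import Counter

OP_TOKENS = ["+", "-", "*", "/"]  # hypothetical operation token set T

def extract_pattern(rationale):
    """Keep only operation tokens from a rationale, in order (Alg. 1, lines 4-9)."""
    return [tok for tok in rationale.split() if tok in OP_TOKENS]

def encode(pattern):
    """Toy encoder: a count vector over OP_TOKENS (stand-in for Sentence-BERT)."""
    counts = Counter(pattern)
    return [counts[t] for t in OP_TOKENS]

def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means; returns a cluster index for each input vector."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)

    def nearest(v):
        return min(range(k),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(v, centers[i])))

    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[nearest(v)].append(v)
        centers = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return [nearest(v) for v in vectors]

def select_demonstrations(examples, k, seed=0):
    """examples: list of (question, rationale) pairs with rationales precomputed.
    Clusters encoded operation patterns and samples one demo per non-empty cluster
    (Alg. 1, lines 12-15)."""
    vecs = [encode(extract_pattern(r)) for _, r in examples]
    assign = kmeans(vecs, k, seed=seed)
    rng = random.Random(seed)
    demos = []
    for c in range(k):
        members = [ex for ex, a in zip(examples, assign) if a == c]
        if members:
            demos.append(rng.choice(members))
    return demos
```

Usage: `select_demonstrations([("q1", "a + b + c"), ("q2", "x * y")], k=2)` returns up to two demonstrations, each drawn from a different reasoning-pattern cluster, so additions and multiplications are represented separately in the prompt context.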