Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Enhancing Chain of Thought Prompting in Large Language Models via Reasoning Patterns
Authors: Yufeng Zhang, Xuepeng Wang, Lingxiang Wu, Jinqiao Wang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Extensive experiments demonstrate that our method is more robust and consistently leads to improvements across various reasoning tasks." From the Experiments section: "In this section, our objective is to evaluate the effectiveness of our proposed method and answer the following research questions:" |
| Researcher Affiliation | Collaboration | 1 Institute of Automation, Chinese Academy of Sciences, Beijing, China; 2 School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; 3 Wuhan AI Research, Wuhan, China. EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: Pattern-CoT Demonstration Selection. Require: A set of task questions Q. Ensure: Demonstration list d = [d1, d2, ..., dk]. 1: Acquire operation token set T with LLM prompting or domain knowledge based on Q 2: for qi ∈ Q do 3: Generate rationale ri with Zero-Shot-CoT 4: pi = [] 5: for each token tij ∈ ri do 6: if tij ∈ T then 7: Update pi with tij 8: end if 9: end for 10: epi = encode(pi) 11: end for 12: Select proper k 13: Cluster all [ep1, ep2, ..., epN] into k clusters 14: Sample d = [d1, d2, ..., dk] from each cluster 15: return d |
| Open Source Code | Yes | https://github.com/Magicat128/Pattern-CoT |
| Open Datasets | Yes | Datasets. We adopt eight representative datasets for our reasoning tasks: MultiArith (Roy and Roth 2015), GSM8K (Cobbe et al. 2021), AddSub (Hosseini et al. 2014), AQuA-RAT (Ling et al. 2017), SingleEq (Koncel-Kedziorski et al. 2015), SVAMP (Patel, Bhattamishra, and Goyal 2021), Coin Flip (Wei et al. 2022b), and BIG-bench Date Understanding (Srivastava et al. 2023). |
| Dataset Splits | No | For a given task Q = {q1, q2, ..., qN} with N questions, we first need to obtain their rationales and answers {qi, ri, ai} that can be used as context for CoT prompting. For data from existing training sets, we can directly use the training data. The paper uses existing datasets, which often have predefined splits, but it does not explicitly state the splits (e.g., percentages or counts) anywhere in the paper itself. |
| Hardware Specification | Yes | These models are deployed on our local server, which is equipped with 8 RTX 3090 GPUs, each with 24GB of memory. |
| Software Dependencies | No | Specifically, we use models from the LLaMA-2 family due to their foundational logical reasoning capabilities and support for CoT prompting. To maintain consistency with (Zhang et al. 2023), we use Sentence-BERT (Reimers and Gurevych 2019) as our encoder and select the all-MiniLM-L6-v2 model for semantic vector representation. We use Captum (Miglani et al. 2023) to achieve this visualization. The paper mentions software tools and models, but does not provide specific version numbers for software libraries or dependencies. |
| Experiment Setup | Yes | Additionally, we set the hyperparameters with a temperature of 0.4 and top-p of 0.9 to manage the model's randomness (Xu et al. 2022). |
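The demonstration-selection procedure quoted in the Pseudocode row can be sketched in Python. This is a minimal, self-contained illustration, not the authors' implementation: the operation token set `OP_TOKENS`, the count-vector `encode` (the paper uses Sentence-BERT/all-MiniLM-L6-v2 embeddings), and the plain k-means routine are all stand-in assumptions; rationales are passed in precomputed rather than generated with Zero-Shot-CoT.

```python
import random
from collections import Counter

OP_TOKENS = ["+", "-", "*", "/"]  # hypothetical operation token set T

def extract_pattern(rationale):
    """Keep only operation tokens from a rationale, in order (Alg. 1, lines 4-9)."""
    return [tok for tok in rationale.split() if tok in OP_TOKENS]

def encode(pattern):
    """Toy encoder: a count vector over OP_TOKENS (stand-in for Sentence-BERT)."""
    counts = Counter(pattern)
    return [counts[t] for t in OP_TOKENS]

def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means; returns a cluster index for each input vector."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)

    def nearest(v):
        return min(range(k),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(v, centers[i])))

    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[nearest(v)].append(v)
        centers = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return [nearest(v) for v in vectors]

def select_demonstrations(examples, k, seed=0):
    """examples: list of (question, rationale) pairs with rationales precomputed.
    Clusters encoded operation patterns and samples one demo per non-empty cluster
    (Alg. 1, lines 12-15)."""
    vecs = [encode(extract_pattern(r)) for _, r in examples]
    assign = kmeans(vecs, k, seed=seed)
    rng = random.Random(seed)
    demos = []
    for c in range(k):
        members = [ex for ex, a in zip(examples, assign) if a == c]
        if members:
            demos.append(rng.choice(members))
    return demos
```

Usage: `select_demonstrations([("q1", "a + b + c"), ("q2", "x * y")], k=2)` returns up to two demonstrations, each drawn from a different reasoning-pattern cluster, so additions and multiplications are represented separately in the prompt context.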