Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Confucius: Iterative Tool Learning from Introspection Feedback by Easy-to-Difficult Curriculum
Authors: Shen Gao, Zhengliang Shi, Minghang Zhu, Bowen Fang, Xin Xin, Pengjie Ren, Zhumin Chen, Jun Ma, Zhaochun Ren
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments conducted on both controlled and real-world settings demonstrate the superiority of our tool learning framework in real-world application scenarios compared to both tuning-free (e.g., ChatGPT, Claude) and tuning-based baselines (e.g., GPT4Tools). |
| Researcher Affiliation | Academia | Shen Gao¹*, Zhengliang Shi¹*, Minghang Zhu¹, Bowen Fang¹, Xin Xin¹, Pengjie Ren¹, Zhumin Chen¹, Jun Ma¹, Zhaochun Ren²; ¹Shandong University, Qingdao, China; ²Leiden University, Leiden, The Netherlands |
| Pseudocode | No | The paper describes its methods verbally and through equations but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide a direct link to a source code repository or an explicit statement about the release of their implementation code. |
| Open Datasets | No | The paper describes constructing its own tool-use dataset by prompting ChatGPT and manually building a seed instance pool, but it does not provide concrete access information (link, DOI, specific citation) for this dataset to be publicly available. |
| Dataset Splits | No | The paper mentions using a 'training set' and 'test sets' ('Seen' and 'Unseen' each with 2,000 instances), but it does not specify the overall dataset size or explicit percentages/counts for training, validation, and testing splits, nor does it refer to predefined splits with citations. |
| Hardware Specification | Yes | The training of our model can be done within 20 hours with 4 NVIDIA A100-PCIE-80GB GPUs. |
| Software Dependencies | No | The paper mentions 'deepspeed ZeRO strategy' and 'LLaMA-7B' as the base model, but does not provide specific version numbers for software libraries or dependencies like Python, PyTorch, or DeepSpeed itself. |
| Experiment Setup | Yes | We optimize the model using deepspeed ZeRO strategy (Rasley et al. 2020) with the learning rate of 5e-5 and the weight decay coefficient of 0.01. |
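
The experiment-setup row reports only three training details: DeepSpeed ZeRO, a learning rate of 5e-5, and a weight decay of 0.01, together with the 4-GPU setup from the hardware row. As a rough illustration of what a reproduction attempt would still need to fill in, the sketch below builds a DeepSpeed-style configuration dictionary. It is not the authors' configuration: the helper name `build_ds_config`, the optimizer type (AdamW), the ZeRO stage, the micro-batch size, the gradient-accumulation steps, and the bf16 setting are all assumptions that the paper does not state.

```python
import json

# Reported in the paper: 4 GPUs, lr = 5e-5, weight decay = 0.01, DeepSpeed ZeRO.
# Every other value below is an ASSUMPTION chosen only for illustration.
NUM_GPUS = 4  # 4x NVIDIA A100-PCIE-80GB, as reported


def build_ds_config(micro_batch_per_gpu: int = 4,
                    grad_accum_steps: int = 8,
                    zero_stage: int = 2) -> dict:
    """Hypothetical helper returning a DeepSpeed-style config dict."""
    return {
        # DeepSpeed expects train_batch_size to equal
        # micro batch per GPU * gradient accumulation steps * number of GPUs.
        "train_batch_size": micro_batch_per_gpu * grad_accum_steps * NUM_GPUS,
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        "gradient_accumulation_steps": grad_accum_steps,
        "optimizer": {
            "type": "AdamW",  # assumption: optimizer type is not stated
            "params": {"lr": 5e-5, "weight_decay": 0.01},  # reported values
        },
        "zero_optimization": {"stage": zero_stage},  # assumption: stage not stated
        "bf16": {"enabled": True},  # assumption: precision not stated
    }


if __name__ == "__main__":
    # Print the config as JSON, the format DeepSpeed reads from a config file.
    print(json.dumps(build_ds_config(), indent=2))
```

Writing the unknowns out as explicit, labeled parameters makes it clear which choices a reproduction would have to guess, which is exactly the gap the "Software Dependencies: No" rating points to.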