Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Enhancing LLM’s Cognition via Structurization
Authors: Kai Liu, Zhihang Fu, Chao Chen, Wei Zhang, Rongxin Jiang, Fan Zhou, Yaowu Chen, Yue Wu, Jieping Ye
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations are conducted across various model architectures and sizes (including a series of auto-regressive LLMs as well as BERT-like masking models) on a diverse set of NLP tasks (e.g., context-based question-answering, exhaustive hallucination evaluation, and passage-level dense retrieval). Empirical results show consistent and significant performance gains afforded by a single-round structurization. |
| Researcher Affiliation | Collaboration | ¹Zhejiang University, ²Alibaba Cloud |
| Pseudocode | No | The paper provides a prompt template (Fig. 3) and describes steps in text, but it does not include a dedicated pseudocode or algorithm block. |
| Open Source Code | Yes | Code is available at https://github.com/alibaba/struxgpt. |
| Open Datasets | Yes | LongBench [3] is a multi-task benchmark tailored for long context understanding evaluation, composed of 6 major task categories and 21 different tasks. (...) AttrScore [66] and FactScore [42] datasets are adopted for evaluation. (...) BEIR dataset [52] is a popular benchmark for evaluating dense retrievers' zero-shot effectiveness [39, 33] |
| Dataset Splits | Yes | From the collected samples, 200 are utilized for evaluation (including human verification), and the remaining training samples are adopted to distill the structurization capability from Qwen-Max to our StruXGPT-7B. |
| Hardware Specification | Yes | The training is resource-friendly, which can be done on 8 NVIDIA V100 (16G) GPUs for 3.5 hours. For all the inference experiments, we leverage 1-2 NVIDIA A100-80G GPUs for model deployment. (...) The inference time, measured in seconds per sample, is calculated on an NVIDIA A100 GPU with vllm 6 acceleration (except for the LLaMA2-70B model, which demands at least two A100 GPUs for deployment). |
| Software Dependencies | No | The paper mentions 'vllm 6 acceleration' but '6' refers to a citation, not a specific version number. It does not provide other specific software or library version numbers. |
| Experiment Setup | Yes | StruXGPT is trained with a constant learning rate of $5 \times 10^{-6}$ for LLaMA and $1 \times 10^{-5}$ for Qwen for 1 epoch. The batch size is 128, and other hyper-parameters follow the default settings from Touvron et al. [53] and Bai et al. [2]. |