Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Enhancing LLM’s Cognition via Structurization

Authors: Kai Liu, Zhihang Fu, Chao Chen, Wei Zhang, Rongxin Jiang, Fan Zhou, Yaowu Chen, Yue Wu, Jieping Ye

NeurIPS 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations are conducted across various model architectures and sizes (including a series of auto-regressive LLMs as well as BERT-like masking models) on a diverse set of NLP tasks (e.g., context-based question-answering, exhaustive hallucination evaluation, and passage-level dense retrieval). Empirical results show consistent and significant performance gains afforded by a single-round structurization.
Researcher Affiliation | Collaboration | ¹Zhejiang University, ²Alibaba Cloud
Pseudocode | No | The paper provides a prompt template (Fig. 3) and describes steps in text, but it does not include a dedicated pseudocode or algorithm block.
Open Source Code | Yes | Code is available at https://github.com/alibaba/struxgpt.
Open Datasets | Yes | LongBench [3] is a multi-task benchmark tailored for long context understanding evaluation, composed of 6 major task categories and 21 different tasks. (...) AttrScore [66] and FactScore [42] datasets are adopted for evaluation. (...) BEIR dataset [52] is a popular benchmark for evaluating dense retrievers' zero-shot effectiveness [39, 33].
Dataset Splits | Yes | From the collected samples, 200 are utilized for evaluation (including human verification), and the remaining training samples are adopted to distill the structurization capability from Qwen-Max to our StruXGPT-7B.
Hardware Specification | Yes | The training is resource-friendly, which can be done on 8 NVIDIA V100 (16G) GPUs for 3.5 hours. For all the inference experiments, we leverage 1-2 NVIDIA A100-80G GPUs for model deployment. (...) The inference time, measured in seconds per sample, is calculated on an NVIDIA A100 GPU with vllm 6 acceleration (except for the LLaMA2-70B model, which demands at least two A100 GPUs for deployment).
Software Dependencies | No | The paper mentions 'vllm 6 acceleration' but '6' refers to a citation, not a specific version number. It does not provide other specific software or library version numbers.
Experiment Setup | Yes | StruXGPT is trained with a constant learning rate of 5 × 10⁻⁶ for LLaMA and 1 × 10⁻⁵ for Qwen for 1 epoch. The batch size is 128, and other hyper-parameters follow the default settings from Touvron et al. [53] and Bai et al. [2].
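
As a rough illustration only, the reported experiment setup and dataset split can be written down as a small Python sketch. The learning rates, the single epoch, the batch size of 128, the 8 GPUs, and the 200 evaluation samples come from the entries above. Everything else is an assumption made for this example and is not taken from the paper or its repository: the names `TrainConfig`, `make_config`, `split_samples`, and `grad_accum_steps`, the seeded random hold-out, the reading of 128 as a global batch size, and the per-device batch size used in the accumulation arithmetic.

```python
import random
from dataclasses import dataclass

# Learning rates as reported in the Experiment Setup entry.
# The dictionary keys are hypothetical labels chosen for this sketch.
REPORTED_LR = {"llama": 5e-6, "qwen": 1e-5}


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the reported hyper-parameters."""
    base_model: str
    learning_rate: float
    num_epochs: int = 1            # reported: 1 epoch
    global_batch_size: int = 128   # reported: 128 (assumed to be global)
    lr_schedule: str = "constant"  # reported: constant learning rate


def make_config(base_model: str) -> TrainConfig:
    """Build a config for 'llama' or 'qwen' from the reported values."""
    return TrainConfig(base_model=base_model,
                       learning_rate=REPORTED_LR[base_model])


def split_samples(samples, num_eval=200, seed=0):
    """Hold out num_eval samples for evaluation and keep the rest for training.

    The summary does not say how the 200 samples were chosen; a seeded
    random shuffle is an assumption of this sketch.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[num_eval:], shuffled[:num_eval]


def grad_accum_steps(global_batch, num_gpus, per_device_batch):
    """Accumulation steps needed to reach the global batch size.

    per_device_batch is not reported anywhere above; it is a free
    parameter of this sketch.
    """
    denom = num_gpus * per_device_batch
    if global_batch % denom != 0:
        raise ValueError("global batch must be divisible by num_gpus * per_device_batch")
    return global_batch // denom


if __name__ == "__main__":
    print(make_config("llama"))
    train, held_out = split_samples(range(1000))
    print(len(train), len(held_out))
    # With a hypothetical per-device batch of 2 on the 8 reported GPUs:
    print(grad_accum_steps(128, 8, 2))
```

Settings not listed above are described only as following the defaults of Touvron et al. [53] and Bai et al. [2], so the sketch leaves them out rather than guessing values.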