reproducibilityindex.ai

Enhancing LLM’s Cognition via Structurization

Authors: Kai Liu, Zhihang Fu, Chao Chen, Wei Zhang, Rongxin Jiang, Fan Zhou, Yaowu Chen, Yue Wu, Jieping Ye

NeurIPS 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive evaluations are conducted across various model architectures and sizes (including a series of auto-regressive LLMs as well as BERT-like masking models) on a diverse set of NLP tasks (e.g., context-based question-answering, exhaustive hallucination evaluation, and passage-level dense retrieval). Empirical results show consistent and significant performance gains afforded by a single-round structurization.
Researcher Affiliation	Collaboration	Zhejiang University, Alibaba Cloud
Pseudocode	No	The paper provides a prompt template (Fig. 3) and describes steps in text, but it does not include a dedicated pseudocode or algorithm block.
Open Source Code	Yes	Code is available at https://github.com/alibaba/struxgpt.
Open Datasets	Yes	LongBench [3] is a multi-task benchmark tailored for long-context understanding evaluation, composed of 6 major task categories and 21 different tasks. (...) AttrScore [66] and FactScore [42] datasets are adopted for evaluation. (...) The BEIR dataset [52] is a popular benchmark for evaluating dense retrievers' zero-shot effectiveness [39, 33]
Dataset Splits	Yes	From the collected samples, 200 are utilized for evaluation (including human verification), and the remaining training samples are adopted to distill the structurization capability from Qwen-Max to our StruXGPT-7B.
Hardware Specification	Yes	The training is resource-friendly, which can be done on 8 NVIDIA V100 (16G) GPUs for 3.5 hours. For all the inference experiments, we leverage 1-2 NVIDIA A100-80G GPUs for model deployment. (...) The inference time, measured in seconds per sample, is calculated on an NVIDIA A100 GPU with vLLM [6] acceleration (except for the LLaMA2-70B model, which demands at least two A100 GPUs for deployment).
Software Dependencies	No	The paper mentions vLLM acceleration, but the '6' that follows it is a citation marker, not a version number. No other software or library versions are specified.
Experiment Setup	Yes	StruXGPT is trained with a constant learning rate of 5×10⁻⁶ for LLaMA and 1×10⁻⁵ for Qwen for 1 epoch. The batch size is 128, and other hyper-parameters follow the default settings from Touvron et al. [53] and Bai et al. [2].
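According to the Dataset Splits row, 200 of the collected samples are held out for evaluation and the rest are used to distill StruXGPT-7B. The sketch below shows one minimal way to make such a hold-out split. The seeded shuffle, the `split_samples` helper, and the 1,000-item placeholder pool are all hypothetical. The paper does not say how the 200 evaluation samples were chosen or how many samples were collected in total.

```python
import random

def split_samples(samples, n_eval=200, seed=0):
    """Return (train, evaluation): n_eval held-out samples, the rest for training.

    Hypothetical helper: the seeded random shuffle is an assumption, not the
    paper's documented selection procedure.
    """
    if n_eval > len(samples):
        raise ValueError("n_eval exceeds the number of samples")
    rng = random.Random(seed)      # fixed seed makes the split reproducible
    shuffled = list(samples)       # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    return shuffled[n_eval:], shuffled[:n_eval]

# Placeholder pool of sample IDs (not real data).
samples = [f"sample_{i}" for i in range(1000)]
train, evaluation = split_samples(samples)
print(len(train), len(evaluation))  # 800 200
```

Fixing the seed is what makes the held-out set identical across reruns. That matters here because the 200 evaluation samples were also human-verified.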
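The Hardware Specification row says LLaMA2-70B needs at least two A100-80G GPUs. A back-of-the-envelope estimate of weight memory shows why. The estimate assumes 16-bit weights (2 bytes per parameter) and treats the card's nominal 80 GB as fully usable. It ignores the KV cache and activations, so real requirements are higher. The helper names are hypothetical. The last line converts the reported training budget into GPU-hours.

```python
import math

def weight_memory_gb(n_params, bytes_per_param=2):
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

def min_gpus(n_params, gpu_mem_gb=80, bytes_per_param=2):
    """Lower bound on GPUs needed just to hold the weights.

    Ignores KV cache, activations, and framework overhead (assumption).
    """
    return math.ceil(weight_memory_gb(n_params, bytes_per_param) / gpu_mem_gb)

print(weight_memory_gb(70e9))  # 140.0
print(min_gpus(70e9))          # 2
print(min_gpus(7e9))           # 1

# Reported training budget: 8 V100 GPUs for 3.5 hours.
print(8 * 3.5)                 # 28.0 GPU-hours
```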
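The settings in the Experiment Setup row fit into a small configuration object. This sketch assumes data-parallel training across the 8 GPUs from the Hardware Specification row. The per-GPU micro-batch of 4 and the 1,000-sample training set in the usage lines are hypothetical values chosen only to illustrate the arithmetic. The paper leaves all unlisted hyper-parameters at the defaults of the cited LLaMA and Qwen recipes.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class FinetuneConfig:
    learning_rate: float          # held constant for the whole run
    global_batch_size: int = 128  # reported batch size
    epochs: int = 1               # reported number of epochs

# Reported learning rates per base model family.
CONFIGS = {
    "llama": FinetuneConfig(learning_rate=5e-6),
    "qwen": FinetuneConfig(learning_rate=1e-5),
}

def constant_lr(cfg: FinetuneConfig, step: int) -> float:
    """Constant schedule: the step index is ignored."""
    return cfg.learning_rate

def grad_accum_steps(global_batch: int, n_gpus: int, micro_batch: int) -> int:
    """Accumulation steps so that n_gpus * micro_batch * steps == global_batch."""
    per_step = n_gpus * micro_batch
    if global_batch % per_step != 0:
        raise ValueError("global batch must be divisible by n_gpus * micro_batch")
    return global_batch // per_step

def steps_per_epoch(n_train: int, global_batch: int) -> int:
    """Optimizer steps in one epoch; the last batch may be partial."""
    return math.ceil(n_train / global_batch)

cfg = CONFIGS["llama"]
print(constant_lr(cfg, step=0))                       # 5e-06
print(grad_accum_steps(cfg.global_batch_size, 8, 4))  # 4 (hypothetical micro-batch)
print(steps_per_epoch(1000, cfg.global_batch_size))   # 8 (hypothetical dataset size)
```

Fixing the global batch at 128 and deriving the accumulation steps from it means the reported batch size is preserved even when memory forces a smaller per-GPU micro-batch.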