Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs
Authors: Yuhao Wu, Ming Shan Hee, Zhiqiang Hu, Roy Ka-Wei Lee
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform extensive experiments on both open-source and closed-source models, revealing that despite their advanced capabilities, most models struggle significantly with super-longform generation tasks, particularly in maintaining instruction adherence and coherence over long outputs. |
| Researcher Affiliation | Academia | Singapore University of Technology and Design |
| Pseudocode | Yes | Algorithm 1 Evaluations Pipeline |
| Open Source Code | Yes | We open-source LongGenBench to promote comprehensive evaluation and improvement in this critical area, with code and data available at https://github.com/mozhu621/LongGenBench. |
| Open Datasets | Yes | We introduce LongGenBench, a comprehensive dataset that provides a diverse set of tasks specifically designed to evaluate the super-long-form generation capabilities of LLMs across varying token lengths (16K and 32K) and levels of text complexity. |
| Dataset Splits | Yes | For each scenario, we generated 800 examples at two specified lengths: 16K tokens and 32K tokens. |
| Hardware Specification | Yes | Inferences were performed using BFloat16 precision on 8 NVIDIA A800 GPUs, employing greedy decoding to generate the outputs. |
| Software Dependencies | No | The paper mentions "We utilized the vLLM (Kwon et al., 2023) system" and "Huggingface (Wolf et al., 2019)", but does not specify version numbers for these or other software dependencies used in their experimental setup. |
| Experiment Setup | Yes | Task Configurations. For each scenario, we generated 800 examples at two specified lengths: 16K tokens and 32K tokens. The generation was based on designated templates for each model, ensuring task-specific relevance. ... To ensure the relevance of the generated content and prevent off-topic responses or refusals to answer, we prefixed each task input with a carefully curated answer prompt designed to guide the model's output. |
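
The setup quoted in the table can be summarized as a small configuration sketch. This is a minimal illustration in Python, not the authors' code: the names `RunConfig` and `build_input`, the field names, the placeholder prompt strings, and the way the answer prompt is joined to the task input are all hypothetical. Only the values (BFloat16, 8 GPUs, greedy decoding, 16K and 32K target lengths, 800 examples per scenario) come from the quoted passages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Settings quoted in the table; the field names are illustrative only."""
    precision: str = "bfloat16"            # "BFloat16 precision"
    num_gpus: int = 8                      # "8 NVIDIA A800 GPUs"
    decoding: str = "greedy"               # "greedy decoding"
    target_lengths: tuple = ("16K", "32K")  # the two specified lengths
    examples_per_scenario: int = 800       # how these split across lengths is not stated here


def build_input(task_prompt: str, answer_prefix: str) -> str:
    """Prefix a task input with an answer prompt.

    The separator and ordering are assumptions; the source only says the
    task input was prefixed with a curated answer prompt.
    """
    return f"{answer_prefix}\n\n{task_prompt}"


config = RunConfig()
example = build_input(
    task_prompt="<task instructions go here>",
    answer_prefix="<curated answer prompt goes here>",
)
```

Recording the settings in one frozen object makes it easier to spot what is still missing for a rerun, such as the unversioned software dependencies noted in the table.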