Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

An Engorgio Prompt Makes Large Language Model Babble on

Authors: Jianshuo Dong, Ziyuan Zhang, Qingjie Zhang, Tianwei Zhang, Hao Wang, Hewu Li, Qi Li, Chao Zhang, Ke Xu, Han Qiu

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct extensive experiments on 13 open-sourced LLMs with parameters ranging from 125M to 30B. The results show that Engorgio prompts can successfully induce LLMs to generate abnormally long outputs (i.e., roughly 2-13× longer, reaching 90%+ of the output length limit) in a white-box scenario, and our real-world experiment demonstrates Engorgio's threat to LLM services with limited computing resources. The code is released at: https://github.com/jianshuod/Engorgio-prompt. To prove the effectiveness of Engorgio, we conduct extensive experiments over 6 base models and 7 supervised fine-tuned (SFT) models with parameters ranging from 125M to 30B, as listed in Table 5.
Researcher Affiliation Academia 1 Tsinghua University, 2 Nanyang Technological University
Pseudocode No The paper describes the methodology in text and provides a pipeline diagram (Figure 2), but it does not contain explicit pseudocode or algorithm blocks.
Open Source Code Yes The code is released at: https://github.com/jianshuod/Engorgio-prompt.
Open Datasets Yes Normal inputs: we collect 50 samples from the training dataset for Stanford-Alpaca (https://github.com/tatsu-lab/stanford-alpaca/), which are generated by OpenAI's text-davinci-003, and 50 samples from ShareGPT (https://sharegpt.com/), a website where people can share their ChatGPT conversations.
Dataset Splits No The paper mentions collecting 50 samples from specific datasets for baselines but does not provide details on training/test/validation splits used for its own method or the LLMs it targets.
Hardware Specification Yes We utilize the Hugging Face inference endpoint as our cloud service, deploying StableLM (maximal length of 4096) as the target LLM. Our experiments explore three GPU configurations: 1 Nvidia A10, 4 Nvidia A10, and 2 Nvidia A100, aiming to demonstrate how a small number of attackers can significantly compromise the service's performance.
Software Dependencies No The paper mentions using the 'Adam optimizer' and 'Gumbel-Softmax' but does not specify version numbers for these components, nor the programming languages or libraries used for implementation.
Experiment Setup Yes We use the Adam optimizer with a learning rate of 0.1 to update the distribution matrix θ. We allow a maximum of 300 optimization steps, the cost of which is acceptable, especially when considering the reusability as explained in Appendix A.6. The Gumbel-Softmax temperature factor τ is set to 1, and the default Engorgio prompt length is t = 32. The input length of normal inputs, special inputs, LLMEffiChecker, and sponge examples is roughly the same as Engorgio to ensure fairness. The loss coefficient λ is empirically set to 1.
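For readers unfamiliar with the setup's key mechanism, the following is a minimal NumPy sketch, not the authors' implementation, of the Gumbel-Softmax relaxation that keeps the prompt's distribution matrix θ differentiable during optimization. The shapes, the toy vocabulary size, and the function name `gumbel_softmax` are assumptions for illustration; only the prompt length t = 32 and temperature τ = 1 come from the paper, and the actual loss, the Adam updates, and the target LLM are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0, rng=rng):
    """Relax each row of logits into a soft one-hot sample (Gumbel-Softmax trick)."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y -= y.max(axis=-1, keepdims=True)                    # numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

t, vocab = 32, 1000                     # prompt length t = 32 as in the setup
theta = rng.normal(size=(t, vocab))     # learnable distribution matrix (toy init)
soft = gumbel_softmax(theta, tau=1.0)   # temperature tau = 1 as in the setup

# Each row is a valid distribution over the vocabulary, so gradients of a
# downstream loss can flow back into theta; discretization happens afterwards.
prompt_ids = soft.argmax(axis=-1)       # hard token ids for the final prompt
print(soft.shape, prompt_ids.shape)
```

In the paper's procedure, θ would then be updated by Adam (learning rate 0.1) for up to 300 steps against the attack objective; the sketch stops at the relaxation step itself.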