Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

An Engorgio Prompt Makes Large Language Model Babble on

Authors: Jianshuo Dong, Ziyuan Zhang, Qingjie Zhang, Tianwei Zhang, Hao Wang, Hewu Li, Qi Li, Chao Zhang, Ke Xu, Han Qiu

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct extensive experiments on 13 open-sourced LLMs with parameters ranging from 125M to 30B. The results show that Engorgio prompts can successfully induce LLMs to generate abnormally long outputs (i.e., roughly 2-13× longer, reaching 90%+ of the output length limit) in a white-box scenario, and our real-world experiment demonstrates Engorgio's threat to LLM services with limited computing resources. The code is released at: https://github.com/jianshuod/Engorgio-prompt. To prove the effectiveness of Engorgio, we conduct extensive experiments over 6 base models and 7 supervised fine-tuned (SFT) models with parameters ranging from 125M to 30B, as listed in Table 5.
Researcher Affiliation Academia 1 Tsinghua University, 2 Nanyang Technological University
Pseudocode No The paper describes the methodology in text and provides a pipeline diagram (Figure 2), but it does not contain explicit pseudocode or algorithm blocks.
Open Source Code Yes The code is released at: https://github.com/jianshuod/Engorgio-prompt.
Open Datasets Yes Normal inputs: we collect 50 samples from the training dataset for Stanford-Alpaca (https://github.com/tatsu-lab/stanford-alpaca/), which are generated by OpenAI's text-davinci-003, and 50 samples from ShareGPT (https://sharegpt.com/), a website where people can share their ChatGPT conversations.
Dataset Splits No The paper mentions collecting 50 samples from specific datasets for baselines but does not provide details on training/test/validation splits used for its own method or the LLMs it targets.
Hardware Specification Yes We utilize the Hugging Face inference endpoint as our cloud service, deploying StableLM (maximal length of 4096) as the target LLM. Our experiments explore three GPU configurations: 1 Nvidia A10, 4 Nvidia A10, and 2 Nvidia A100, aiming to demonstrate how a small number of attackers can significantly compromise the service's performance.
Software Dependencies No The paper mentions using the 'Adam optimizer' and 'Gumbel-Softmax' but does not specify version numbers for these components, nor the programming languages or libraries used for implementation.
Experiment Setup Yes We use the Adam optimizer with a learning rate of 0.1 to update the distribution matrix θ. We allow a maximum of 300 optimization steps, the cost of which is acceptable, especially when considering the reusability as explained in Appendix A.6. The Gumbel-Softmax temperature factor τ is set to 1, and the default Engorgio prompt length is t = 32. The input length of normal inputs, special inputs, LLMEffiChecker, and sponge examples is roughly the same as Engorgio to ensure fairness. The loss coefficient λ is empirically set to 1.
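For readers unfamiliar with the setup's key mechanism, the following is a minimal NumPy sketch, not the authors' implementation, of the Gumbel-Softmax relaxation that keeps the prompt's distribution matrix θ differentiable during optimization. The shapes, the toy vocabulary size, and the function name `gumbel_softmax` are assumptions for illustration; only the prompt length t = 32 and temperature τ = 1 come from the paper, and the actual loss, the Adam updates, and the target LLM are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0, rng=rng):
    """Relax each row of logits into a soft one-hot sample (Gumbel-Softmax trick)."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y -= y.max(axis=-1, keepdims=True)                    # numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

t, vocab = 32, 1000                     # prompt length t = 32 as in the setup
theta = rng.normal(size=(t, vocab))     # learnable distribution matrix (toy init)
soft = gumbel_softmax(theta, tau=1.0)   # temperature tau = 1 as in the setup

# Each row is a valid distribution over the vocabulary, so gradients of a
# downstream loss can flow back into theta; discretization happens afterwards.
prompt_ids = soft.argmax(axis=-1)       # hard token ids for the final prompt
print(soft.shape, prompt_ids.shape)
```

In the paper's procedure, θ would then be updated by Adam (learning rate 0.1) for up to 300 steps against the attack objective; the sketch stops at the relaxation step itself.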