Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

StegoZip: Enhancing Linguistic Steganography Payload in Practice with Large Language Models

Authors: Jun Jiang, Zijin Yang, Weiming Zhang, Nenghai Yu, Kejiang Chen

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	The experimental results in Table 1 demonstrate that the proposed StegoZip framework significantly enhances the original steganographic system, achieving 2.5× the payload of the baselines. This improvement is attributed to the integration of Dynamic Semantic Redundancy Pruning (DSRP) and probability-driven Index Compression Coding (ICC), which collaboratively compresses lexical units in secret messages with high efficiency.
Researcher Affiliation	Academia	Jun Jiang1,2, Zijin Yang1,2, Weiming Zhang1, Nenghai Yu1, Kejiang Chen1,2 1. University of Science and Technology of China, China 2. Anhui Province Key Laboratory of Digital Security, China {jungle0430@mail., chenkj@}ustc.edu.cn
Pseudocode	No	The paper describes the methodology in Section 3 'Methodology' using descriptive text and a diagram (Figure 2), but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Codes: https://github.com/Jungle0430/StegoZip
Open Datasets	Yes	Datasets. Text datasets are used to fine-tune the restorer and generate stego text. For fine-tuning, the IMDb [36] dataset (average length: 1,300 characters) is split into 25,000 training texts and 25,000 test texts, but only 2,000 are randomly sampled for testing. For the AGNews [37] dataset (average length: 241 characters), only the business category is selected and divided into 30,000 training texts and 1,900 test texts. For stego text generation, the WikiText-2-v1 [38] dataset is used.
Dataset Splits	Yes	For fine-tuning, the IMDb [36] dataset (average length: 1,300 characters) is split into 25,000 training texts and 25,000 test texts, but only 2,000 are randomly sampled for testing. For the AGNews [37] dataset (average length: 241 characters), only the business category is selected and divided into 30,000 training texts and 1,900 test texts.
Hardware Specification	Yes	All our experiments are conducted on a hardware platform equipped with an Intel(R) Xeon(R) Gold 6130 CPU operating at 2.10 GHz, 256 GB of RAM, and NVIDIA A6000 GPU cards.
Software Dependencies	No	The paper mentions various LLM models used (e.g., Qwen2.5-7B, DeepSeek-R1-Distill-Llama-8B, LLaMA2-7B) and techniques like LoRA, but it does not provide specific version numbers for underlying software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup	Yes	In this paper, we utilize two widely used LLMs: Qwen2.5-7B [32] and DeepSeek-R1-Distill-Llama-8B (DS-Llama-8B) [33] for index-based coding and restoration tasks, and greedy sampling is employed. To prevent fine-tuning that could compromise the security of the original generative steganography algorithm, we use separate LLaMA2-7B [34] for stego text generation. In this process, random sampling with a temperature of 0.9 is applied, without incorporating top-p or top-k sampling. We use LoRA [35] to fine-tune the base LLMs for two epochs, and more detailed experimental settings are shown in the Appendix C. [...] In the main experiment, the parameters for the proposed Dynamic Semantic Redundancy Pruning are set to α = 0.4 and η = 1.0.
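
The Experiment Setup entry names two decoding modes: greedy sampling for the coding and restoration models, and random sampling at temperature 0.9 with no top-p or top-k truncation for stego text generation. The following is a minimal sketch of what those two modes mean for a single next-token step. It is not taken from the StegoZip code: the function names and the toy logits are hypothetical, and a real steganographic sampler would drive the choice with secret bits rather than an ordinary random number generator.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities, dividing by the temperature first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=0.9, rng=random):
    """Random sampling over the full vocabulary (no top-k or top-p cut)."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def greedy_token(logits):
    """Greedy decoding: always pick the index with the highest logit."""
    return max(range(len(logits)), key=logits.__getitem__)


# Hypothetical logits for a three-token vocabulary.
toy_logits = [0.1, 2.0, -1.0]
greedy_choice = greedy_token(toy_logits)
random_choice = sample_token(toy_logits, temperature=0.9, rng=random.Random(0))
```

A temperature below 1.0 sharpens the distribution slightly, so the most likely token gains probability mass while every token in the vocabulary remains reachable, which is the property that matters when no top-p or top-k filter is applied.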