Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

STEAD: Robust Provably Secure Linguistic Steganography with Diffusion Language Model

Authors: Yuang Qi, Na Zhao, Qiyi Yao, Benlong Wu, Weiming Zhang, Nenghai Yu, Kejiang Chen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	5 Experiments. Detailed experimental settings are provided in Appendix D. Capacity. In Table 1, we have calculated the effective embedding capacity of STEAD under different top-p settings while also displaying the model's average entropy. We compared the embedding capacity with the latest secure robust linguistic steganography method, PSARS [23], which is based on ARM. It can be seen that under the same sampling parameters, the steganographic embedding capacity of STEAD is still significantly higher than that of PSARS (whose secure parameter is set to 32 due to a trade-off between capacity and robustness). Linguistic quality. Figure 5 shows the average perplexity (PPL) values of stegotexts generated by STEAD and covertexts randomly sampled by the same DLM, and it can be seen that under different top-p truncation settings, the PPL of the stegotexts remains consistent with the covertext. Statistical security. We conducted steganalysis tests using various DNN-based steganalyzers. These tests are designed to distinguish between covertext generated from the DLM via random sampling and stegotext generated by STEAD. We generated 1000 pairs of covertext and stegotext with p = 0.9 under two datasets. We adopted three steganalysis methods based on deep learning: FCN [6], R-BiLSTM-C [7], and LSTMATT [38]. Table 2 shows that the detection error rate PE approaches 50%. This indicates that steganalysis methods cannot perform better than random guessing in detecting stegotext generated by STEAD, which demonstrates the security of our stegosystem. Robustness against token-level attacks. We apply random token-level substitution, insertion, and deletion to stegotexts, each with various intensities. The results are shown in Figure 4. Then we define mixed token-level attacks as attacks that simultaneously apply token-level substitution, insertion, and deletion. We define two attack intensities, weak and strong, using the parameter sets (α = 0.01, β = 1, γ = 1) and (α = 0.1, β = 3, γ = 3), respectively, as shown in Figure 6. It can be seen that at the token level, STEAD is more resistant to various attacks than the comparison methods, whether against a single type of attack or mixed attacks. Robustness against realistic attack scenarios. Figure 7 shows the robustness evaluation to more challenging scenarios. There is a semantic synonym substitution attack at word-level based on TextAttack [39]. With a word substitution rate of 0.1, non-robust methods are rendered almost entirely ineffective, whereas STEAD sustains an extraction correction rate above 80%. Ablation study. Our method comprises three key components: a message-driven PRN sampling algorithm, robust position embedding with error correction coding (RPE+ECC), and a neighborhood search extraction (NSE) strategy. The ablation study for these components is presented in Figure 8.
Researcher Affiliation	Academia	Yuang Qi, Na Zhao, Qiyi Yao, Benlong Wu, Weiming Zhang, Nenghai Yu, Kejiang Chen. University of Science and Technology of China; Anhui Province Key Laboratory of Digital Security. {qiyuang@mail., znzhaona@mail., chenkj@}ustc.edu.cn
Pseudocode	Yes	Algorithm 1 EMBED(P, r, m): a message embedding algorithm based on message-driven pseudorandom number sampling. Input: A distribution P, a pseudo-random number r, optional ℓ-bit message m. 1: cumul ← 0 2: if ℓ > 0 then 3: r h r + dec(m) 4: end if 5: for k ∈ {0, 1, . . . , |P|} do 6: cumul ← cumul + P(k) 7: if cumul > r then 8: x ← token corresponding to P(k) 9: break 10: end if 11: end for. Output: sampled token v. Algorithm 2 EXTRACT(P, r, x): a message extracting algorithm based on message-driven pseudorandom number sampling. Input: A distribution P, a pseudo-random number r, a sampled token x. 1: cumul, rleft, rright ← 0 2: for k ∈ {0, 1, . . . , |P|} do 3: cumul ← cumul + P(k) 4: if x corresponds to P(k) then 5: rleft ← cumul − P(k) 6: rright ← cumul 7: break 8: end if 9: end for 10: for m ∈ {0, 1, . . . , 2^ℓ} do 11: if rleft < h r + dec(m) 12: m ← bin(m) ▷ Succeed to extract 13: return m 14: end if 15: end for 16: return m x ▷ Fail to extract. Output: Extracted message m
Open Source Code	Yes	Codes: https://github.com/7-yaya/STEAD
Open Datasets	Yes	We adopt the latest advanced text diffusion model Dream [36] as the stego generator. For the autoregressive model, we choose two popular models, Qwen2.5-7B [41] and Deepseek-7B-base [42], with similar performance to Dream to fairly compare the performance of steganographic methods. We randomly select 200 texts from the IMDb dataset [43] as input for the text generation task.
Dataset Splits	Yes	We randomly select 200 texts from the IMDb dataset [43] as input for the text generation task. Each text is truncated to the first two sentences for context. The model generates 512 tokens for each context as stego or cover under specific sampling settings. We generated 1000 pairs of covertext and stegotext with p = 0.9 under two datasets.
Hardware Specification	No	The paper reports time-related metrics (encoding/decoding rate) and refers to an 'Implementation Details' section in Appendix D, but no such section exists, and Appendix D contains no hardware specifications (GPU/CPU models, memory, etc.).
Software Dependencies	No	The paper mentions using specific models like Dream [36], Qwen2.5-7B [41], and Deepseek-7B-base [42], but does not list any specific software libraries or frameworks with version numbers (e.g., Python, PyTorch, CUDA versions) that were used to implement the methodology.
Experiment Setup	Yes	When a language model generates text by sampling, text quality is influenced by three sampling parameters: temperature, top-p, and top-k. The temperature controls the randomness of the output. The lower the temperature, the more deterministic the text is (which usually leads to better quality); conversely, the higher the temperature, the more diverse it is. Top-p sampling (a.k.a. Nucleus Sampling) dynamically selects the smallest set of words with cumulative probability exceeding the threshold p. Top-k sampling samples from the highest probability k words at each step. The three together adjust the determinism and diversity of the generated text. We test our method under various temperature, top-p, and top-k settings.
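
Illustrative note on the sampling parameters quoted in the Experiment Setup row: the sketch below shows one common way temperature, top-k, and top-p can be combined to shape a next-token distribution. It is a minimal example under stated assumptions, not code from the STEAD repository. The function names (sampling_distribution, sample_token), the order of the filters (temperature, then top-k, then top-p), and the renormalization after top-k are all assumptions made for illustration; the paper excerpt does not specify them.

```python
import math
import random


def sampling_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k, and top-p.

    Hypothetical helper for illustration only; not the authors' implementation.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Temperature scaling followed by a numerically stable softmax.
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most probable tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    mass = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix whose cumulative (renormalized)
    # probability reaches the threshold p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens so the result sums to 1.
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


def sample_token(dist, rng):
    """Draw one token index from the filtered distribution."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up values for a 4-token vocabulary
    dist = sampling_distribution(toy_logits, temperature=1.0, top_p=0.8)
    print(dist)
    print(sample_token(dist, random.Random(0)))
```

With these toy logits, lowering the temperature or the top-p threshold shrinks the set of candidate tokens, which is the determinism versus diversity trade-off the quoted passage describes.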