Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Safety Depth in Large Language Models: A Markov Chain Perspective

Authors: Ching-Chia Kao, Chia-Mu Yu, Chun-Shien Lu, Chu-Song Chen

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To reach this safety depth effectively, we propose a cyclic group augmentation strategy that improves safety scores across six LLMs. In addition, we uncover a critical interaction between safety depth and ensemble width, demonstrating that larger ensembles can offset shallower alignments. These results suggest that test-time computation, often overlooked in safety alignment, can play a key role. [...] Section 5 (Experiments): In this section, we begin by presenting a toy example to validate our theoretical results, then offer illustrative cases using open-source LLMs.
Researcher Affiliation | Academia | (1) Institute of Information Science & Research Center for Information Technology Innovation, Academia Sinica; (2) Department of Computer Science and Information Engineering, National Taiwan University; (3) Department of Electronics and Electrical Engineering, National Yang Ming Chiao Tung University
Pseudocode | Yes | Algorithm 1: Transition Matrix Normalization. Input: initial matrix Q_0, bias matrix B, parameters α, γ, T. Output: updated stochastic matrix Q_T.
Open Source Code | Yes | Answer: [Yes] Justification: We attach the code in the supplementary material. We will release the code publicly upon acceptance.
Open Datasets | Yes | For training, we employed the Malicious Instruct dataset [Huang et al., 2024] of 100 harmful instructions with three data augmentation strategies (shallow, deep, cyclic). Testing was conducted on the HEx-PHI dataset (footnote 4: https://huggingface.co/datasets/LLM-Tuning-Safety/HEx-PHI).
Dataset Splits | Yes | For training, we utilized Malicious Instruct, a dataset containing 100 harmful instructions from [Huang et al., 2024]... Testing was conducted on the HEx-PHI dataset (footnote 4: https://huggingface.co/datasets/LLM-Tuning-Safety/HEx-PHI), which contains 330 harmful instructions spanning 11 prohibited categories.
Hardware Specification | Yes | Computing Environment. A machine with at least one GPU (e.g., NVIDIA Tesla V100 or better).
Software Dependencies | No | To optimize memory efficiency while maintaining model performance, we employed 4-bit precision quantization using the bits-and-bytes library. The quantization configuration utilized the normal-float4 (NF4) format with double quantization to minimize quantization errors while reducing memory requirements. We implemented parameter-efficient fine-tuning using LoRA adapters with a rank of 8 and scaling factor (alpha) of 32. The adapters were applied to key transformer components including query, key, value projections, and feed-forward layers. Training proceeded for 3 epochs with a batch size of 4 and gradient accumulation steps of 4, resulting in an effective batch size of 16. We employed a learning rate of 2e-4 with the 8-bit AdamW optimizer to further optimize memory usage while maintaining training stability.
Experiment Setup | Yes | Training proceeded for 3 epochs with a batch size of 4 and gradient accumulation steps of 4, resulting in an effective batch size of 16. We employed a learning rate of 2e-4 with the 8-bit AdamW optimizer to further optimize memory usage while maintaining training stability.
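
The hyperparameters quoted in the Software Dependencies and Experiment Setup rows can be collected into a small configuration object to check the stated arithmetic. The following is a minimal sketch using only the Python standard library, not the authors' code. The class name FineTuneConfig, its field names, the "adamw_8bit" label, and the optimizer_steps helper are hypothetical and introduced for illustration; only the numeric values come from the quoted text. The step count additionally assumes that a trailing partial batch still triggers an optimizer step, which the source does not state.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hypothetical container for the hyperparameters quoted in the table."""

    quant_bits: int = 4            # 4-bit precision quantization
    quant_format: str = "nf4"      # normal-float4 format
    double_quant: bool = True      # double quantization enabled
    lora_rank: int = 8             # LoRA rank
    lora_alpha: int = 32           # LoRA scaling factor (alpha)
    epochs: int = 3
    batch_size: int = 4            # per-step (micro) batch size
    grad_accum_steps: int = 4      # gradient accumulation steps
    learning_rate: float = 2e-4
    optimizer: str = "adamw_8bit"  # our own label for the 8-bit AdamW optimizer

    @property
    def effective_batch_size(self) -> int:
        # Examples contributing to one optimizer update.
        return self.batch_size * self.grad_accum_steps

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA multiplies the low-rank update by alpha / rank.
        return self.lora_alpha / self.lora_rank

    def optimizer_steps(self, num_examples: int) -> int:
        # Assumption: a trailing partial batch still produces an update.
        per_epoch = math.ceil(num_examples / self.effective_batch_size)
        return per_epoch * self.epochs


cfg = FineTuneConfig()
print(cfg.effective_batch_size)  # 16, matching the quoted effective batch size
print(cfg.lora_scaling)          # 4.0
```

The training-set size after augmentation is not given in the quoted text, so optimizer_steps takes it as an argument rather than hard-coding a value.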