Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

AegisGuard: RL-Guided Adapter Tuning for TEE-Based Efficient & Secure On-Device Inference

Authors: CHE WANG, Ziqi Zhang, Yinggui Wang, Tiantong Wang, Yurong Hao, Jianbo Gao, Tao Wei, Yang Cao, Zhong Chen, Wei Yang Bryan Lim

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments demonstrate that AegisGuard achieves black-box level MS resilience (surrogate accuracy around 39%, matching fully shielded baselines), while reducing end-to-end inference latency by 2-3× and cutting TEE memory usage by 4× compared to state-of-the-art TSDP methods.
Researcher Affiliation	Collaboration	Che Wang1,2, Ziqi Zhang3, Yinggui Wang5, Tiantong Wang2, Yurong Hao2, Jianbo Gao4, Tao Wei5, Yang Cao6, Zhong Chen1, Wei Yang Bryan Lim2. School of Computer Science, Peking University1; Nanyang Technological University2; University of Illinois Urbana-Champaign3; Beijing Jiaotong University4; Ant Group5; Institute of Science Tokyo6
Pseudocode	Yes	A Pseudocode algorithm of AegisGuard. Algorithm 1 AegisGuard: RL-based Sensitivity Measurement & Shielded Adapter Compression
Open Source Code	No	All artifacts will be released through a public GitHub repository upon paper acceptance, ensuring full transparency and replicability of our research.
Open Datasets	Yes	Models and Datasets. We select large models from different domains and sizes, including generative models (OPT-2.7B [44], LLaMA-7B [35]) and vision transformers (ViT-Base, ViT-Large-14 [42]). We employ LoRA [15] for parameter-efficient fine-tuning. For the dataset, we use CommonSenseQA [16] to fine-tune generative models. We evaluate the model performance using six popular question-answering benchmarks: ARC-Challenge and ARC-Easy [7], HellaSwag [41], OBQA [21], PIQA [5], and WinoGrande [29]. For ViT models, we use six diverse datasets: CIFAR10, CIFAR100 [18], UTKFace [46], MNIST [8], GTSRB [32], and SUN397 [38].
Dataset Splits	Yes	Model Stealing Attack Setup. Model stealing attacks aim to extract a surrogate model that replicates the functionality of the victim model. To simulate this, we adopt the same architecture as the victim model and fine-tune it using a limited query-based training dataset, which consists of approximately 1% of the full dataset. This setting has been shown to be realistic and practical in real-world scenarios [47].
Hardware Specification	Yes	Implementation. We implement all code in PyTorch 2.5.1. The fine-tuning is conducted on a server with one NVIDIA A6000 GPU. For inference, we follow existing work [47] to build a prototype framework on a PC with an Intel SGX enclave (SDK 2.6, GCC 7.5), and an NVIDIA RTX4090D 24GB GPU to evaluate the on-device LM performance.
Software Dependencies	Yes	Implementation. We implement all code in PyTorch 2.5.1. The fine-tuning is conducted on a server with one NVIDIA A6000 GPU. For inference, we follow existing work [47] to build a prototype framework on a PC with an Intel SGX enclave (SDK 2.6, GCC 7.5), and an NVIDIA RTX4090D 24GB GPU to evaluate the on-device LM performance.
Experiment Setup	Yes	C.1 Parameter Efficient Fine Tuning. During adapter fine-tuning for general tasks, we primarily apply LoRA adapters to the multi-head attention and feedforward layers, specifically to the query, key, value, and dense components. The exact placement of LoRA modules depends on the model's performance on task-specific datasets. Our goal is to minimize the number of trainable parameters while maintaining competitive accuracy. For instance, in relatively simple tasks such as CIFAR-10, LoRA is only applied to the query and value projections. In contrast, for more challenging NLP tasks, we insert LoRA adapters into the query, key, value, and dense components of each layer. The rank hyperparameter of LoRA is chosen from {16, 32, 64}, depending on the size of the model and the dataset. For dynamic pruning, we set the total pruning ratio to 20% and 50% to balance performance and efficiency. The pruning frequency is set to 20 steps, and the warm-up ratio is set to 0.1. We use micro-batch sizes of 8 and 32 and gradient accumulation steps of 8 and 1 for LLaMA and ViT, respectively. The number of training epochs for NLP tasks is set in the range [2, 5]. For vision tasks, the number of epochs is set from 6 to 20, depending on the task complexity. We employ AdamW as the optimizer, experimenting with a range of learning rates: [2e-5, 4e-5, 2e-4, 3e-4].
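
The fine-tuning settings quoted in the Experiment Setup row can be collected into a small configuration object, which also makes the implied effective batch size explicit. The following is a minimal sketch, not the authors' code: the class name `FineTuneConfig`, its field names, and the helper `candidate_settings` are hypothetical, and the values are copied from the quoted text under the reading that "8, 32" and "8, 1" map to LLaMA and ViT in that order. The source does not say whether every rank and learning-rate combination was tried, so the enumeration below only lists the candidates.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class FineTuneConfig:
    """Hypothetical container for the settings quoted in the Experiment Setup row."""

    model_family: str
    micro_batch_size: int
    grad_accum_steps: int
    # Candidate values listed in the quoted text; which one is used per task is not stated.
    lora_rank_choices: tuple = (16, 32, 64)
    learning_rates: tuple = (2e-5, 4e-5, 2e-4, 3e-4)
    pruning_ratios: tuple = (0.2, 0.5)
    pruning_frequency_steps: int = 20
    warmup_ratio: float = 0.1

    def effective_batch_size(self) -> int:
        # Samples contributing to one optimizer step on a single device.
        return self.micro_batch_size * self.grad_accum_steps


def candidate_settings(cfg: FineTuneConfig) -> list:
    """List every (rank, learning rate) pair; an assumption, not a reported search."""
    return list(product(cfg.lora_rank_choices, cfg.learning_rates))


# Assumed mapping of the "respectively" clause: LLaMA -> (8, 8), ViT -> (32, 1).
LLAMA = FineTuneConfig("LLaMA", micro_batch_size=8, grad_accum_steps=8)
VIT = FineTuneConfig("ViT", micro_batch_size=32, grad_accum_steps=1)

if __name__ == "__main__":
    for cfg in (LLAMA, VIT):
        print(cfg.model_family, cfg.effective_batch_size(), len(candidate_settings(cfg)))
```

Under this reading, gradient accumulation gives LLaMA a larger effective batch than its micro-batch size alone suggests, while ViT takes an optimizer step after every micro-batch.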