Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Iron: Private Inference on Transformers
Authors: Meng Hao, Hongwei Li, Hanxiao Chen, Pengzhi Xing, Guowen Xu, Tianwei Zhang
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments conducted on several real-world datasets and models demonstrate that Iron achieves 3∼14× less communication and 3∼11× less runtime compared to the prior art. |
| Researcher Affiliation | Academia | Meng Hao¹, Hongwei Li¹, Hanxiao Chen¹, Pengzhi Xing¹, Guowen Xu², Tianwei Zhang²; ¹University of Electronic Science and Technology of China, ²Nanyang Technological University |
| Pseudocode | Yes | Algorithm 1 Secure Matrix Multiplication Protocol; Algorithm 2 Secure Softmax Protocol; Algorithm 3 Secure GELU Protocol; Algorithm 4 Secure Layer Norm Protocol. |
| Open Source Code | No | The paper states 'Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No]' in its ethics checklist. While it mentions building on existing libraries, it does not provide the code for Iron itself. |
| Open Datasets | Yes | We train the models for four NLP tasks over the datasets of the Stanford Sentiment Treebank (SST-2), the Microsoft Research Paraphrase Corpus (MRPC), the Multi-Genre Natural Language Inference Corpus (MNLI) and the Stanford Question Answering Dataset (QNLI) from GLUE benchmarks [18]. |
| Dataset Splits | Yes | We train the models for four NLP tasks over the datasets of the Stanford Sentiment Treebank (SST-2), the Microsoft Research Paraphrase Corpus (MRPC), the Multi-Genre Natural Language Inference Corpus (MNLI) and the Stanford Question Answering Dataset (QNLI) from GLUE benchmarks [18]. The GLUE benchmarks are standard datasets with predefined train, validation, and test splits. |
| Hardware Specification | Yes | All the following experiments are performed on AWS c5.9xlarge instances with Intel Xeon 8000 series CPUs at 3.6GHz. |
| Software Dependencies | No | Iron is built on top of the SEAL library [32] and the EMP toolkit [33] in C++. We also use the EzPC framework [34]. No specific version numbers are provided for SEAL, EMP toolkit, or EzPC. |
| Experiment Setup | Yes | These models are parameterized by three hyper-parameters: the number of blocks, the dimension of representations and the number of input tokens (refer to Appendix A.4.1 for the hyper-parameters of these models). |