Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
EncryptedLLM: Privacy-Preserving Large Language Model Inference via GPU-Accelerated Fully Homomorphic Encryption
Authors: Leo De Castro, Daniel Escudero, Adya Agrawal, Antigoni Polychroniadou, Manuela Veloso
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach achieves runtimes that are over 200× faster than the CPU baseline. We also present novel and extensive experimental analysis of approximations of LLM activation functions to maintain accuracy while achieving this performance. ... 4. Experimental Results |
| Researcher Affiliation | Industry | 1J.P. Morgan Chase Cybersecurity & Technology Controls, New York, New York, USA 2J.P. Morgan AI Research & AlgoCRYPT CoE, New York, New York, USA. Correspondence to: Leo de Castro <EMAIL>. |
| Pseudocode | No | The paper describes algorithms like Newton's iterative method and Goldschmidt algorithm in text format within sections 3.3 and 3.4, but does not present them as a clearly labeled or structured pseudocode/algorithm block. |
| Open Source Code | Yes | We have open-sourced the code of our OpenFHE+GPU extension³, which we believe will be of independent interest. ... ³https://github.com/leodec/openfhe-gpu-public ... for reproducibility we also open source our modified Hugging Face GPT-2 implementation. |
| Open Datasets | Yes | We focus specifically on the GPT-2 architecture by OpenAI, which is fully open-source... We are able to leverage the Language Model Evaluation Harness library (https://github.com/EleutherAI/lm-evaluation-harness), which includes multiple benchmarks to evaluate LLM performance. Our accuracy benchmarks appear in table 1, where we measure the performance of our modifications with respect to the baseline GPT-2 Small, GPT-2 Medium and GPT-2 Large models. We run eight diverse tasks: HellaSwag, ARC (Easy), PIQA, Social IQa, MNLI, SST-2, ANLI, and WiC. |
| Dataset Splits | No | The paper mentions using the Language Model Evaluation Harness library for evaluation on several tasks, which typically use predefined splits. However, it does not explicitly provide specific dataset split information (percentages, sample counts, or detailed splitting methodology) for these benchmarks within the text. |
| Hardware Specification | Yes | This machine has an Intel Xeon chip running at 2.4 GHz and 2 TB of RAM as well as an NVIDIA A100 80GB PCIe. ... The CPU benchmarks were run on a machine with an Intel Xeon chip running at 2.4 GHz and 2 TB of RAM. The GPU benchmarks were run on the same machine and used an NVIDIA A100 80GB PCIe. |
| Software Dependencies | No | We extend the capabilities of OpenFHE by enabling a GPU-based workflow... We modify the GPT-2 implementation from Hugging Face's transformers library... and thoroughly benchmark the resulting accuracy using the LM evaluation harness library... The paper mentions software like 'OpenFHE', 'Hugging Face transformers library', and 'LM evaluation harness library' but does not provide specific version numbers for these components. |
| Experiment Setup | Yes | We use degree 2 for the f and g polynomials in the comparison from Section 3.1, and we compose them 2 times each. ... We use 16 or 18 Newton iterations depending on model size as shown in table 2. ... For the approximation of exp we use r = 7, and for Goldschmidt algorithm used for the division we use 14/18/22 iterations based on model size as shown in table 2. |
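The Pseudocode and Experiment Setup rows reference Newton's method, the Goldschmidt division algorithm, and a degree-controlled exp approximation — iterative schemes built only from additions and multiplications, which is what makes them usable under CKKS-style FHE. The plaintext Python sketch below illustrates the general shape of these iterations; the function names, initial guess, normalization ranges, and the specific exp formula `(1 + x/2^r)^(2^r)` are illustrative assumptions, not taken from the paper's code.

```python
# Hedged sketch of FHE-friendly iterative primitives of the kind the paper
# describes (Sections 3.3-3.4). All details below are illustrative.

def newton_reciprocal(b, iters=16, x0=0.01):
    # Newton's iteration for 1/b: x <- x * (2 - b*x).
    # Converges quadratically when 0 < b*x0 < 2; the paper reports using
    # 16 or 18 iterations depending on model size.
    x = x0
    for _ in range(iters):
        x = x * (2.0 - b * x)
    return x

def goldschmidt_divide(a, b, iters=14):
    # Goldschmidt division a/b for b normalized into (0, 2):
    # multiplying numerator and denominator by f = 2 - d each round drives
    # the denominator to 1, so the numerator converges to a/b.
    n, d = a, b
    for _ in range(iters):
        f = 2.0 - d
        n, d = n * f, d * f
    return n

def approx_exp(x, r=7):
    # exp(x) ~ (1 + x / 2^r)^(2^r), evaluated with r repeated squarings.
    # Uses only add/multiply, so it maps onto homomorphic operations.
    y = 1.0 + x / (1 << r)
    for _ in range(r):
        y = y * y
    return y
```

In plaintext these converge rapidly (e.g. `goldschmidt_divide(3.0, 1.5)` approaches 2.0); under FHE the iteration counts in the table (14/18/22 for division, 16/18 for Newton, r = 7 for exp) trade multiplicative depth against accuracy, which is why they scale with model size.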