Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
EncryptedLLM: Privacy-Preserving Large Language Model Inference via GPU-Accelerated Fully Homomorphic Encryption
Authors: Leo De Castro, Daniel Escudero, Adya Agrawal, Antigoni Polychroniadou, Manuela Veloso
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach achieves runtimes that are over 200× faster than the CPU baseline. We also present novel and extensive experimental analysis of approximations of LLM activation functions to maintain accuracy while achieving this performance. ... 4. Experimental Results |
| Researcher Affiliation | Industry | 1J.P. Morgan Chase Cybersecurity & Technology Controls, New York, New York, USA 2J.P. Morgan AI Research & AlgoCRYPT CoE, New York, New York, USA. Correspondence to: Leo de Castro <EMAIL>. |
| Pseudocode | No | The paper describes algorithms like Newton's iterative method and Goldschmidt algorithm in text format within sections 3.3 and 3.4, but does not present them as a clearly labeled or structured pseudocode/algorithm block. |
| Open Source Code | Yes | We have open-sourced the code of our OpenFHE+GPU extension³, which we believe will be of independent interest. ... ³https://github.com/leodec/openfhe-gpu-public ... for reproducibility we also open source our modified Hugging Face GPT-2 implementation. |
| Open Datasets | Yes | We focus specifically on the GPT-2 architecture by OpenAI, which is fully open-source... We are able to leverage the Language Model Evaluation Harness library (https://github.com/EleutherAI/lm-evaluation-harness), which includes multiple benchmarks to evaluate LLM performance. Our accuracy benchmarks appear in table 1, where we measure the performance of our modifications with respect to the baseline GPT-2 Small, GPT-2 Medium and GPT-2 Large models. We run eight diverse tasks: HellaSwag, ARC (Easy), PIQA, Social IQa, MNLI, SST-2, ANLI, and WiC. |
| Dataset Splits | No | The paper mentions using the Language Model Evaluation Harness library for evaluation on several tasks, which typically use predefined splits. However, it does not explicitly provide specific dataset split information (percentages, sample counts, or detailed splitting methodology) for these benchmarks within the text. |
| Hardware Specification | Yes | This machine has an Intel Xeon chip running at 2.4 GHz and 2 TB of RAM as well as an NVIDIA A100 80GB PCIe. ... The CPU benchmarks were run on a machine with an Intel Xeon chip running at 2.4 GHz and 2 TB of RAM. The GPU benchmarks were run on the same machine and used an NVIDIA A100 80GB PCIe. |
| Software Dependencies | No | We extend the capabilities of OpenFHE by enabling a GPU-based workflow... We modify the GPT-2 implementation from Hugging Face's transformers library... and thoroughly benchmark the resulting accuracy using the LM evaluation harness library... The paper mentions software like 'OpenFHE', 'Hugging Face transformers library', and 'LM evaluation harness library' but does not provide specific version numbers for these components. |
| Experiment Setup | Yes | We use degree 2 for the f and g polynomials in the comparison from Section 3.1, and we compose them 2 times each. ... We use 16 or 18 Newton iterations depending on model size as shown in table 2. ... For the approximation of exp we use r = 7, and for Goldschmidt algorithm used for the division we use 14/18/22 iterations based on model size as shown in table 2. |
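The Pseudocode and Experiment Setup rows reference Newton's method, the Goldschmidt division algorithm, and a degree-controlled exp approximation — iterative schemes built only from additions and multiplications, which is what makes them usable under CKKS-style FHE. The plaintext Python sketch below illustrates the general shape of these iterations; the function names, initial guess, normalization ranges, and the specific exp formula `(1 + x/2^r)^(2^r)` are illustrative assumptions, not taken from the paper's code.

```python
# Hedged sketch of FHE-friendly iterative primitives of the kind the paper
# describes (Sections 3.3-3.4). All details below are illustrative.

def newton_reciprocal(b, iters=16, x0=0.01):
    # Newton's iteration for 1/b: x <- x * (2 - b*x).
    # Converges quadratically when 0 < b*x0 < 2; the paper reports using
    # 16 or 18 iterations depending on model size.
    x = x0
    for _ in range(iters):
        x = x * (2.0 - b * x)
    return x

def goldschmidt_divide(a, b, iters=14):
    # Goldschmidt division a/b for b normalized into (0, 2):
    # multiplying numerator and denominator by f = 2 - d each round drives
    # the denominator to 1, so the numerator converges to a/b.
    n, d = a, b
    for _ in range(iters):
        f = 2.0 - d
        n, d = n * f, d * f
    return n

def approx_exp(x, r=7):
    # exp(x) ~ (1 + x / 2^r)^(2^r), evaluated with r repeated squarings.
    # Uses only add/multiply, so it maps onto homomorphic operations.
    y = 1.0 + x / (1 << r)
    for _ in range(r):
        y = y * y
    return y
```

In plaintext these converge rapidly (e.g. `goldschmidt_divide(3.0, 1.5)` approaches 2.0); under FHE the iteration counts in the table (14/18/22 for division, 16/18 for Newton, r = 7 for exp) trade multiplicative depth against accuracy, which is why they scale with model size.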