Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Analog Foundation Models

Authors: Julian Büchel, Iason Chalas, Giovanni Acampa, An Chen, Omobayode Fagbohungbe, Hsinyu Tsai, Kaoutar El Maghraoui, Manuel Le Gallo, Abbas Rahimi, Abu Sebastian

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our approach enables state-of-the-art models including Phi-3-mini-4k-instruct and Llama-3.2-1B-Instruct to retain performance comparable to 4-bit weight, 8-bit activation baselines, despite the presence of analog noise and quantization constraints. Additionally, we show that as a byproduct of our training methodology, analog foundation models can be quantized for inference on low-precision digital hardware. Finally, we show that our models also benefit from test-time compute scaling, showing better scaling behavior than models trained with 4-bit weight and 8-bit static input quantization. Our work bridges the gap between high-capacity LLMs and efficient analog hardware, offering a path toward energy-efficient foundation models. Code is available at https://github.com/IBM/analog-foundation-models.
Researcher Affiliation	Collaboration	Julian Büchel (1,2), Iason Chalas (1,2), Giovanni Acampa (1,2), An Chen (3), Omobayode Fagbohungbe (4), Sidney Tsai (3), Kaoutar El Maghraoui (4), Manuel Le Gallo (1), Abbas Rahimi (1), Abu Sebastian (1). Affiliations: (1) IBM Research Zurich, (2) ETH Zürich, (3) IBM Research Almaden, (4) IBM Thomas J. Watson Research Center.
Pseudocode	No	The paper includes illustrations (Figure 2 and Figure 11) depicting the training pipeline and deployment process, but it does not contain any structured pseudocode or algorithm blocks.
Open Source Code	Yes	Code is available at https://github.com/IBM/analog-foundation-models.
Open Datasets	Yes	For training our analog foundation models, we use 20B tokens which are synthetically generated using vLLM [61]. After data generation, we train our models on 96 V100 GPUs. Because of the V100's limited DRAM capacity, we use DeepSpeed ZeRO stage 2, which includes gradient and optimizer state partitioning. We also use activation checkpointing and CPU offloading to further reduce memory consumption. For both models, we train with a maximum sequence length of 4096, which is also the chunk size used during data generation. Training of the Phi-3-mini-4k-instruct-based analog foundation model takes about 230h, while training the smaller Llama-3.2-1B-Instruct-based models takes about 90h. When using GPUs with more memory, the time and number of required GPUs reduce drastically as training is more efficient. For example, training a Phi-3-mini-4k-instruct-based model on 8 A100s takes the same time as training it on 48 V100s.
Dataset Splits	No	For training our analog foundation models, we use 20B tokens which are synthetically generated using vLLM [61]. The paper mentions the total number of synthetically generated tokens used for training (20B) and the sequence length (4096). It also evaluates on various benchmarks like MMLU (5-shot), GSM8K (CoT 8-shot), etc., and Table 14 lists the number of test samples for these benchmarks. However, it does not explicitly specify how the 20B training tokens were split into training, validation, or test sets for its own model training process.
Hardware Specification	Yes	For training our analog foundation models, we use 20B tokens which are synthetically generated using vLLM [61]. After data generation, we train our models on 96 V100 GPUs. Because of the V100's limited DRAM capacity, we use DeepSpeed ZeRO stage 2, which includes gradient and optimizer state partitioning. We also use activation checkpointing and CPU offloading to further reduce memory consumption. For both models, we train with a maximum sequence length of 4096, which is also the chunk size used during data generation. Training of the Phi-3-mini-4k-instruct-based analog foundation model takes about 230h, while training the smaller Llama-3.2-1B-Instruct-based models takes about 90h. When using GPUs with more memory, the time and number of required GPUs reduce drastically as training is more efficient. For example, training a Phi-3-mini-4k-instruct-based model on 8 A100s takes the same time as training it on 48 V100s.
Software Dependencies	No	We used AIHWKIT-Lightning [59], an open-source toolkit developed for scalable HWA training based on PyTorch [60]. For both models, we used the AdamW [81] optimizer. While these software components are mentioned, specific version numbers for PyTorch or AIHWKIT-Lightning are not provided.
Experiment Setup	Yes	Generic training: During our training runs, we used the AdamW [81] optimizer with β1 = 0.9, β2 = 0.98, and ϵ = 1.0e-06 for both Phi-3-mini-4k-instruct and Llama-3.2-1B-Instruct base models. We employed distillation (with beta=1.0) using a temperature of 2.0 for Phi-3 and 1.0 for Llama. Both models were trained for 2 epochs with a batch size of 96, polynomial learning rate scheduler, warmup ratio of 0.016, and a maximum gradient norm of 1.0. We applied gradient checkpointing and set weight decay to 0.01. The learning rates differed, with 1.0e-06 for Phi-3 and 5.0e-07 for Llama. The learning rate was multiplied by the number of GPUs used (96 for the main experiments). Hardware-aware training: We used AIHWKIT-Lightning [59] for HWA training. For both models, we enabled input range learning (decay=0.01, input_min_percentage=0.95) with init_value of 3.0, though the init_std_alpha was 15.0 for Phi-3 and 18.0 for Llama. Interestingly, we found that during the initial 500 training steps, outliers need to be kept almost completely, which is ensured by calibrating the input ranges from data with 15 or 18 times the standard deviation of the activations. After 500 batches, input range learning takes over and input ranges start to tighten due to the gradients, but mostly because of the decay. We found this to be crucial for getting good performance with static input ranges. We used additive Gaussian noise injection with a magnitude (modifier.std_dev) of 0.02 for Phi-3 and 0.03 for Llama. Forward input (inp_res) and output (out_res) resolution was set to 254 for both models, with output bounds (out_bound) of 12 for Phi-3 and 14 for Llama.
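
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch, which makes the learning-rate scaling and the input-range calibration easier to check by hand. This is an illustrative summary, not the authors' code or the AIHWKIT-Lightning API: the class `HWATrainingConfig`, its field names, and both helper methods are hypothetical, and `initial_input_range` assumes that "calibrating with 15 or 18 times the standard deviation" means a simple product of `init_std_alpha` and the activation standard deviation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HWATrainingConfig:
    """Values quoted in the Experiment Setup row; names are illustrative only."""

    base_lr: float              # learning rate before scaling by GPU count
    distill_temperature: float  # distillation temperature
    init_std_alpha: float       # std multiples used to calibrate input ranges
    noise_std: float            # additive Gaussian noise magnitude
    out_bound: float            # output bound
    num_gpus: int = 96          # GPUs used in the main experiments
    batch_size: int = 96
    epochs: int = 2
    warmup_ratio: float = 0.016
    weight_decay: float = 0.01
    max_grad_norm: float = 1.0
    adam_betas: tuple = (0.9, 0.98)
    adam_eps: float = 1.0e-06
    inp_res: int = 254
    out_res: int = 254

    def effective_lr(self) -> float:
        # The quoted text states the learning rate was multiplied by the GPU count.
        return self.base_lr * self.num_gpus

    def initial_input_range(self, activation_std: float) -> float:
        # Assumption: the calibrated range is init_std_alpha times the activation std.
        return self.init_std_alpha * activation_std


PHI3 = HWATrainingConfig(
    base_lr=1.0e-06, distill_temperature=2.0,
    init_std_alpha=15.0, noise_std=0.02, out_bound=12.0,
)
LLAMA = HWATrainingConfig(
    base_lr=5.0e-07, distill_temperature=1.0,
    init_std_alpha=18.0, noise_std=0.03, out_bound=14.0,
)

if __name__ == "__main__":
    for name, cfg in (("Phi-3", PHI3), ("Llama", LLAMA)):
        print(f"{name}: effective lr = {cfg.effective_lr():.2e}")
```

Under these assumptions, the scaled learning rates for the 96-GPU runs come out to roughly 9.6e-05 for Phi-3 and 4.8e-05 for Llama.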