Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Memory-Efficient Training with In-Place FFT Implementation

Authors: Xinyu Ding, Bangtian Liu, Siyu Liao, Zhongfeng Wang

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments on multiple natural language understanding tasks demonstrate the method effectiveness in reducing training memory cost, offering a promising direction for frequency-domain lightweight adaptation. To evaluate the memory efficiency of our proposed in-place training, we conduct experiments in two settings: (1) single-layer analysis: we perform training on a singular linear layer with different training methods... (2) full-model training: we apply the circulant fine-tuning approach [10] to RoBERTa-large and LLaMA2-7B and monitor memory usage throughout training.
Researcher Affiliation Academia School of Integrated Circuits, Sun Yat-sen University
Pseudocode No The paper describes the algorithm and method in prose and equations within Section 4, 'Our Method', and lists steps, but does not present them in a formally labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code Yes We will release anonymized code and scripts in the supplementary material to reproduce the main experimental results. The provided package includes our in-place FFT implementation, baseline methods, model loading procedures (e.g., for LLaMA-7B and RoBERTa-Large), memory profiling scripts, and all commands needed to reproduce the bar charts and analyses reported in the paper. Instructions on environment setup and dependencies are also included for reproducibility.
Open Datasets Yes For LLaMA2-7B, we use the GSM8K dataset with per_device_train_batch_size set to 2 and gradient_accumulation_steps set to 4. For RoBERTa-large, we use the MRPC dataset with a batch size of 32.
Dataset Splits Yes For LLaMA2-7B, we use the GSM8K dataset with per_device_train_batch_size set to 2 and gradient_accumulation_steps set to 4. For RoBERTa-large, we use the MRPC dataset with a batch size of 32. These configurations follow the standard precision training setups for each task [10].
Hardware Specification Yes To isolate the memory overhead introduced by different FFT implementations, we conduct controlled experiments on a single fine-tuned layer using an NVIDIA A100 GPU. We conduct full-model experiments on both LLaMA2-7B and RoBERTa-large using an NVIDIA A100 GPU. Runtime (RT, in ms) is measured on an A800 GPU with FP32 precision, averaged over 1000 runs. Token-level throughput (Thr., in k tokens/sec) is measured on LLaMA-2-7B using the GSM8K dataset with one A800 GPU.
Software Dependencies No We conduct full-model experiments on both LLaMA2-7B and RoBERTa-large using an NVIDIA A100 GPU. All our experiments compare three different FFT implementations: (1) fft: standard complex-valued FFT using torch.fft.fft/ifft from PyTorch [25]; (2) rfft: real-input FFT using torch.fft.rfft/irfft, exploiting Hermitian symmetry; (3) ours: custom CUDA-based real-domain FFT with in-place forward/backward implementation, reusing the input real-valued memory for intermediate result storage. The paper mentions PyTorch but does not provide specific version numbers for PyTorch or CUDA, which are key software components.
Experiment Setup Yes For LLaMA2-7B, we use the GSM8K dataset with per_device_train_batch_size set to 2 and gradient_accumulation_steps set to 4. For RoBERTa-large, we use the MRPC dataset with a batch size of 32. We use stochastic gradient descent (SGD) as the optimizer in all experiments.
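
The Software Dependencies entry contrasts a complex-valued FFT with a real-input FFT that exploits Hermitian symmetry. The sketch below is only an illustration of that idea, not the authors' CUDA kernel or their in-place scheme: it uses a naive pure-Python DFT as a stand-in for torch.fft.fft and torch.fft.rfft, and all function names (dft, rdft, irdft, circulant_matvec_direct, circulant_matvec_rdft) are hypothetical. It also assumes that a circulant layer is defined by the first column of its weight matrix, which the summary above does not specify. It shows that a real input needs only n//2 + 1 frequency bins, and that a circulant matrix-vector product can be computed from those bins alone.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT; a stand-in for a full complex-valued FFT."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT matching dft() above."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def rdft(x):
    """Real-input transform: keep only the first n//2 + 1 bins.

    For real x the remaining bins are complex conjugates of these
    (Hermitian symmetry), so they carry no extra information.
    """
    return dft(x)[: len(x) // 2 + 1]


def irdft(half, n):
    """Rebuild the full spectrum from the kept bins, then invert."""
    full = list(half) + [half[n - k].conjugate() for k in range(len(half), n)]
    return [value.real for value in idft(full)]


def circulant_matvec_direct(c, x):
    """Reference product y = C x, where C is circulant with first column c (assumed layout)."""
    n = len(c)
    return [sum(c[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]


def circulant_matvec_rdft(c, x):
    """Same product computed in the frequency domain using only the kept bins."""
    n = len(c)
    product = [a * b for a, b in zip(rdft(c), rdft(x))]
    return irdft(product, n)


if __name__ == "__main__":
    weights = [1.0, 2.0, 0.0, -1.0, 0.5, 3.0, -2.0, 0.25]
    inputs = [0.5, -1.0, 2.0, 4.0, 0.0, 1.5, -3.0, 1.0]
    print(len(dft(inputs)), "complex bins vs", len(rdft(inputs)), "kept bins")
    print(circulant_matvec_direct(weights, inputs))
    print(circulant_matvec_rdft(weights, inputs))
```

The halved spectrum is what makes the real-input variant cheaper to store than the complex one. The in-place approach described in the paper goes further by reusing the input buffer for intermediate results, which this sketch does not attempt to reproduce.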