Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
PROFIT: A Specialized Optimizer for Deep Fine Tuning
Authors: Anirudh Chakravarthy, Shuai Zheng, Xin Huang, Sachithra Hemachandra, Xiao Zhang, Yuning Chai, Zhao Chen
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 Experiments We now detail a number of experiments for PROFIT in diverse settings: image classification, visual task adaptation, and large-scale motion prediction. We primarily focus on comparisons to standard fine-tuning with commonly used optimizers (on either the full model or just the model head), as those are by a large margin still the most commonly used fine-tuning methods in the industry due to their known performance and ease of implementation. We will show that PROFIT, while easy to implement, provides a significant performance boost in all cases. |
| Researcher Affiliation | Industry | Anirudh S Chakravarthy, Shuai Kyle Zheng, Xin Huang, Sachithra Hemachandra, Xiao Zhang, Yuning Chai, Zhao Chen; GM Cruise LLC |
| Pseudocode | Yes | Algorithm 1 PROFIT: A fine-tuning optimizer. Require: Converged model M(x; θ) with trainable weights θ and input x, to be trained on data from similar domain X with loss L. Require: Initialize reference model weights θref. Require: Initialize batch size B, reference steps nref, training steps nsteps, standard optimizer O with learning rate λmain, and reference optimizer O(ref) with learning rate λref. Each optimizer takes as arguments the current weights and a gradient update direction, producing updated weight values. 1: for nsteps steps do 2: θref ← θ (save the model state) 3: for nref steps do 4: Take new B examples from X and calculate gradients g := ∇θL 5: Take one step with reference optimizer: θ ← O(ref)(θ, g) 6: end for 7: Calculate Δ = θ − θref (total displacement during reference steps) 8: Find g := ∇θL for a new batch as in Line 4 9: Calculate dot product ω = ⟨Δ, g⟩ 10: if ω < 0 then 11: g ← g ⊥ Δ (a ⊥ b denotes orthogonalizing a with respect to b) 12: end if 13: θ ← θref (restore original state) 14: θ ← O(θ, g) (take step with main optimizer) 15: end for |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: The submission provided sufficient details for others to reproduce the algorithm, but the code is not immediately released. |
| Open Datasets | Yes | 4.2 Image Classification: Next, we demonstrate the effectiveness of PROFIT for image classification. CIFAR10 and CIFAR100 (Krizhevsky and Hinton [2009]) are the de facto benchmarks for image classification. [...] 4.3 Visual Task Adaptation Benchmark: The VTAB-1K (Zhai et al. [2019]) dataset is a popular representation learning benchmark [...] 4.4 Multimodal Vision-Language Models (VLM): [...] DriveLM (Sima et al. [2023]) is a visual question-answering (VQA) benchmark [...] 4.5 Large-Scale Robotics Motion Prediction: We evaluate PROFIT on the Waymo Open Motion Dataset (WOMD) (Ettinger et al. [2021]), a large-scale driving dataset. |
| Dataset Splits | Yes | Our toy example ground truth is the function f(x) = sin(10‖x‖) with input in ℝ². This function was chosen because of its extreme nonlinearity and difficulty in fitting by standard neural networks. To further increase the challenge, for the training data, normal noise of size N(0, 1) is added, while no noise is added to the test data. The original dataset consists of 50000 points with both dimensions between -1 and 1, while the fine-tune dataset consists of 50000 points with both dimensions between 0.8 and 1.5. [...] C.5.1 Implementation Details [...] Each dataset contains 800 training examples and 200 validation examples. |
| Hardware Specification | Yes | We use 4 Tesla T4 GPUs for all our experiments and quantify the increase in GPU memory consumption and train time by PROFIT in Table 11. [...] We use 8 H100 GPUs for experiments. |
| Software Dependencies | No | RMSProp is used with default PyTorch hyperparameters (α = 0.99, ϵ = 1e-8) and the learning rate 1e-2 for fitting to the original distribution, with a learning rate decay of 0.9 every 500 steps. After fitting the original distribution, we fine-tune to the new distribution for 1500 steps at a learning rate of 5e-4, with a decay factor of 0.95 every 100 steps. PROFIT is run with nref = 1. (C.1). The paper mentions PyTorch and ChatGPT, but does not provide specific version numbers for these software components, which is required for a reproducible description of ancillary software. |
| Experiment Setup | Yes | RMSProp is used with default PyTorch hyperparameters (α = 0.99, ϵ = 1e-8) and the learning rate 1e-2 for fitting to the original distribution, with a learning rate decay of 0.9 every 500 steps. After fitting the original distribution, we fine-tune to the new distribution for 1500 steps at a learning rate of 5e-4, with a decay factor of 0.95 every 100 steps. PROFIT is run with nref = 1. |
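
The control flow of Algorithm 1 (the Pseudocode row above) can be illustrated with a minimal pure-Python sketch. It is not the authors' code, which the paper states is not released. Several things here are assumptions: plain SGD stands in for both the main optimizer O and the reference optimizer O(ref), line 11 is read as removing from g its component along Δ, and the helper names (`profit_step`, `orthogonalize`, `noisy_quadratic_grad`, `run_demo`) and the toy quadratic loss are hypothetical.

```python
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthogonalize(a, b, eps=1e-12):
    """Return a with its component along b removed (read here as a ⊥ b)."""
    bb = dot(b, b)
    if bb < eps:  # b is numerically zero: nothing to project out
        return list(a)
    scale = dot(a, b) / bb
    return [x - scale * y for x, y in zip(a, b)]


def sgd_step(theta, g, lr):
    """Plain SGD update; an assumed stand-in for both O and O(ref)."""
    return [t - lr * gi for t, gi in zip(theta, g)]


def profit_step(theta, grad_fn, lr_main, lr_ref, n_ref=1):
    """One outer iteration (lines 2-14). grad_fn(theta) is assumed to
    return the gradient of the loss on a freshly drawn batch."""
    theta_ref = list(theta)                            # line 2: save state
    for _ in range(n_ref):                             # lines 3-6: reference steps
        theta = sgd_step(theta, grad_fn(theta), lr_ref)
    delta = [t - r for t, r in zip(theta, theta_ref)]  # line 7: displacement
    g = grad_fn(theta)                                 # line 8: new-batch gradient
    if dot(delta, g) < 0:                              # lines 9-10: omega < 0
        g = orthogonalize(g, delta)                    # line 11
    return sgd_step(theta_ref, g, lr_main)             # lines 13-14: restore, step


def noisy_quadratic_grad(target, rng, noise=0.1):
    """Hypothetical toy loss 0.5 * ||theta - target||^2 with Gaussian
    gradient noise standing in for minibatch sampling."""
    def grad_fn(theta):
        return [t - c + rng.gauss(0.0, noise) for t, c in zip(theta, target)]
    return grad_fn


def run_demo(n_steps=100, seed=0):
    """Run a few PROFIT iterations on the toy loss and return the weights."""
    rng = random.Random(seed)
    grad_fn = noisy_quadratic_grad([1.0, -1.0], rng)
    theta = [0.0, 0.0]
    for _ in range(n_steps):
        theta = profit_step(theta, grad_fn, lr_main=0.05, lr_ref=0.05)
    return theta
```

Under this reading, whenever ω < 0 the main step uses only the part of the new gradient that is orthogonal to the reference displacement; otherwise the gradient is passed to the main optimizer unchanged.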
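
The learning-rate schedules quoted in the Experiment Setup row reduce to simple arithmetic, sketched below. The sketch assumes a staircase decay (the rate drops at discrete step boundaries, as PyTorch's `StepLR` scheduler does); the paper excerpt does not say which scheduler was used, and the function names are hypothetical.

```python
def step_decay_lr(base_lr, decay, every, step):
    """Staircase decay: multiply base_lr by `decay` once per `every` completed steps."""
    return base_lr * decay ** (step // every)


def pretrain_lr(step):
    """Fit to the original distribution: 1e-2, decayed by 0.9 every 500 steps."""
    return step_decay_lr(1e-2, 0.9, 500, step)


def finetune_lr(step):
    """Fine-tune to the new distribution: 5e-4, decayed by 0.95 every 100 steps."""
    return step_decay_lr(5e-4, 0.95, 100, step)


# The quoted setup fine-tunes for 1500 steps.
FINETUNE_STEPS = 1500
finetune_schedule = [finetune_lr(s) for s in range(FINETUNE_STEPS)]
```

With these values the fine-tuning rate is decayed 14 times over the 1500 steps, so the last step runs at 5e-4 × 0.95^14.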
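
The toy dataset described in the Dataset Splits row could be generated roughly as follows. This is a sketch under stated assumptions: inputs are sampled uniformly from the quoted ranges, |x| is taken to be the Euclidean norm, and the N(0, 1) noise is added to the training targets. The excerpt does not confirm any of these, and `target_fn` and `make_split` are hypothetical names.

```python
import math
import random


def target_fn(x1, x2):
    """Ground truth f(x) = sin(10 * ||x||), assuming the Euclidean norm."""
    return math.sin(10.0 * math.hypot(x1, x2))


def make_split(n, low, high, noisy, rng):
    """Sample n points with both coordinates in [low, high] (uniform sampling
    is an assumption) and optionally add N(0, 1) noise to the targets."""
    data = []
    for _ in range(n):
        x1, x2 = rng.uniform(low, high), rng.uniform(low, high)
        y = target_fn(x1, x2)
        if noisy:
            y += rng.gauss(0.0, 1.0)
        data.append(((x1, x2), y))
    return data


rng = random.Random(0)
# Small sizes for illustration; the quoted setup uses 50000 points per dataset.
original_train = make_split(1000, -1.0, 1.0, noisy=True, rng=rng)
finetune_train = make_split(1000, 0.8, 1.5, noisy=True, rng=rng)
finetune_test = make_split(200, 0.8, 1.5, noisy=False, rng=rng)
```

The two input ranges overlap only on [0.8, 1], so most fine-tuning points lie outside the region covered by the original dataset.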