Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

Optimal Control for Transformer Architectures: Enhancing Generalization, Robustness and Efficiency

Authors: Kelvin Kan, Xingjian Li, Benjamin J. Zhang, Tuhin Sahai, Stanley Osher, Markos A. Katsoulakis

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct seven extensive experiments on tasks motivated by text generation, sentiment analysis, image classification, and point cloud classification. Experimental results show that the framework improves the test performance of the baselines, while being more parameter-efficient.
Researcher Affiliation	Collaboration	Kelvin Kan Department of Mathematics UCLA EMAIL, Xingjian Li Oden Institute University of Texas at Austin EMAIL, Benjamin J. Zhang School of Data Science and Society UNC Chapel Hill EMAIL, Tuhin Sahai SRI International EMAIL, Stanley Osher Department of Mathematics UCLA EMAIL, Markos A. Katsoulakis Department of Mathematics and Statistics University of Massachusetts Amherst EMAIL
Pseudocode	Yes	Algorithm 1 Forward Propagation: Standard Transformer vs OT-Transformer
Open Source Code	Yes	The source code is publicly available at https://github.com/KelvinKan/OT-Transformer.
Open Datasets	Yes	We use the ModelNet40 dataset [90], which is among the most widely used benchmark for point cloud classification [82]. We first conduct a small-scale image classification experiment with the MNIST dataset [52]. We perform sentiment analysis on the IMDb movie review dataset [59]. Lastly, to demonstrate the applicability of our model to large-scale problems, we conduct experiments using the GPT-2 architecture as a baseline on the OpenWebText dataset [36].
Dataset Splits	Yes	MNIST Classification. The dataset consists of hand-written digit images, with 50,000 images used for training and 10,000 images reserved for testing. Cats and Dogs Classification. The dataset contains 25,000 training samples and 12,500 test samples.
Hardware Specification	Yes	Our implementation is based on PyTorch [66] and experiments are conducted using NVIDIA A100 GPUs with 40GB of memory.
Software Dependencies	No	Our implementation is based on PyTorch [66] and experiments are conducted using NVIDIA A100 GPUs with 40GB of memory.
Experiment Setup	Yes	Hyperparameters, including model architectures, number of training epochs, learning rates, and layer normalization, closely follow the original setups. For the continuous-time models, we use the same architecture except that we put a fully-connected layer before the Transformer blocks so that the dimension is consistent for continuous-time dynamics. Also the hidden dimensions d and k of the ISABs are reduced from 256 to 200. We use an Adam optimizer, with batch size 64, 200 training epochs, and learning rate of 1 × 10⁻³.
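
For readers who want to record the quoted training settings in machine-readable form, the following is a minimal sketch, not the authors' code. The class name `TrainConfig`, its field names, and the helper `steps_per_epoch` are hypothetical. The learning rate is read as 1e-3, which is an assumption about the value quoted in the Experiment Setup row, and the row does not say which of the paper's experiments these settings belong to.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the settings quoted in the Experiment Setup row."""
    optimizer: str = "adam"        # "We use an Adam optimizer"
    batch_size: int = 64           # "batch size 64"
    epochs: int = 200              # "200 training epochs"
    learning_rate: float = 1e-3    # assumed reading of the quoted learning rate


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Optimizer steps per epoch, counting a final partial batch (ceiling division)."""
    return -(-num_samples // batch_size)


config = TrainConfig()
print(asdict(config))
```

Keeping the values in a frozen dataclass makes accidental mutation during a run raise an error, which helps when comparing a rerun against the reported setup.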