Zeroth-Order Optimization with Trajectory-Informed Derivative Estimation

Authors: Yao Shu, Zhongxiang Dai, Weicong Sng, Arun Verma, Patrick Jaillet, Bryan Kian Hsiang Low

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Lastly, we use extensive experiments, such as black-box adversarial attack, non-differentiable metric optimization, and derivative-free reinforcement learning, to demonstrate that (a) our trajectory-informed derivative estimation improves over the existing FD methods and that (b) our ZORD algorithm consistently achieves improved query efficiency compared with previous ZO optimization algorithms (Sec. 5)."
Researcher Affiliation | Academia | Dept. of Computer Science, National University of Singapore, Republic of Singapore; Dept. of Electrical Engineering and Computer Science, MIT, USA
Pseudocode | Yes | Algorithm 1: Standard (Projected) GD with Estimated Derivatives; Algorithm 2: ZORD (Ours) (see the first sketch after this table)
Open Source Code | Yes | "For our empirical results, we have provided our detailed experimental settings in Appx. C and included our codes in the supplementary materials (i.e., the zip file)."
Open Datasets | Yes | "we randomly select an image from MNIST (Lecun et al., 1998) (d = 28 × 28) or CIFAR-10 (Krizhevsky et al., 2009) (d = 32 × 32)"; "The Covertype dataset used in Sec. 5.4 is a classification dataset consisting of 581,012 samples from 7 different categories. Each sample from this dataset is a 54-dimensional vector of integers. In this experiment, we randomly split the dataset into training and test sets with each containing 290,506 samples." (see the second sketch after this table)
Dataset Splits | No | No explicit training/validation/test split is described. For the Covertype dataset, the paper mentions a random split into "training and test sets with each containing 290,506 samples", but no validation set is specified.
Hardware Specification | No | No hardware specifications (e.g., CPU or GPU models, memory, or cloud instance types) used for running the experiments are given in the paper.
Software Dependencies | No | No software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, or specific library versions) are listed in the paper.
Experiment Setup | Yes | "Among all our experiments in Sec. 5, the confidence threshold c of our dynamic virtual updates (Sec. 3.2) is set to be 0.35"; "we consistently use n = 10, λ = 0.01 and directions {u_i}_{i=1}^n that are randomly sampled from a unit sphere for the derivative estimation of the FD method (2) applied in the RGF and PRGF algorithm"; "We use the same Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.1 and exponential decay rates of 0.9, 0.999 for RGF, PRGF, GD, and our ZORD algorithm"; "Adam optimizer with the same learning rate of 0.5 and the same exponential decay rates of 0.9, 0.999." (see the third sketch after this table)
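The "Pseudocode" row names Algorithm 1, standard (projected) GD driven by estimated derivatives. Below is a minimal sketch of that skeleton, assuming a generic zeroth-order derivative estimator and a box-shaped feasible set; the names `projected_gd` and `estimate_grad` and the two-point estimator in the usage lines are illustrative, not the authors' implementation.

```python
import numpy as np

def projected_gd(f, estimate_grad, x0, lower, upper, eta=0.1, T=100):
    """Run T steps of x <- Proj_[lower, upper](x - eta * g_hat), where g_hat
    is produced by an arbitrary (zeroth-order) derivative estimator."""
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        g_hat = estimate_grad(f, x)                  # estimated derivative at x
        x = np.clip(x - eta * g_hat, lower, upper)   # projection onto the box
    return x

# Usage with a crude two-point finite-difference estimator on a toy objective.
two_point_fd = lambda f, x, h=1e-3: np.array(
    [(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(x.size)]
)
x_star = projected_gd(lambda z: float((z ** 2).sum()), two_point_fd,
                      x0=[2.0, -1.5], lower=-3.0, upper=3.0)
print(x_star)   # close to the unconstrained minimum at the origin
```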
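The Covertype split quoted in the "Open Datasets" row (581,012 samples divided into two halves of 290,506) can be reproduced with standard scikit-learn utilities. This is a minimal sketch assuming a uniform 50/50 random split; the random seed is an assumption, not taken from the paper.

```python
from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split

# Covertype: 581,012 samples, 54 integer features, 7 classes.
X, y = fetch_covtype(return_X_y=True)

# 50/50 random split -> 290,506 training and 290,506 test samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.5, random_state=0   # seed chosen arbitrarily for illustration
)
print(X_train.shape, X_test.shape)         # (290506, 54) (290506, 54)
```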
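The "Experiment Setup" row quotes the FD baseline configuration (n = 10 directions sampled from the unit sphere, λ = 0.01) and the Adam settings (learning rate 0.1, exponential decay rates 0.9 and 0.999). The sketch below combines the two on a toy black-box objective; it illustrates the quoted hyperparameters rather than reproducing the authors' code, and the quadratic objective is a stand-in.

```python
import torch

def rgf_gradient(f, x, n=10, lam=0.01):
    """Forward finite-difference estimate of grad f(x), averaged over n random
    unit-sphere directions (the FD baseline configuration quoted above)."""
    fx = f(x)
    grad = torch.zeros_like(x)
    for _ in range(n):
        u = torch.randn_like(x)
        u = u / u.norm()                       # uniformly distributed unit direction
        grad += (f(x + lam * u) - fx) / lam * u
    return grad / n

# Toy usage: minimize a black-box quadratic with Adam(lr=0.1, betas=(0.9, 0.999)).
f = lambda z: (z ** 2).sum()
x = torch.nn.Parameter(torch.full((5,), 2.0))
opt = torch.optim.Adam([x], lr=0.1, betas=(0.9, 0.999))
for _ in range(200):
    opt.zero_grad()
    x.grad = rgf_gradient(f, x.detach())       # feed the ZO estimate in as the gradient
    opt.step()
print(x.data)                                  # entries approach 0
```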