Zeroth-Order Optimization with Trajectory-Informed Derivative Estimation
Authors: Yao Shu, Zhongxiang Dai, Weicong Sng, Arun Verma, Patrick Jaillet, Bryan Kian Hsiang Low
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Lastly, we use extensive experiments, such as black-box adversarial attack, non-differentiable metric optimization, and derivative-free reinforcement learning, to demonstrate that (a) our trajectory-informed derivative estimation improves over the existing FD methods and that (b) our ZORD algorithm consistently achieves improved query efficiency compared with previous ZO optimization algorithms (Sec. 5). |
| Researcher Affiliation | Academia | Dept. of Computer Science, National University of Singapore, Republic of Singapore; Dept. of Electrical Engineering and Computer Science, MIT, USA |
| Pseudocode | Yes | Algorithm 1: Standard (Projected) GD with Estimated Derivatives; Algorithm 2: ZORD (Ours). A minimal descent loop in the spirit of Algorithm 1 is sketched after this table. |
| Open Source Code | Yes | For our empirical results, we have provided our detailed experimental settings in Appx. C and included our codes in the supplementary materials (i.e., the zip file). |
| Open Datasets | Yes | we randomly select an image from MNIST (Lecun et al., 1998) (d = 28 × 28) or CIFAR-10 (Krizhevsky et al., 2009) (d = 32 × 32); The Covertype dataset used in Sec. 5.4 is a classification dataset consisting of 581,012 samples from 7 different categories. Each sample from this dataset is a 54-dimensional vector of integers. In this experiment, we randomly split the dataset into training and test sets with each containing 290,506 samples. |
| Dataset Splits | No | No complete training/validation/test split is specified. For the Covertype dataset, the paper mentions a random split into 'training and test sets with each containing 290,506 samples', but no explicit validation set is given. |
| Hardware Specification | No | No specific hardware specifications (e.g., CPU, GPU models, memory, or cloud instance types) used for running the experiments were provided in the paper. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, or specific library versions) were mentioned in the paper. |
| Experiment Setup | Yes | Among all our experiments in Sec. 5, the confidence threshold c of our dynamic virtual updates (Sec. 3.2) is set to 0.35; we consistently use n = 10, λ = 0.01, and directions {u_i} (i = 1, …, n) randomly sampled from a unit sphere for the derivative estimation of the FD method (2) applied in the RGF and PRGF algorithms; we use the same Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.1 and exponential decay rates of 0.9, 0.999 for RGF, PRGF, GD, and our ZORD algorithm; elsewhere, an Adam optimizer with the same learning rate of 0.5 and the same exponential decay rates of 0.9, 0.999 is used. A minimal sketch of this FD estimator follows the table. |
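
The FD settings quoted in the Experiment Setup row (n = 10 directions sampled from the unit sphere, smoothing parameter λ = 0.01) match the standard random gradient-free (RGF) finite-difference estimator. Below is a minimal sketch of such an estimator; the function name, signature, and NumPy implementation are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def rgf_gradient_estimate(f, x, n=10, lam=0.01, rng=None):
    """Random gradient-free (RGF) finite-difference gradient estimate.

    Averages forward differences of the black-box objective f along n
    directions sampled uniformly from the unit sphere, with smoothing
    parameter lam (the paper's reported defaults are n = 10, lam = 0.01).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    fx = f(x)                            # one query at x, reused for all directions
    grad = np.zeros_like(x)
    for _ in range(n):
        u = rng.standard_normal(x.shape)
        u /= np.linalg.norm(u)           # uniform random direction on the unit sphere
        grad += (f(x + lam * u) - fx) / lam * u
    return grad / n
```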
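
The Pseudocode row names Algorithm 1, "Standard (Projected) GD with Estimated Derivatives". The sketch below shows a generic zeroth-order descent loop in that spirit, reusing `rgf_gradient_estimate` from above; the loop structure, step size, and `project` hook are illustrative assumptions rather than the paper's exact algorithm, whose derivative estimates are trajectory-informed rather than FD-based.

```python
def zo_projected_gd(f, x0, steps=100, lr=0.1, project=None, **fd_kwargs):
    """Generic (projected) gradient descent driven by estimated derivatives.

    Each iteration replaces the true gradient with a zeroth-order estimate,
    then optionally projects the iterate back onto the feasible set.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = rgf_gradient_estimate(f, x, **fd_kwargs)  # ZO gradient estimate
        x = x - lr * g
        if project is not None:
            x = project(x)               # projection onto the feasible set
    return x

# Example: minimize a simple quadratic using only function-value queries.
x_min = zo_projected_gd(lambda x: float(np.sum(x ** 2)), np.ones(5), steps=200)
```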