Zeroth-Order Optimization Meets Human Feedback: Provable Learning via Ranking Oracles

Authors: Zhiwei Tang, Dmitry Rybin, Tsung-Hui Chang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this study, we delve into an emerging optimization challenge involving a black-box objective function that can only be gauged via a ranking oracle, a situation frequently encountered in real-world scenarios, especially when the function is evaluated by human judges. ... Throughout experiments, we found that ZO-RankSGD can significantly enhance the detail of generated images with only a few rounds of human feedback."
Researcher Affiliation | Academia | Zhiwei Tang (1,3), Dmitry Rybin (2), Tsung-Hui Chang (1,3); (1) School of Science and Engineering and (2) School of Data Science, The Chinese University of Hong Kong, Shenzhen, China; (3) Shenzhen Research Institute of Big Data, Shenzhen, China
Pseudocode | Yes | Algorithm 1: ZO-RankSGD; Algorithm 2: Line search strategy for gradient-based optimization algorithms; Algorithm 3: Modified ZO-RankSGD algorithm for optimizing latent embeddings of Stable Diffusion. (A hedged sketch of Algorithms 1-2 appears after this table.)
Open Source Code | No | The paper does not include an explicit statement about releasing code or a link to a code repository for the methodology described.
Open Datasets | Yes | "Specifically, we adopt a similar experimental setup as (Cai et al., 2022; Duan et al., 2016), where the goal is to learn a policy for simulated robot control with several problems from the MuJoCo suite of benchmarks (Todorov et al., 2012)." (A hypothetical simulator-based ranking oracle is sketched after this table.)
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with percentages or sample counts. While it mentions tuning hyperparameters, it does not specify how data was partitioned for that purpose.
Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU models, CPU types, memory) used for running its experiments.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies (e.g., Python, PyTorch, specific libraries or frameworks).
Experiment Setup | Yes | "To ensure a meaningful comparison, we fix the number of queries m = 15 at each iteration for all algorithms. For the gradient-based algorithms ZO-SGD, SCOBO, and our ZO-RankSGD, we use 10 query points for gradient estimation and 5 points for the line search. ... we set the step size η to 50 and the smoothing parameter µ to 0.01 for Algorithm 1 with line search (where l = 5 and γ = 0.1). ... Both the optimization from human feedback and the CLIP similarity score used the same parameters for Algorithm 3: η = 1, µ = 0.1, and γ = 0.5."
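
To make the ranking-oracle setting concrete, here is a minimal Python sketch of one ZO-RankSGD-style iteration together with a ranking-based line search (the role of the paper's Algorithm 2). It assumes a full-ranking oracle, and the pairwise-difference gradient estimator below is one plausible reading of the rank-based estimation idea, not a transcription of the paper's exact formula; `zo_ranksgd_step` and the fallback candidate are illustrative choices. Defaults mirror the quoted setup (m = 15 queries split 10 + 5, η = 50, µ = 0.01, l = 5, γ = 0.1).

```python
import numpy as np

def zo_ranksgd_step(x, ranking_oracle, m=10, mu=0.01, eta=50.0,
                    l=5, gamma=0.1, rng=None):
    """One ZO-RankSGD-style iteration (a sketch, not the authors' code).

    `ranking_oracle(points)` must return the indices of `points` ordered
    from best (lowest objective) to worst, mimicking a human judge.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]

    # 1) Query the oracle on m Gaussian perturbations of the current point.
    u = rng.standard_normal((m, d))
    order = ranking_oracle([x + mu * ui for ui in u])

    # 2) Rank-based gradient estimate: for each ranked pair (better i,
    #    worse j), the difference (u_j - u_i) points uphill on average.
    g = np.zeros(d)
    for a in range(m):
        for b in range(a + 1, m):
            i, j = order[a], order[b]
            g += (u[j] - u[i]) / mu
    g /= m * (m - 1) / 2

    # 3) Ranking-based line search over l geometrically shrunk step sizes.
    #    Keeping the current iterate as a fallback candidate is our
    #    addition: it makes the step monotone under an exact oracle.
    candidates = [x] + [x - eta * (gamma ** k) * g for k in range(l)]
    best = ranking_oracle(candidates)[0]
    return candidates[best]
```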
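
For instance, the same step can be driven by a programmatic judge that ranks candidates by a hidden objective (here a quadratic, standing in for a human):

```python
f = lambda z: float(np.sum(z ** 2))                      # hidden objective
oracle = lambda pts: sorted(range(len(pts)), key=lambda i: f(pts[i]))

x = np.full(20, 5.0)
for _ in range(100):
    x = zo_ranksgd_step(x, oracle, m=10, mu=0.01, eta=50.0, l=5, gamma=0.1)
print(f(x))  # should be well below f at the starting point (500.0)
```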
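
For the MuJoCo experiments, the judge can likewise be simulated: candidates are policy parameter vectors and the oracle ranks them by episodic return. A hypothetical sketch follows; the gymnasium API, the linear tanh policy, and the Swimmer task are illustrative stand-ins, not details confirmed by the paper.

```python
import numpy as np
import gymnasium as gym  # assumed dependency: gymnasium[mujoco]

def negated_return(env_id, theta, horizon=1000, seed=0):
    """Roll out a linear policy a = tanh(W @ obs) and return the negated
    episode return, so ranking in ascending order puts the best policy first."""
    env = gym.make(env_id)
    obs, _ = env.reset(seed=seed)
    d_obs = env.observation_space.shape[0]
    d_act = env.action_space.shape[0]
    W = theta.reshape(d_act, d_obs)
    total = 0.0
    for _ in range(horizon):
        obs, reward, terminated, truncated, _ = env.step(np.tanh(W @ obs))
        total += reward
        if terminated or truncated:
            break
    env.close()
    return -total

def policy_ranking_oracle(thetas, env_id="Swimmer-v4"):
    # Fixed seed so all candidates face the same episode realization
    # (our simplification; the paper's evaluation protocol may differ).
    scores = [negated_return(env_id, th) for th in thetas]
    return sorted(range(len(thetas)), key=lambda i: scores[i])
```

Plugged into `zo_ranksgd_step` above, this turns policy search into pure ranking-oracle optimization, consistent with the paper's premise that only rankings, not raw objective values, are observed.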