Zeroth-Order Optimization Meets Human Feedback: Provable Learning via Ranking Oracles
Authors: Zhiwei Tang, Dmitry Rybin, Tsung-Hui Chang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this study, we delve into an emerging optimization challenge involving a black-box objective function that can only be gauged via a ranking oracle, a situation frequently encountered in real-world scenarios, especially when the function is evaluated by human judges. ... Throughout experiments, we found that ZO-RankSGD can significantly enhance the detail of generated images with only a few rounds of human feedback. |
| Researcher Affiliation | Academia | Zhiwei Tang (1,3), Dmitry Rybin (2), Tsung-Hui Chang (1,3). (1) School of Science and Engineering and (2) School of Data Science, The Chinese University of Hong Kong, Shenzhen, China; (3) Shenzhen Research Institute of Big Data, Shenzhen, China |
| Pseudocode | Yes | Algorithm 1: ZO-RankSGD; Algorithm 2: Line search strategy for gradient-based optimization algorithms; Algorithm 3: Modified ZO-RankSGD algorithm for optimizing latent embeddings of Stable Diffusion. |
| Open Source Code | No | The paper does not include an explicit statement about releasing code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | Specifically, we adopt a similar experimental setup as (Cai et al., 2022; Duan et al., 2016), where the goal is to learn a policy for simulated robot control with several problems from the MuJoCo suite of benchmarks (Todorov et al., 2012). |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with percentages or sample counts. While it mentions tuning hyperparameters, it does not specify data partitioning for these processes. |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU models, CPU types, memory) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies (e.g., Python, PyTorch, specific libraries or frameworks). |
| Experiment Setup | Yes | To ensure a meaningful comparison, we fix the number of queries m = 15 at each iteration for all algorithms. For gradient-based algorithms, ZO-SGD, SCOBO, and our ZO-RankSGD, we use 10 query points for gradient estimation and 5 points for the line search. ... we set the step size η to 50 and the smoothing parameter µ to 0.01 for Algorithm 1 with line search (where l = 5 and γ = 0.1). ... Both the optimization from human feedback and CLIP similarity score used the same parameters for Algorithm 3: η = 1, µ = 0.1, and γ = 0.5. |
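The setup row above quotes the key hyperparameters (m = 15 queries per iteration, step size η, smoothing parameter µ). The sketch below illustrates the general shape of one zeroth-order step driven purely by a ranking oracle, in the spirit of the paper's ZO-RankSGD. It is a simplified stand-in, not the paper's exact rank-based gradient estimator: the linearly decreasing rank weights, the `rank_oracle`/`make_oracle` helpers, and the toy quadratic objective are all assumptions made for illustration.

```python
import numpy as np

def zo_rank_sgd_step(x, rank_oracle, m=15, mu=0.01, eta=50.0, rng=None):
    """One simplified zeroth-order step using only ranking feedback.

    rank_oracle(points) returns candidate indices sorted best-to-worst;
    no function values are ever observed. Directions are weighted by
    rank (a simplified surrogate for the paper's estimator), and x is
    moved toward well-ranked perturbations and away from poorly ranked
    ones.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal((m, x.size))   # m random perturbation directions
    order = rank_oracle(x + mu * u)        # indices, best first
    # Linearly decreasing weights: +1 for the best direction, -1 for the worst.
    weights = np.linspace(1.0, -1.0, m)
    g = np.zeros_like(x)
    for w, idx in zip(weights, order):
        g -= w * u[idx]                    # accumulate a descent surrogate
    return x - eta * mu * g / m

def make_oracle(target):
    """Toy ranking oracle: ranks candidates by a hidden quadratic loss."""
    def oracle(points):
        losses = ((points - target) ** 2).sum(axis=1)
        return np.argsort(losses)
    return oracle
```

Because only the *ordering* of the queried points is consumed, the same loop works whether the oracle is a synthetic loss (as here) or rounds of human feedback, which is the property the paper exploits.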