GREY-BOX BAYESIAN OPTIMIZATION: IMPROVING PERFORMANCE BY LOOKING INSIDE THE BLACK-BOX A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy by Saul Toscano Palmerin May 2020 © 2020 Saul Toscano Palmerin ALL RIGHTS RESERVED GREY-BOX BAYESIAN OPTIMIZATION: IMPROVING PERFORMANCE BY LOOKING INSIDE THE BLACK-BOX Saul Toscano Palmerin, Ph. Cornell University 2020 Non-convex time-consuming objectives are often optimized using black-box opti- mization. These approaches assume very little about the objective. While broadly ap- plicable, these approaches typically require more evaluations than methods exploiting more problem structure.
In particular, often, we can acquire information about the ob- jective function in ways other than direct evaluation, which is less time-consuming than evaluating the objective directly. This allows us to develop novel Bayesian optimization algorithms that outperform methods that rely only objective function evaluations. In this thesis, we consider three problems: optimization of sum and integrals of expensive-to- evaluate integrands; optimizing hyperparameters for iteratively trained supervised learn- ing machine learning algorithms; and optimizing non-convex functions with a new effi- cient multistart stochastic gradient descent algorithm. BIOGRAPHICAL SKETCH Saul Toscano Palmerin is a PhD student in the School of Operations Research and In- formation Engineering (ORIE) of Cornell University.
His research focuses on the de- velopment and analysis of Bayesian optimization algorithms, and their application for training machine learning algorithms and optimization via simulation. Previously, from 2016 to 2017, he did a one year internship at Uber as a data scientist, where he worked on pricing and transportation problems. Before joining Cornell, he received a B. de- gree in Mathematics by the Center for Mathematical Research (CIMAT) in Mexico.
iii To my father, Miguel Angel Toscano Medina, who supports me every second of my life. iv ACKNOWLEDGEMENTS First, I would like to thank my advisor, Professor Peter Frazier, for his support, help and guidance. He has been a fundamental mentor, who has taught me the beauty of operations research. I would also like to thank everyone at Cornell in general, and particularly within my department of Operations Research and Information Engineering, for the fantastic intellectual atmosphere they create.
I would particularly like to thank Shane Henderson, David Bindel, and all other faculty for their support. I am grateful to my parents, Marisol Palmerin Cerna and Miguel Angel Toscano Medina, for their love and support, and for teaching me that life is beautiful and in colors. Finally, I thank my wife, Anna Pantielieieva, who is my home. v TABLE OF CONTENTS Biographical Sketch.
v Table of Contents. vi List of Tables. viii List of Figures .1 Gaussian Process Regression .1 Choosing a Mean Function and Kernel. 6 2 Bayesian Optimization with Expensive Integrands 8 2.2 Conceptual Description of the BQO Algorithm .3 Computation of the BQO Algorithm .1 Preliminary Representation of the Value of Information .2 Discretization-Free Computation of the Value of Information and its Gradient .3 Computation and Complexity of the BQO Algorithm .4 Discretized Computation of the Value of Information and its Gradient .4 Asymptotic Analysis for BQO .1 An Analytic Test Problem .2 New York City’s Citi Bike System .3 Optimal Design of an Aircraft .4 Newsvendor Problem under Dynamic Consumer Substitution .6 Problems Simulated from Gaussian Process Priors.
53 3 Practical Multi-fidelity Bayesian Optimization for Hyperparameter Tuning 55 3.2 THE taKG AND taKG0/ ACQUISTION FUNCTIONS .2 Valuing Trace Observations .3 Trace-aware Knowledge Gradient (taKG) .5 Efficiently maximizing taKG and taKG0/ .6 Warm-starting from Partial Runs .7 Batch and Derivative Evaluations .1 Optimizing Synthetic Functions .2 Optimizing Hyperparameters of Neural Nets .3 Optimizing Hyperparameters for Large-scale Kernel Learning. 78 4 Effort Allocation and Statistical Inference for Multistart Stochastic Gradi- ent Descent 79 4.2 The SGD-GP Statistical Model .1 Inference Over M(n) in One-Dimension Given Hyperparameter θ 87 4.2 Inference Over M(n) in Multiple Dimensions, Marginalizing over Hyperparameter θ .3 The Most Likely To Succeed Allocation Rule .1 A Concave Objective Function .2 An Objective Function with Many Local Maxima .3 Objective Function with a Vanishing Gradient .4 The 20-Dimensional Rosenbrock Function. 101 A Proofs for Chapter 2 102 A.1 Proofs of Results in Section 2.2 BQO’s Time and Space Complexity .3 Closed-Form Expressions for the Gaussian and Squared Exponential Kernel Case .4 Illustration of Poor Performance of the Multi-Task Algorithm .5 Consistency of BQO .1 Consistency of BQO for Finite Domains .2 Consistency of BQO for Continuum Domains. 128 B Proofs and Experiments Details for Chapter 3 151 B.2 GPs for Hyperparameter Optimization .3 Additional experimental details .2 Real-world experiments.
158 C Appendix for Chapter 4 159 C.1 SGD Convergence Theorems. 159 vii LIST OF TABLES 2.1 Table of Notation.2 Probability distribution of w = (x2 , x3 ) for the Branin problem from §2. 49 viii LIST OF FIGURES 2.1 Illustration of the BQO algorithm on an analytic test problem after eval- uating F at points chosen uniformly at random in an initial phase of training and n = 9 points chosen by BQO.2 Illustration of a traditional Bayesian optimization algorithm in the same problem setting as Figure 2. The algorithm pictured is the knowledge gradient (KG) method [Frazier et al.
This algorithm evalu- ates G, unlike BQO’s evaluations of F. As a consequence, it tends to provide lower-quality estimates of G within a given sampling budget.3 Performance comparison between BQO and two Bayesian optimization benchmark, the KG and EI methods, on the analytic test problem (2.2, as described in §2.4 Performance results for the Citi Bike problem (plot a), and a screenshot from our simulation of the Citi Bike problem (plot b), as described in §2.5 Performance results for the aircraft problem §2.6 Performance results for the newsvendor problem with dynamic cus- tomer substitution §2. BQO quickly finds initial inventory levels that provide substantially higher profit than competing methods.7 Performance comparison between BQO, the SDE algorithm [Williams et al., 2000], and the multi-task algorithm [Swersky et al., 2013] on the Branin problem from §2.8 Normalized performance difference between BQO and KG in problems simulated from a Gaussian process, as a function of β , which measures how quickly F(x, w) varies with w, the approximate variance reduction ratio A, and the overall number of samples. BQO outperforms KG over most of the parameter space, and is approximately 10 times better when β is near exp(4).1 Optimizing synthetic functions: Plots show simple regret over 40 independent runs for synthetic functions with trace observations and one or two continuous fidelity controls for 2-d Branin, 3-d Rosenbrock, 3-d Hartmann, and 6-d Hart- mann problems.
q indicates batch size for fixed batch-size methods. taKG0 outperforms competitors in both sequential and batch settings.2 We show the validation error for tuning feedforward neural networks on MNIST (each with 20 runs); tuning convolutional neural networks on CIFAR- 10 and SVHN (each with 10 runs); for KISS-GP kernel learning we show -log marginal likelihood divided by the number of datapoints. q indicates batch size for fixed batch-size methods. taKG0/ outperforms competitors in both se- quential and batch settings.1 The SGD-GP statistical model (b) and MLS allocation rule (c) on the problem (a) from §4.
(b) shows that SGD-GP can predict the start’s limiting objective value with high precision after 10 iterations. (c) shows that our policy finds the optimal solution faster than the equal allocation policy.2 The SGD-GP statistical model (b) and MLS allocation rule (c) on the problem (a) from §4. (b) shows that SGD-GP can predict the start’s limiting objective value with high precision after 20 iterations. (c) shows that our policy does better than the other policies after only 8 iterations.3 The SGD-GP statistical model (b) and MLS allocation rule (c) on the problem (a) from §4.
(b) shows that SGD-GP can predict a start’s limiting objective value with high precision after 15 iterations. (c) shows that our policy does better than the other policies after 10 it- erations.4 Performance comparison between MLS , equal allocation, and random allocation using 30 starting points. Our policy does much better than the other policies after only 45 iterations. After only 100 iterations, our policy finds the global optimum while the other policies are far away from a good solution.
This suggests MLS works well on high- dimensional problems where the objective function is noisy. 100 x CHAPTER 1 INTRODUCTION Bayesian optimization algorithms effectively find approximate global optima of non-convex derivative-free time-consuming or expensive-to-evaluate objective func- tions (black-boxes). These objective functions appear when tuning a machine learning algorithm's hyperparameters in deep neural networks [Snoek et al., 2012], designing aircraft [Liem et al., 2014], and choosing parameters in ride-sharing dispatch systems. Bayesian optimization algorithms use Gaussian process regression to build a surrogate for the objective, and an acquisition function, often based on value of information anal- ysis, to choose points at which to evaluate the objective function.
In a variety of important problems, computing the objective function requires per- forming a sequence of steps, and the results from each step provides information about the objective function. While typically time-consuming, performing these steps is less time-consuming than direct evaluations. This opens an opportunity to outperform meth- ods that rely solely on direct objective function evaluations through intelligent choose which steps to perform. In specific, we can use a Bayesian optimization approach, where we use Bayesian statistics to infer the objective function, and value of information anal- ysis to choose points at which to perform those steps.
We refer to this approach as “grey-box” Bayesian optimization. In this thesis, we consider three settings. In chapter 2, based on the papers Toscano-Palmerin and Frazier [2018c, 2016], we consider non-convex derivative-free time-consuming (or “expensive”) objectives that are the sum or integral of a larger number of less time-consuming objectives. These objec- tives arise in designing aircraft, tuning parameters of simulators, and machine learning algorithm's hyperparameters.
We propose a new Bayesian optimization algorithm that leverages this structure to improve performance. Our proposed Bayesian optimization 1 method is average-case optimal by construction when a single evaluation of the inte- grand remains within our evaluation budget, and consistent for objective functions that are sums. In numerical experiments comparing against previous state-of-the-art meth- ods, including those that also leverage sum or integral structure, our method performs as well or better across a wide range of problems and offers significant improvements when evaluations are noisy or the integrand varies smoothly in the integrated variables. In chapter 3, based on the paper Wu et al.
[2019], we consider the problem of tuning hyperparameters for iteratively trained supervised learning models, such as deep neu- ral networks. In these settings, when choosing some hyperparameters to optimize, we can approximate their optimal solution by choosing the training size, validation size, and number of training iterations. By consider approximations, an algorithm can use low-fidelity evaluations to quickly identify a smaller set of promising hyperparameters, and then later focus on more expensive high-fidelity evaluations within this set to refine its estimates. We develop a new efficient Bayesian optimization algorithm that lever- ages the use of fidelity evaluations, and that we observe a full trace of performance with respect to training iterations rather than just a single performance value at the chosen fidelity.
Numerical experiments show that our method outperforms state-of-the-art al- gorithms when tuning feedforward neural networks on MNIST, tuning convolutional neural networks on CIFAR-10 and SVHN, and in large-scale kernel learning. In chapter 4, based on the working journal paper Toscano-Palmerin and Frazier [2019] and Toscano-Palmerin and Frazier [2018b], we propose an allocation rule for multistart stochastic gradient descent methods for non-convex optimization. While these methods are effective for global optimization, they seem to waste computational re- sources: starts often converge to local optima or stationary points that are the same or worse than those found by other starts, failing to produce useful information. We de- 2 velop a rule for allocation computational effort across starts, which uses computation more efficiently by allocation more resources to the most promising starts.