bayesian optimization

An interactive demo of Bayesian optimization implemented from scratch using only NumPy math operations. This uses Gaussian process regression and an acquisition function to balance sampling high-value vs. high-uncertainty regions.

initializing…

controls

18
0.02 (wiggly)0.50 (smooth)
0.14.0
0.001 (trust observations)0.500 (noisy)
0.5 (exploit)5.0 (explore)
acquisition:
samples0
best y
best x
next proposal
step0
GP posterior — mean ± 2σ, observed samples, and next proposed point
acquisition function — maximum marks the next proposed point

what is bayesian optimization?

Bayesian optimization is a strategy for finding the maximum of an unknown function that is expensive to evaluate, using as few evaluations as possible. Instead of brute-force sampling, it builds a probabilistic model of what the function may be and uses an acquisition function to decide where to sample next.

1

place a prior

Start with a Gaussian process, a probability distribution over functions encoding our prior beliefs about smoothness and scale.

2

observe

Evaluate the true function at a proposed point. The GP updates its beliefs (posterior) to reflect the new data.

3

acquire

An acquisition function scores every candidate point by trading off high predicted mean (exploitation) against high uncertainty (exploration).

4

repeat 2-3

Repeat as many times as we want by sampling at the acquisition function's maximum, then updating the GP for the new point.

gaussian process regression

Gaussian Process regression uses a kernel with given parameters to establish the prior (the set of functions we are considering, maybe lumpier or smoother depending on the parameters) and narrows it down based on new data. Given two points, the kernel aims to express how correlated they are with each other. For this demo, I used a radial basis function (RBF) kernel, AKA a Gaussian kernel.

RBF kernel

k(x, x') = σ²f · exp( −‖x − x'‖² / (2 ℓ²) )

Controls how correlation decays as distance increases. Higher makes our prior a set of smoother, more slowly varying functions.

posterior equations

We get the mean and variance of the posterior at a given (testing) point x* as follows, with X as our existing (training) data.

K = k(X, X) + σ²n · I ← training covariance + noise k* = k(X, x*) ← train–test covariance k** = k(x*, x*) ← prior variance at x* μ(x*) = k*ᵀ K⁻¹ y ← posterior mean σ²(x*) = k** − k*ᵀ K⁻¹ k* ← posterior variance

I implemented matrix inversion via Cholesky decomposition for more efficiency.

acquisition functions

Our acquisition function balances sampling points that we know have high values (exploitation) with sampling points with lots of uncertainty (exploration). For instance, say we're looking for oil: drilling is costly, but we want to find areas with lots of oil, so we have to make tradeoffs between drilling in places we know have a lot of oil and trying places where we don't yet have enough information to know whether there's oil (high-uncertainty areas). For this demo, I used two different acquisition functions for the user to choose from.

upper confidence bound (UCB)

a(x) = μ(x) + κ · σ(x)

This acquisition function is just a weighted sum. κ directly controls the tradeoff between exploration and exploitation; higher κ favors exploration.

expected improvement (EI)

Z = (μ(x) − f⁺ − ξ) / σ(x) a(x) = (μ(x) − f⁺ − ξ) · Φ(Z) + σ(x) · φ(Z)

This function measures how much improvement over the current best value f⁺ we expect. Φ and φ are standard normal CDF and PDF; ξ is an exploration bonus.