---
title: "Alternating optimization"
description: >
  Learn how to get started with alternating optimization in R.
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Alternating optimization}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: ref.bib
link-citations: true
---

```{r, setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "ao-"
)
library("ao")
set.seed(2)
```

The `{ao}` R package implements the alternating optimization (AO) approach. This vignette provides an overview of the package. For theoretical results on AO, refer to:

- @bezdek:2002, who explain how AO can avoid getting stuck in local optima
- @hu:2002, who show that AO can speed up optimization
- @bezdek:2003, who provide more details on the convergence speed of AO

## What actually is alternating optimization?

Alternating optimization (AO) is an iterative procedure used to optimize a multivariate function by breaking it down into simpler sub-problems. It involves optimizing over one block of function parameters while keeping the others fixed, and then alternating this process among the parameter blocks. AO is particularly useful when the sub-problems are easier to solve than the original joint optimization problem, or when there is a natural partition of the parameters.

Mathematically, consider a real-valued **objective** function $f(\mathbf{x}, \mathbf{y})$ where $\mathbf{x}$ and $\mathbf{y}$ are two **blocks** of function parameters, namely a **partition** of the parameters. The AO procedure can be described as follows:

1. **Initialization**: Start with initial guesses $\mathbf{x}^{(0)}$ and $\mathbf{y}^{(0)}$.

2. **Iterative Steps**: For $k = 0, 1, 2, \dots$

   - **Step 1**: Fix $\mathbf{y} = \mathbf{y}^{(k)}$ and solve the sub-problem
     $$\mathbf{x}^{(k+1)} = \arg \min_{\mathbf{x}} f(\mathbf{x}, \mathbf{y}^{(k)}).$$
   - **Step 2**: Fix $\mathbf{x} = \mathbf{x}^{(k+1)}$ and solve the sub-problem
     $$\mathbf{y}^{(k+1)} = \arg \min_{\mathbf{y}} f(\mathbf{x}^{(k+1)}, \mathbf{y}).$$

3. **Convergence**: Repeat the iterative steps until a **convergence criterion** is met, such as when the change in the objective function or the parameters falls below a specified threshold, or when a pre-defined iteration limit is reached.

The AO procedure can be

- viewed as a generalization of joint optimization, where the parameter partition is trivial, consisting of the entire parameter vector as a single block,
- also used for maximization problems by simply replacing $\arg \min$ by $\arg \max$ above,
- generalized to more than two parameter blocks, i.e., for $f(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$, the procedure involves cycling through each parameter block $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ and solving the corresponding sub-problems iteratively (the parameter blocks do not necessarily have to be disjoint),
- **randomized** by changing the parameter partition randomly after each iteration, which can further improve the convergence rate and help avoid getting trapped in local optima [@chib:2010],
- run in **multiple threads** for different initial values, parameter partitions, and/or base optimizers.

## Now how to use the `{ao}` package?

The `{ao}` package offers the function `ao()`, which can be used to perform different variants of alternating optimization.
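To make the two-step procedure from above concrete before turning to `ao()` itself, here is a minimal hand-rolled sketch of an AO loop. The function name `ao_by_hand`, the block structure, the solver, and the stopping rule are illustrative choices only, not how `ao()` is implemented internally:

```r
# minimal illustrative AO loop over parameter blocks, using stats::optim()
# as the sub-problem solver; not the internal implementation of ao()
ao_by_hand <- function(f, initial, blocks = list(1, 2),
                       iterations = 100, tolerance = 1e-6) {
  x <- initial
  value <- f(x)
  for (k in seq_len(iterations)) {
    for (block in blocks) {
      # optimize f over the active block, keeping the other parameters fixed
      sub <- function(x_block) {
        x[block] <- x_block
        f(x)
      }
      x[block] <- stats::optim(x[block], sub, method = "L-BFGS-B")$par
    }
    value_new <- f(x)
    # stop when the improvement falls below the threshold
    if (abs(value - value_new) < tolerance) break
    value <- value_new
  }
  list(estimate = x, value = value_new)
}

# example: minimize a simple quadratic with minimum at (1, 2)
ao_by_hand(function(x) (x[1] - 1)^2 + (x[2] - 2)^2, initial = c(0, 0))
```

The `ao()` function generalizes this idea with random partitions, multiple stopping criteria, parameter bounds, and exchangeable base optimizers, as described next.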
### The function call

The `ao()` function call with the default arguments looks as follows:

```{r, ao call, eval = FALSE}
ao(
  f,
  initial,
  target = NULL,
  npar = NULL,
  gradient = NULL,
  ...,
  partition = "sequential",
  new_block_probability = 0.5,
  minimum_block_number = 2,
  minimize = TRUE,
  lower = -Inf,
  upper = Inf,
  iteration_limit = Inf,
  seconds_limit = Inf,
  tolerance_value = 1e-6,
  tolerance_parameter = 1e-6,
  tolerance_parameter_norm = function(x, y) sqrt(sum((x - y)^2)),
  tolerance_history = 1,
  base_optimizer = Optimizer$new("stats::optim", method = "L-BFGS-B"),
  verbose = FALSE,
  hide_warnings = TRUE
)
```

The arguments have the following meaning:

- `f`: The objective function to be optimized. By default, `f` is optimized over its first argument. If optimization should target a different argument or multiple arguments, use `npar` and `target`, [see below](#generalized-objective-functions). Additional arguments for `f` can be passed via the `...` argument as usual.
- `initial`: Initial values for the parameters used in the AO procedure.
- `gradient`: Optional argument to specify the analytical gradient of `f`. If not provided, a finite-difference approximation is used.
- `partition`: Specifies how parameters are partitioned for optimization. Can be one of the following:
  - `"sequential"`: Optimizes each parameter block sequentially.
  - `"random"`: Randomly partitions parameters in each iteration.
  - `"none"`: No partitioning; equivalent to joint optimization.
  - A custom partition can be defined using a list of vectors of parameter indices.
- `new_block_probability` and `minimum_block_number` are only relevant if `partition = "random"`. In this case, the former controls the probability of creating a new block, and the latter defines the minimum number of parameter blocks.
- `minimize`: Set to `TRUE` for minimization problems (default), or `FALSE` for maximization.
- `lower` and `upper`: Lower and upper limits for constrained optimization.
- `iteration_limit` is the maximum number of AO iterations before termination, while `seconds_limit` is the time limit in seconds. `tolerance_value` and `tolerance_parameter` (in combination with `tolerance_parameter_norm`) specify two further stopping criteria: the procedure stops when the change in the function value or in the parameter vector, compared to the state `tolerance_history` iterations earlier, falls below the respective threshold.
- `base_optimizer`: Numerical optimizer used for solving the sub-problems.
- Set `verbose` to `TRUE` to print status messages, and `hide_warnings` to `FALSE` to show warning messages during the AO process.
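As a small sketch of how the stopping-criteria arguments interact (the quadratic objective here is just a placeholder), the following call stops once the function value has changed by less than `1e-8` over the last two iterations, or after at most 20 iterations, whichever comes first:

```r
# placeholder objective with minimum at (1, 2, 3)
quadratic <- function(x) sum((x - 1:3)^2)

ao(
  f = quadratic,
  initial = rep(0, 3),
  iteration_limit = 20,   # at most 20 AO iterations
  tolerance_value = 1e-8, # stop when the value change ...
  tolerance_history = 2   # ... over the last 2 iterations is below 1e-8
)
```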
### A simple first example

The following is an implementation of [Himmelblau's function](https://en.wikipedia.org/wiki/Himmelblau%27s_function)
$$f(x, y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2:$$

```{r, himmelblau}
himmelblau <- function(x) (x[1]^2 + x[2] - 11)^2 + (x[1] + x[2]^2 - 7)^2
```

This function has four identical local minima, for example at $x = 3$ and $y = 2$:

```{r, himmelblau evaluate}
himmelblau(c(3, 2))
```

```{r, visualize_himmelblau, echo = FALSE, message = FALSE, warning = FALSE, fig.align = 'center', out.width = "80%", fig.dim = c(8, 6)}
library("ggplot2")
x <- y <- seq(-5, 5, 0.1)
grid <- expand.grid(x, y)
grid$z <- apply(grid, 1, himmelblau)
ggplot(grid, aes(x = Var1, y = Var2, z = z)) +
  geom_raster(aes(fill = z)) +
  geom_contour(colour = "white", bins = 40) +
  scale_fill_gradient(low = "blue", high = "red") +
  theme_linedraw() +
  labs(
    x = "x", y = "y", fill = "value",
    title = "Himmelblau function",
    subtitle = "the four local minima are marked in green"
  ) +
  coord_fixed() +
  annotate(
    "text",
    x = c(3, -2.8, -3.78, 3.58),
    y = c(2, 3.13, -3.28, -1.85),
    label = "X", size = 6, color = "green"
  )
```

Minimizing Himmelblau's function through alternating minimization over $x$ and $y$ with initial values $x^{(0)} = y^{(0)} = 0$ can be accomplished as follows:

```{r, ao himmelblau}
ao(f = himmelblau, initial = c(0, 0))
```

Here, we see the output of the alternating optimization procedure, which is a `list` that contains the following elements:

- `estimate` is the parameter vector at termination.
- `value` is the function value at termination.
- `details` is a `data.frame` with full information about the procedure: For each iteration (column `iteration`) it contains the function value (column `value`), parameter values (columns starting with `p` followed by the parameter index), the active parameter block (columns starting with `b` followed by the parameter index, where `1` stands for a parameter contained in the active parameter block and `0` otherwise), and computation times in seconds (column `seconds`).
- `seconds` is the overall computation time in seconds.
- `stopping_reason` is a message explaining why the procedure terminated.

### Using the analytical gradient

For Himmelblau's function, it is straightforward to define the analytical gradient:

```{r, himmelblau gradient}
gradient <- function(x) {
  c(
    4 * x[1] * (x[1]^2 + x[2] - 11) + 2 * (x[1] + x[2]^2 - 7),
    2 * (x[1]^2 + x[2] - 11) + 4 * x[2] * (x[1] + x[2]^2 - 7)
  )
}
```

The gradient function is used by `ao()` if supplied via the `gradient` argument:

```{r, ao himmelblau with gradient, eval = FALSE}
ao(f = himmelblau, initial = c(0, 0), gradient = gradient)
```

The output is not shown here because it closely resembles the previous example, where the gradient was not specified and thus a finite-difference approximation was employed. However, in higher dimensions, using the analytical gradient can notably improve both the speed and stability of the process.
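As a quick sanity check before passing an analytical gradient (this snippet is not part of `{ao}` itself; `finite_difference` and the step size `h` are illustrative), the gradient can be compared against a central finite-difference approximation:

```r
# illustrative central-difference approximation of the gradient of f at x
finite_difference <- function(f, x, h = 1e-6) {
  vapply(seq_along(x), function(i) {
    e <- replace(numeric(length(x)), i, h)  # step in coordinate i
    (f(x + e) - f(x - e)) / (2 * h)
  }, numeric(1))
}

# both should approximately agree, here at the point (0, 0)
finite_difference(himmelblau, c(0, 0))
gradient(c(0, 0))
```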
### Random parameter partitions

Another version of the AO procedure uses a new, random partition of the parameters in every iteration. This approach can enhance the convergence rate and prevent getting stuck in local optima. It is activated by setting `partition = "random"`. The randomness can be adjusted using two parameters:

- `new_block_probability` determines the probability of creating a new block when building a new partition. Its value ranges from `0` (all parameters end up in a single block) to `1` (each parameter forms its own block).
- `minimum_block_number` sets the minimum number of parameter blocks for random partitions. By default, it is set to `2` to avoid generating trivial partitions.

The random partitions are built as follows:^[`Procedure` is an internal R6 object [@chang:2022].]

```{r, partition demo}
procedure <- ao:::Procedure$new(
  npar = 10,
  partition = "random",
  new_block_probability = 0.5,
  minimum_block_number = 2
)
procedure$get_partition()
procedure$get_partition()
```

As an example of AO with random partitions, consider fitting a two-class Gaussian mixture model by maximizing the model's log-likelihood function
$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^n \log\Big( \lambda \phi_{\mu_1, \sigma_1^2}(x_i) + (1-\lambda)\phi_{\mu_2,\sigma_2^2} (x_i) \Big),$$
where the sum goes over all observations $1, \dots, n$, $\phi_{\mu_1, \sigma_1^2}$ and $\phi_{\mu_2, \sigma_2^2}$ denote the normal density for the first and second cluster, respectively, and $\lambda$ is the mixing proportion. The parameter vector to be estimated is thus $\boldsymbol{\theta} = (\mu_1, \mu_2, \sigma_1, \sigma_2, \lambda)$. As there exists no closed-form solution for the maximum likelihood estimator $\boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta})$, we need numerical optimization to find the function optimum.

The model is fitted to the following data:^[The `faithful` data set contains information about eruption times (`eruptions`) of the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. The data histogram hints at two clusters with short and long eruption times, respectively. For both clusters, we assume a normal distribution, such that we consider a mixture of two Gaussian densities for modeling the overall eruption times.]

```{r, visualize_faithful, echo = FALSE, message = FALSE, warning = FALSE, fig.align = 'center', out.width = "80%", fig.dim = c(8, 6)}
library("ggplot2")
ggplot(datasets::faithful, aes(x = eruptions)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30) +
  xlab("eruption time (min)")
```

The following function calculates the log-likelihood value given the parameter vector `theta` and the observation vector `data`:^[We restrict the standard deviations `sd` to be positive (via the exponential transformation) and `lambda` to be between 0 and 1 (via the logit transformation).]

```{r, normal mixture}
normal_mixture_llk <- function(theta, data) {
  mu <- theta[1:2]
  sd <- exp(theta[3:4])
  lambda <- plogis(theta[5])
  c1 <- lambda * dnorm(data, mu[1], sd[1])
  c2 <- (1 - lambda) * dnorm(data, mu[2], sd[2])
  sum(log(c1 + c2))
}
```

The `ao()` call for performing alternating *maximization* with random partitions looks as follows, where we only show the rounded `details` element for brevity:

```{r, ao normal mixture}
out <- ao(
  f = normal_mixture_llk,
  initial = runif(5),
  data = datasets::faithful$eruptions,
  partition = "random",
  minimize = FALSE
)
round(out$details, 2)
```

### More flexibility

The `{ao}` package offers some flexibility for performing AO.^[Do you miss a functionality? Please let us know via an [issue on GitHub](https://github.com/loelschlaeger/ao/issues/new?assignees=&labels=future&projects=&template=suggestion.md).]

#### Generalized objective functions

Optimizers in R generally require that the objective function has a single target argument, which must be in the first position. `{ao}` allows for optimization over an argument other than the first, or over more than one argument. For example, say the `normal_mixture_llk` function above has the following form and is supposed to be optimized over the parameters `mu`, `sd`, and `lambda`:

```{r, normal mixture type 2}
normal_mixture_llk <- function(data, mu, sd, lambda) {
  sd <- exp(sd)
  lambda <- plogis(lambda)
  c1 <- lambda * dnorm(data, mu[1], sd[1])
  c2 <- (1 - lambda) * dnorm(data, mu[2], sd[2])
  sum(log(c1 + c2))
}
```

In `ao()`, this scenario can be specified by setting

- `target = c("mu", "sd", "lambda")` (the names of the target arguments)
- and `npar = c(2, 2, 1)` (the lengths of the target arguments):

```{r, ao normal mixture type 2, results = "hide"}
ao(
  f = normal_mixture_llk,
  initial = runif(5),
  target = c("mu", "sd", "lambda"),
  npar = c(2, 2, 1),
  data = datasets::faithful$eruptions,
  partition = "random",
  minimize = FALSE
)
```

#### Parameter bounds

Instead of using parameter transformations in the `normal_mixture_llk()` function above, parameter bounds can be directly specified in `ao()` via the arguments `lower` and `upper`, where both can either be a single number (a common bound for all parameters) or a vector of specific bounds per parameter. Therefore, a more straightforward implementation of the mixture example would be:

```{r, normal mixture type 3, results = "hide"}
normal_mixture_llk <- function(mu, sd, lambda, data) {
  c1 <- lambda * dnorm(data, mu[1], sd[1])
  c2 <- (1 - lambda) * dnorm(data, mu[2], sd[2])
  sum(log(c1 + c2))
}
ao(
  f = normal_mixture_llk,
  initial = runif(5),
  target = c("mu", "sd", "lambda"),
  npar = c(2, 2, 1),
  data = datasets::faithful$eruptions,
  partition = "random",
  minimize = FALSE,
  lower = c(-Inf, -Inf, 0, 0, 0),
  upper = c(Inf, Inf, Inf, Inf, 1)
)
```

#### Custom parameter partition

`{ao}` allows for the specification of custom parameter partitions. For example, say the parameters of the Gaussian mixture model are supposed to be grouped by type:
$$\mathbf{x}_1 = (\mu_1, \mu_2),\ \mathbf{x}_2 = (\sigma_1, \sigma_2),\ \mathbf{x}_3 = (\lambda).$$
In `ao()`, custom parameter partitions can be specified by setting `partition = list(1:2, 3:4, 5)`, i.e., by defining a `list` where each element corresponds to a parameter block and contains a vector of parameter indices. Parameter indices can be members of any number of blocks.
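For instance, building on the bounded mixture implementation from the previous section, the type-based partition above could be passed as follows (a sketch; output omitted):

```r
# sketch: mixture fit with the type-based partition (means, standard
# deviations, mixing proportion); reuses normal_mixture_llk from above
ao(
  f = normal_mixture_llk,
  initial = runif(5),
  target = c("mu", "sd", "lambda"),
  npar = c(2, 2, 1),
  data = datasets::faithful$eruptions,
  partition = list(1:2, 3:4, 5),
  minimize = FALSE,
  lower = c(-Inf, -Inf, 0, 0, 0),
  upper = c(Inf, Inf, Inf, Inf, 1)
)
```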
#### Stopping criteria

Currently, four different stopping criteria for the AO procedure are implemented:

1. a predefined iteration limit is exceeded (via the `iteration_limit` argument),
2. a predefined time limit is exceeded (via the `seconds_limit` argument),
3. the absolute change in the function value in comparison to the last iteration falls below a predefined threshold (via the `tolerance_value` argument),
4. the change in the parameters in comparison to the last iteration falls below a predefined threshold (via the `tolerance_parameter` argument, where the parameter distance is computed via the norm specified as `tolerance_parameter_norm`).

Any number of stopping criteria can be activated or deactivated^[Stopping criteria of the AO procedure can be deactivated, e.g., by setting `iteration_limit = Inf`, `seconds_limit = Inf`, `tolerance_value = 0`, or `tolerance_parameter = 0`.], and the final output contains information about the criterion that caused termination.

#### Optimizer for solving the sub-problems

By default, the L-BFGS-B algorithm [@byrd:1995] implemented in `stats::optim` is used for solving the sub-problems numerically. However, any other optimizer can be selected by specifying the `base_optimizer` argument. Such an optimizer must be defined through the framework provided by the `{optimizeR}` package; please see [its documentation](https://loelschlaeger.de/optimizeR/) for details. For example, the `stats::nlm` optimizer can be selected by setting `base_optimizer = Optimizer$new("stats::nlm")`.
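As a small sketch reusing the Himmelblau objective from above, a full call with this base optimizer could look as follows:

```r
# sketch: alternating optimization of Himmelblau's function, with
# stats::nlm (instead of the default L-BFGS-B) solving the sub-problems
ao(
  f = himmelblau,
  initial = c(0, 0),
  base_optimizer = Optimizer$new("stats::nlm")
)
```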
#### Multiple threads

Alternating optimization can suffer from local optima. To increase the likelihood of reaching the global optimum, users can specify

- multiple initial parameter vectors,
- multiple parameter partitions,
- multiple base optimizers.

Use the `initial`, `partition`, and/or `base_optimizer` arguments to provide a `list` of possible values for each of these arguments. Each combination of initial values, parameter partitions, and base optimizers creates a separate alternating optimization thread:

```{r, normal mixture type 4}
normal_mixture_llk <- function(mu, sd, lambda, data) {
  c1 <- lambda * dnorm(data, mu[1], sd[1])
  c2 <- (1 - lambda) * dnorm(data, mu[2], sd[2])
  sum(log(c1 + c2))
}
out <- ao(
  f = normal_mixture_llk,
  initial = list(runif(5), runif(5)),
  target = c("mu", "sd", "lambda"),
  npar = c(2, 2, 1),
  data = datasets::faithful$eruptions,
  partition = list("random", "random", "random"),
  minimize = FALSE,
  lower = c(-Inf, -Inf, 0, 0, 0),
  upper = c(Inf, Inf, Inf, Inf, 1)
)
out$values
```

In the case of multiple threads, the output changes slightly in comparison to the standard case. It is a `list` with the following elements:

- `estimate` is the optimal parameter vector over all threads.
- `estimates` is a `list` of optimal parameters in each thread.
- `value` is the optimal function value over all threads.
- `values` is a `list` of optimal function values in each thread.
- `details` combines the details of the single threads and has an additional column `thread` with an index for the different threads.
- `seconds` gives the computation time in seconds for each thread.
- `stopping_reason` gives the termination message for each thread.
- `threads` gives details on how the different threads were specified.

By default, threads run sequentially. However, since they are independent, they can be parallelized. To enable parallel computation, use the [`{future}` framework](https://future.futureverse.org/). For example, run the following *before* the `ao()` call:

```r
future::plan(future::multisession, workers = 4)
```

When using multiple threads, setting `verbose = TRUE` to print tracing details during alternating optimization is not supported. However, the progress of threads can still be tracked using the [`{progressr}` framework](https://progressr.futureverse.org/). For example, run the following *before* the `ao()` call:

```r
progressr::handlers(global = TRUE)
progressr::handlers(
  progressr::handler_progress(":percent :eta :message")
)
```

## References