k-means clustering approach to an integral evaluation

k-means clustering approache to calculating multi-dimensional integrals as an alternative to Monte Carlo.
Author

Yevgen Ryeznik

Published

June 14, 2019

Problem formulation

Let us consider an integral

(1)I=Ωf(x)π(x)dx,

where f(x):ΩRn, π(x):ΩRn is a probability density function, xΩRn.

Using Monte Carlo (MC) approach, the integral can be evaluates as

(2)I1Ni=1Nf(xi),

where {xi}i=1N are sample points from the distribution π(x). The lager the sample size N, the more accurate the approximation. However, if N is sufficiently large, then the integral evaluation procedure becomes time-consuming.

K-means clustering approach to integration

An alternative approach can be applied, which is base on the k-means clustering. The algorithm is the following:

  1. Sample N points from π(x) distribution.
  2. Apply k-means clustering to the sample obtained. An output of the clustering procedure is:
    • {Xj(0)}j=1K – clusters’ centers.
    • {wj}j=1K – proportions of sample points in jthe cluster.
  3. Calculate an approximated value of the integral as

(3)Ij=1Kwjf(Xj(0)).

R code implementing k-means-based algorithm

# 
# fcn -- function to be integrated.
#   X -- array (N x n) of sample points from a corresponding 
#        distribution.
#   K -- number of clusters.
integrate_kmeans <- function(fcn, X, K){
  km <- kmeans(X,  K)         # we use R function kmeans
  Xc <- km$centers            # centers of obtained clusters
   w <- km$size/sum(km$size)  # proportions of sample points in clusters
  
  F <- purrr::map_dbl(seq_len(K), ~ fcn(Xc[., ])) 
  I <- sum(w*F)
  
  return(I)
}

R code implementing Monte Carlo-based algorithm

# fcn -- function to be integrated.
#   X -- array (N x n) of sample points from a corresponding 
#        distribution.
integrate_mc <- function(fcn, X){
  F <- purrr::map_dbl(seq_len(K), ~ fcn(X[., ])) 
  I <- mean(F)
  
  return(I)
}

An Example

Let us evaluate a two-dimensional integral

(4)I=x12+x224e(x12+x22)4dx1dx2,

which can be rewritten as

(5)I=4πχ(x1,x2)14πe(x12+x22)4dx1dx2==f(x1,x2)π(x1,x2)dx1dx2,

where

f(x1,x2)=4πχ(x1,x2)={4π,x12+x2240,otherwise,

and

π(x)=14πe(x12+x22)4

is a probability density function of a normal distribution N(μ=(00),Σ=(2002)).

fcn <- function(x){
  result <- 4*pi*ifelse(x[1]^2+x[2]^2 <= 4, 1, 0)
  return(result)
}

n <- 2      # number of dimensions
N <- 10000  # sample size from normal distribution
K <- 8      # number of clusters for k-means clustering

# sampling from N(mu, Sigma)
mu <- c(0, 0)
Sigma <- diag(c(2, 2))

set.seed(3141592)
X <- MASS::mvrnorm(N, mu, Sigma)

# integration by using MC
mc_value <- integrate_mc(fcn, X)

# integration by using k-means
kmeans_value <- integrate_kmeans(fcn, X, K)

# true value
true_value <- 4*pi*(1-exp(-1))

tibble::tibble(
  true_value, mc_value, kmeans_value
)                    
# A tibble: 1 × 3
  true_value mc_value kmeans_value
       <dbl>    <dbl>        <dbl>
1       7.94     7.85         7.73