Author(s): Kaitai Dong

Originally published on Towards AI.

Figure 1: Gaussian mixture model illustration [Image by AI]

Introduction

In a time where deep learning (DL) and transformers steal the spotlight, it’s easy to forget about classic algorithms like K-means, DBSCAN, and GMM. But here’s a hot take: for anyone tackling real-world clustering and anomaly detection challenges, these statistical workhorses remain indispensable tools with surprising staying power.

Consider the everyday clustering puzzles: customer segmentation, social network analysis, or image segmentation. K-means has been used to solve these problems for decades with its simple centroid-based approach. When data forms irregular shapes, DBSCAN steps in with its density-based algorithm to identify non-convex clusters that leave K-means bewildered.

But real-world data rarely forms neat, separated bubbles. Enter the Gaussian Mixture Model (GMM) and its variants! GMMs acknowledge the fundamental uncertainty in cluster assignments. By modeling the probability density of normal behavior, they can identify observations that don’t fit the expected pattern without requiring labeled examples.

So before chasing the latest neural architecture for your clustering or segmentation task, consider the statistical classics such as GMMs. Many people can confidently talk about how K-means works, but I’d bet good money that not many have the same confidence when it comes to GMMs. This article will discuss the math behind GMMs and their variants in an understandable way (I will try my best!), and showcase why they deserve more attention for your next clustering task.

Remember this. Classics never make a comeback. They wait for that perfect moment to take the spotlight from overdone, tired trends.

What is a Gaussian Mixture Model?

A Gaussian mixture is a function composed of several Gaussian distributions, each identified by k ∈ {1,…, K}, where K is the number of clusters in our dataset, which you must know in advance. Each Gaussian distribution k in the mixture contains the following parameters:

A mean μ that defines its center.

A covariance matrix Σ that describes its shape and orientation. This is equivalent to the axes and orientation of an ellipsoid in a multivariate scenario.

A mixing coefficient π that defines the weight of the Gaussian component, where each π ≥ 0 and the π values across all K components sum to 1.

Mathematically, the mixture density can be written as:

p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)

where p(x) represents the probability density at point x, and N(x|μ, Σ) is the multivariate Gaussian density with mean μ and covariance matrix Σ.

Equations always look scary. But let’s take a step back and look at this multivariate Gaussian density function N(x|μ, Σ) and the dimension of each parameter in it. Assume the dataset includes N = 500 three-dimensional data points (D = 3). Then the dataset x is essentially a 500 × 3 matrix, μ is a 1 × 3 vector, and Σ is a 3 × 3 matrix. The output of the Gaussian density function will be a 500 × 1 vector.

When working with a GMM, we face a circular problem: to know which Gaussian cluster each data point belongs to, we need to know the parameters of each Gaussian (means, covariances, weights). But to estimate these parameters correctly, we need to know which data points belong to each Gaussian. To break this cycle, enter the Expectation-Maximization (EM) algorithm, which makes educated guesses and then refines them iteratively.
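Before we get to estimating the parameters, here is a minimal sketch (my own illustration, not code from the article) of what the mixture density above looks like numerically. It evaluates p(x) on a synthetic 500 × 3 dataset with K = 3 components whose parameter values are made up purely to show the shapes involved:

```python
# Evaluate the GMM density p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)
# on a synthetic dataset with N = 500 points, D = 3 dimensions, K = 3 components.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
N, D, K = 500, 3, 3

X = rng.normal(size=(N, D))                       # dataset: 500 x 3 matrix

pis = np.array([0.5, 0.3, 0.2])                   # mixing coefficients, sum to 1
mus = rng.normal(size=(K, D))                     # each mean is a 1 x 3 vector
Sigmas = np.stack([np.eye(D) for _ in range(K)])  # each covariance is 3 x 3

# N x K matrix whose columns are the per-component densities N(x_i | mu_k, Sigma_k)
densities = np.column_stack([
    multivariate_normal(mean=mus[k], cov=Sigmas[k]).pdf(X) for k in range(K)
])
p_x = densities @ pis                             # weighted sum -> 500 x 1 vector

print(p_x.shape)                                  # (500,)
```

The mixture density is nothing more exotic than a weighted sum of K Gaussian densities, which is exactly what the matrix-vector product above computes.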
Parameter estimation with the EM algorithm

The EM algorithm determines the optimal values for the parameters of a GMM through the following steps:

Step 1: Initialize — Start with random guesses for the parameters (μ, Σ, π) of each Gaussian cluster.

Step 2: Expectation (E-step) — Calculate how much each point “belongs” to each Gaussian cluster by computing a set of “responsibilities” for each data point, which represent the probabilities that the data point comes from each cluster.

Step 3: Maximization (M-step) — Update each Gaussian cluster using all the instances in the dataset, with each instance weighted by the estimated probability (a.k.a. responsibility) that it belongs to that cluster. Specifically, the new means are the weighted averages of all data points, where the weights are the responsibilities. The new covariances are the weighted spread around each new mean. Finally, the new mixing weights are the fraction of the total responsibility each component receives. Note that each cluster’s update is mostly influenced by the instances it is most responsible for.

Step 4: Repeat — Go back to Step 2 with these updated parameters and continue until the changes become minimal (convergence).

Often, people get confused by the M-step because a lot of terms are thrown in at once. I will use the previous example (500 3-D data points with 3 Gaussian clusters) to break it down into more concrete terms.

For updating the means, we take a weighted average where each point contributes according to its responsibility value for the corresponding Gaussian cluster. Mathematically, for the kth Gaussian cluster,

\mu_k^{\text{new}} = \frac{\sum_i r_{ik}\, x_i}{\sum_i r_{ik}}

where r_ik is the responsibility of cluster k for point x_i.

For updating the covariances, we use a similar weighted approach. For each point, calculate how far it is from the new mean, and then multiply this deviation by its transpose to get a matrix. Next, weight this matrix by the point’s responsibility and sum these weighted matrices across all points. Finally, divide by the total responsibility for that cluster.

For updating the mixing weights, we simply sum up all the responsibilities for cluster k and divide by the total number of data points.

Let’s say for Gaussian cluster 2:

The sum of responsibilities is 200 (out of 500 points).

The weighted sum of points is (400, 600, 800).

The weighted sum of squared deviations gives a certain covariance matrix.

Then:

New mean for cluster 2 = (400, 600, 800) / 200 = (2, 3, 4)

New mixing weight = 200 / 500 = 0.4

New covariance = (weighted sum of deviations) / 200

Hopefully it makes a lot more sense now! (These updates appear in code in the sketch below.)

Clustering with GMM

Now that I have an estimate of the location, size, shape, orientation, and relative weights of each Gaussian cluster, GMM can easily assign […]
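To tie the steps above together, here is a minimal NumPy sketch (again my own illustration under the same 500 × 3, K = 3 assumptions, not the article’s code) of one full EM iteration followed by the hard cluster assignment. In practice you would loop the E- and M-steps until the log-likelihood stops improving:

```python
# One EM iteration for a GMM, plus hard assignment of points to clusters.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
N, D, K = 500, 3, 3
X = rng.normal(size=(N, D))                              # synthetic 500 x 3 dataset

# Step 1: random initial guesses for (mu, Sigma, pi)
mus = X[rng.choice(N, size=K, replace=False)]            # pick K points as initial means
Sigmas = np.stack([np.eye(D) for _ in range(K)])
pis = np.full(K, 1.0 / K)

# Step 2 (E-step): responsibilities r[i, k] = p(cluster k | x_i)
weighted = np.column_stack([
    pis[k] * multivariate_normal(mus[k], Sigmas[k]).pdf(X) for k in range(K)
])
resp = weighted / weighted.sum(axis=1, keepdims=True)    # each row sums to 1

# Step 3 (M-step): responsibility-weighted updates
Nk = resp.sum(axis=0)                                    # total responsibility per cluster
mus = (resp.T @ X) / Nk[:, None]                         # weighted average of the points
for k in range(K):
    diff = X - mus[k]                                    # deviations from the new mean
    Sigmas[k] = (resp[:, k, None] * diff).T @ diff / Nk[k]
pis = Nk / N                                             # fraction of total responsibility

# Clustering: assign each point to the component with the highest responsibility
# (after convergence you would recompute resp with the final parameters first).
labels = resp.argmax(axis=1)
print(labels[:10], pis)
```

Note how each update divides by Nk, the total responsibility for that cluster, exactly as in the worked example where everything for cluster 2 was divided by 200.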