A Gaussian Mixture Model Is Not a Mixed Membership Model
When we think about a Gaussian mixture model (GMM), we often confuse conditional probability with membership. In a GMM, each observation \(x_i\) is associated with a latent membership \(z_i\). Notice that I use lowercase letters here to indicate that these are values, not random variables. However, people often refer to the “responsibility”, i.e. \(P(Z = k \mid X = x_i)\), \(k = 1, ..., K\), as if the observation \(x_i\) actually had a mixed membership: for group \(k\), \(x_i\) would belong to it with proportion \(P(Z = k \mid X = x_i)\).
This is wrong. The random variable \(X\) follows a finite mixture of Gaussian distributions,
\[{X \sim \sum_{k=1}^K p_k N_k},\]
where \(p_k\) is the mixing weight and \(N_k\) the \(k\)-th Gaussian component. The actual observation \(x_i\), however, carries a membership in exactly one of the mixture components. To be specific:
- Marginally, \(X \sim \sum_{k=1}^K p_k N_k\).
- Conditioned on \(Z = z\), we have \(X \mid Z = z \sim N_z\), i.e. \(N(\mu_z, \sigma_z^2)\).
- Each sample is a pair \((x_i, z_i)\). From a generative-model perspective, each sample is generated in two steps:
  - Sample \(z\) from a distribution over the \(K\) components, e.g. Multinomial(\(p\)) with a single trial.
  - Sample \(x\) from \(X \mid Z = z\).
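The two-step generative scheme above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `sample_gmm` and the weights, means, and standard deviations are hypothetical values chosen for the example, not quantities from this post.

```python
import random

def sample_gmm(n, weights, means, sds, seed=0):
    """Draw n (x_i, z_i) pairs from a 1-D Gaussian mixture in two steps."""
    rng = random.Random(seed)
    labels = list(range(len(weights)))
    samples = []
    for _ in range(n):
        # Step 1: draw the latent membership z from Categorical(weights).
        z = rng.choices(labels, weights=weights, k=1)[0]
        # Step 2: draw x from the z-th Gaussian component, i.e. from X | Z = z.
        x = rng.gauss(means[z], sds[z])
        samples.append((x, z))
    return samples

# Hypothetical parameters: two components with mixing weights 0.4 and 0.6.
samples = sample_gmm(1000, weights=[0.4, 0.6], means=[-2.0, 3.0], sds=[1.0, 0.5])
```

Every element of `samples` carries exactly one integer label `z`, which is the point of the argument: the mixing weights describe how the labels are distributed across the population, not how a single observation is split between components.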
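This is what forms the Gibbs sampling scheme for estimating the posterior distribution \(P(Z, \theta \mid x_1, ..., x_n)\), where \(\theta\) collects the model parameters and \(\theta_{-j}\) denotes all parameters except \(\theta_j\):
- Sample \(z_i\) from \(p(z_i \mid \theta, x_i)\), for each \(i\),
- Sample \(\theta_j\) from \(p(\theta_j \mid \theta_{-j}, z_1, ..., z_n, x_1, ..., x_n)\), for each \(j\).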
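As a rough sketch of this scheme, the code below runs a Gibbs sampler for a 1-D mixture under simplifying assumptions that are mine, not the post's: a known standard deviation `sd` shared by all components, a Normal(`m0`, `s0`²) prior on each mean, and a symmetric Dirichlet(`alpha`) prior on the mixing weights. The function name `gibbs_gmm` and all default values are hypothetical. It returns only the final state of the chain; a real analysis would keep the draws after a burn-in period.

```python
import math
import random

def gibbs_gmm(xs, K=2, sd=1.0, m0=0.0, s0=10.0, alpha=1.0, n_iter=200, seed=0):
    """Gibbs sampler sketch: known common sd, Normal prior on means,
    Dirichlet prior on weights (all assumptions made for this example)."""
    rng = random.Random(seed)
    n = len(xs)
    labels = list(range(K))
    srt = sorted(xs)
    # Initialise the means at spread-out data quantiles, weights as uniform.
    mus = [srt[(2 * k + 1) * n // (2 * K)] for k in labels]
    ps = [1.0 / K] * K
    zs = [0] * n
    for _ in range(n_iter):
        # Step 1: sample each z_i from p(z_i | theta, x_i).
        # The Gaussian normalising constant cancels because sd is shared.
        for i, x in enumerate(xs):
            logw = [math.log(ps[k]) - 0.5 * ((x - mus[k]) / sd) ** 2 for k in labels]
            top = max(logw)
            w = [math.exp(v - top) for v in logw]
            zs[i] = rng.choices(labels, weights=w, k=1)[0]
        # Step 2a: sample each mean from its conjugate Normal posterior.
        for k in labels:
            members = [x for x, z in zip(xs, zs) if z == k]
            prec = 1.0 / s0 ** 2 + len(members) / sd ** 2
            mean = (m0 / s0 ** 2 + sum(members) / sd ** 2) / prec
            mus[k] = rng.gauss(mean, 1.0 / math.sqrt(prec))
        # Step 2b: sample the weights from Dirichlet(alpha + counts),
        # using normalised Gamma draws.
        g = [rng.gammavariate(alpha + zs.count(k), 1.0) for k in labels]
        ps = [v / sum(g) for v in g]
    return zs, mus, ps

# Synthetic data from two well-separated components (hypothetical values).
data_rng = random.Random(1)
xs = [data_rng.gauss(-3.0, 1.0) for _ in range(150)] + \
     [data_rng.gauss(3.0, 1.0) for _ in range(150)]
zs, mus, ps = gibbs_gmm(xs)
```

Note that at every iteration each `zs[i]` is a single integer label. The responsibilities only appear as the sampling weights `w` used to draw that label.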
Think of a simple example. Suppose you have height data for a group of people, and assume that height follows a mixture of two populations’ heights, male and female. For one person we observe a height of 5’. That person must belong to exactly one of the two components, not to a mixture of the two. But we can ask a different question: what is the probability that someone is male given a height of 5’? The answer is the responsibility, \(P(Z = \text{male} \mid X = 5)\).
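For reference, the responsibility in this example follows from Bayes’ rule. Writing \(p_{\text{male}}\) and \(p_{\text{female}}\) for the mixing weights and \(N_{\text{male}}(x)\) and \(N_{\text{female}}(x)\) for the two component densities evaluated at \(x\) (notation introduced here only for illustration), we have
\[P(Z = \text{male} \mid X = x) = \frac{p_{\text{male}} N_{\text{male}}(x)}{p_{\text{male}} N_{\text{male}}(x) + p_{\text{female}} N_{\text{female}}(x)}.\]
This quantity expresses our uncertainty about which component generated \(x\); it does not say that the person is partly male and partly female.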
The lesson here concerns what happens when you fit a Gaussian mixture model to a set of data and then want to estimate the latent memberships. Because each \(x_i\) is one Monte Carlo sample from the mixture distribution, by definition of the generative model it must have exactly one membership. This is a hard clustering, not a soft clustering. For an observed value \(x_i\), the estimated latent membership is \(\hat{z}_i = \arg\max_k P(Z = k \mid X = x_i)\).
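A minimal sketch of this hard assignment, assuming the mixture parameters have already been estimated: compute the responsibilities with Bayes’ rule and take the component with the largest one. The helper names `responsibilities` and `map_membership` are hypothetical, and the height parameters (in inches) are made-up values for illustration, not real population estimates.

```python
import math

def responsibilities(x, weights, means, sds):
    """Return P(Z = k | X = x) for each component k via Bayes' rule."""
    joint = [
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        for w, m, s in zip(weights, means, sds)
    ]
    total = sum(joint)
    return [j / total for j in joint]

def map_membership(x, weights, means, sds):
    """Return the index of the component with the largest responsibility."""
    r = responsibilities(x, weights, means, sds)
    return max(range(len(r)), key=r.__getitem__)

# Hypothetical parameters in inches: component 0 = female, component 1 = male.
weights, means, sds = [0.5, 0.5], [64.5, 70.0], [2.5, 3.0]
r = responsibilities(60.0, weights, means, sds)    # a height of 5' is 60 inches
z_hat = map_membership(60.0, weights, means, sds)  # a single estimated label
```

Here `r` is a probability vector describing uncertainty about the label, while `z_hat` is the single estimated membership for that observation.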
The key question to remember is whether we treat the membership as an unknown fixed quantity or as a random quantity in the model.