# DSI Researchers Study an Influential Algorithm

The Expectation Maximization (EM) algorithm is a primary workhorse for fitting statistical models to large and complex data sets. Modern statistical models, called latent variable models, reveal hidden patterns that explain complex observations found in data, and have been used to understand community structure in social networks through observed links between people, as well as to uncover human ancestry through genetic markers in present-day individuals. But despite the widespread use of EM for fitting these models, relatively little is known about when the EM algorithm actually fulfills its intended purpose. All previous results about EM have come with extra “fine print” that can leave users of the algorithm doubting its effectiveness.

Now, researchers from the Data Science Institute have proved for the first time that EM does work as intended for at least one non-trivial latent variable model: a mixture of two multivariate normal distributions.

The researchers published their findings in a paper, “Global Analysis of Expectation Maximization for Mixtures of Two Gaussians,” in the 2016 proceedings of the Conference on Neural Information Processing Systems (NIPS). The result is important because it settles long-standing questions of convergence and consistency for EM in a natural latent variable model, and it affirms empirical observations made by practitioners using EM.

DSI professors Daniel Hsu and Arian Maleki authored the paper with computer science doctoral student Ji Xu, who presented the findings at the NIPS conference. Hsu, an assistant professor of computer science, and Maleki, an assistant professor in the Department of Statistics, jointly advise Xu, who is the first author of the paper. Both Hsu and Maleki are affiliates of the Data Science Institute.

The algorithm, introduced in a 1977 paper that’s been cited more than 50,000 times, is a computational procedure for fitting latent variable models to data. IEEE has ranked EM as one of the 10 most influential algorithms in data mining. EM begins with an initial “guess” for the model that fits the data, and then iteratively revises this guess to better fit the data. The hope is that this process eventually yields the best-fit model, called the maximum-likelihood solution. Maximum likelihood has been studied in statistics for more than a century and in many ways is optimal. Whether EM actually returns the maximum-likelihood solution, however, depends on the initial guess, says Hsu. In general, certain initial guesses can cause EM to converge to a solution far from the maximum-likelihood one. Practitioners thus often resort to running the algorithm repeatedly from different initial guesses, in the hope of chancing upon a guess that leads to a good solution.
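The guess-and-refine loop described above can be sketched in a few lines of code. The example below is a minimal illustration, not the paper's multivariate setting: it fits a balanced mixture of two unit-variance, one-dimensional normals, and all names and parameters are illustrative choices rather than anything from the study.

```python
# Minimal sketch of EM for a balanced mixture of two unit-variance
# univariate normals (an illustrative simplification, not the paper's
# multivariate model). Only the two means are estimated.
import math
import random

def em_two_gaussians(data, mu1, mu2, iters=100):
    """Iteratively refine initial guesses mu1, mu2 for the two means."""
    for _ in range(iters):
        # E-step: posterior probability that each point came from component 1
        # (the balanced 1/2 mixing weights cancel in the ratio).
        resp = []
        for x in data:
            p1 = math.exp(-0.5 * (x - mu1) ** 2)
            p2 = math.exp(-0.5 * (x - mu2) ** 2)
            resp.append(p1 / (p1 + p2))
        # M-step: update each mean as a responsibility-weighted average.
        w1 = sum(resp)
        w2 = len(data) - w1
        mu1 = sum(r * x for r, x in zip(resp, data)) / w1
        mu2 = sum((1 - r) * x for r, x in zip(resp, data)) / w2
    return mu1, mu2

# Synthetic data: two well-separated components centered at -2 and +2.
random.seed(0)
data = ([random.gauss(-2, 1) for _ in range(500)]
        + [random.gauss(2, 1) for _ in range(500)])
m1, m2 = em_two_gaussians(data, mu1=-1.0, mu2=1.0)
```

On this well-separated synthetic data, the refined means land near the true centers of -2 and +2; starting from a poor initial guess (e.g. both means on the same side of the data) is exactly the failure mode that motivates restarting from many initializations.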

Nonetheless, the main result of the study, says Hsu, is that for a particular latent variable model—a mixture of two multivariate normal distributions—EM does in fact converge to a solution close to the maximum-likelihood solution for almost all initial guesses.

There are still many limitations of the theorem, explains Hsu. First, the analysis does not consider what happens to EM when the data is not well-fit by the model under consideration. If there are outliers in the data, for instance, the theorem is no longer applicable. Second, it is known that an analogous theorem is not true for models even slightly more complicated than the one under consideration. Indeed, another group of researchers proved that EM almost always fails to return a good solution for mixtures of three multivariate normal distributions. Thus, there is still much work to be done to fully understand the behavior of this important algorithm.

“Our main theorem gives users of EM some cause for optimism,” says Hsu. “For the latent variable model we consider, we proved almost all initial guesses lead to good solutions. So in cases where the model is indeed a good fit for the data, finding that fit can be easy to do with EM.”

**--By Robert Florida**