Foundations of Data Science

DAVID BLEI, Professor of Computer Science and Statistics
Photo by Matt Lenz

Consider the challenge of the modern-day researcher: Potentially millions of pages of information dating back hundreds of years are available to be read from a computer screen. How does a simple Internet search deliver appropriate findings?

The answer is topic modeling, a mathematical approach to uncovering the hidden themes in collections of documents. David Blei, professor of computer science and statistics, led the groundbreaking research that resulted in the development of the Latent Dirichlet Allocation (LDA) model, a topic model capable of uncovering hidden topics and semantic themes across billions of documents via language processing. By co-developing LDA, Blei shaped subsequent research and development in topic modeling and machine learning.
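To give a rough sense of what a topic model does, the Python sketch below fits LDA to a handful of toy documents. It uses a collapsed Gibbs sampler, which is one common way to fit such a model and not necessarily the procedure used in the original research. The function name fit_lda, the tiny corpus, and the values chosen for alpha, beta, and the number of topics are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def fit_lda(docs, num_topics=2, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative sketch only).

    docs is a list of documents, each a list of word tokens.
    Returns per-document topic mixtures and the top words of each topic.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    # Count tables updated as words are assigned to topics.
    doc_topic = [[0] * num_topics for _ in docs]          # words per topic, per document
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                        # total words per topic

    # Start by assigning every word occurrence to a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment of this word occurrence.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Weight each topic by how much this document uses it
                # and how much the topic favors this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Record the newly sampled assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed estimate of each document's mixture over topics.
    doc_mix = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    # The three most frequent words assigned to each topic.
    top_words = [
        sorted(vocab, key=lambda w: -topic_word[t][w])[:3]
        for t in range(num_topics)
    ]
    return doc_mix, top_words


# Hypothetical mini-corpus: two documents about genetics, two about search.
docs = [
    "gene dna sequence genome gene".split(),
    "dna genome gene population sequence".split(),
    "search query web document search".split(),
    "web document query index search".split(),
]

doc_mix, top_words = fit_lda(docs)
for mix in doc_mix:
    print([round(p, 2) for p in mix])
print(top_words)
```

Each printed row is one document's estimated mixture over the two topics, and the final line lists the words most strongly associated with each topic. On a corpus this small the result depends heavily on the random seed, so the output should be read as a demonstration of the mechanics rather than a meaningful analysis.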

“We were fortunate that LDA has become influential for analyzing all kinds of digital information, including text, images, user data and recommendation systems, social networks, and survey data,” says Blei. “Even in population genetics, LDA-like ideas were developed independently and are used to uncover patterns in collections of individuals’ gene sequences.”

Blei, whose challenges and rewards both come from solving large-scale machine learning problems, appreciates the interconnectivity of disciplines within his field.

“Columbia has an extremely strong faculty in machine learning and statistics, and across other departments, such as in medicine and neuroscience. This makes it an ideal university for my group to pursue our interdisciplinary research agenda,” he says. “I like working on interesting applied problems about data—in the sciences, social sciences, and humanities—and letting real-world problems drive the methodological advances that we develop.”

Blei’s interest in data science and machine learning was reinforced by the work of Herbert Robbins, a former professor of mathematical statistics at Columbia. “Robbins developed an algorithm that allows the machine learning community to scale up algorithms to massive data,” says Blei, who predicts that one of the next breakthroughs in modern machine learning and statistics will involve observational data, that is, data that are observed but not collected as part of a carefully controlled experiment.

“Previously, machine learning has focused on prediction; observational data has been difficult to understand and exploring that data was a fuzzily defined activity,” Blei explains. “Now, massive collections of observational data are everywhere—in government, industry, natural science, and social science—and practitioners need to be able to quickly explore, understand, visualize, and summarize them. But to do so, we need new statistical and machine learning tools that can help reveal what such data sets say about the real world.”

Prior to joining Columbia Engineering in July 2014, Blei was an associate professor in the Department of Computer Science at Princeton University. He is a faculty affiliate of Columbia’s Data Science Institute. Blei is a recipient of the 2013 ACM-Infosys Foundation Award in Computing Sciences and a winner of an NSF CAREER Award, an Alfred P. Sloan Fellowship, and the NSF Presidential Early Career Award for Scientists and Engineers. His accolades also include the Office of Naval Research Young Investigator Award and the New York Academy of Sciences Blavatnik Award for Young Scientists.

BSc, Brown University, 1997; PhD, University of California, Berkeley, 2004

-by Dave Meyers

Source: http://engineering.columbia.edu/david-blei-embracing-science-teaching-to…