As data grows in volume and complexity, so does the challenge of making sense of it. Uncovering the hidden structures within high-dimensional datasets is essential for drawing meaningful conclusions—from diagnosing diseases to understanding political behavior.

Zhongyuan Lyu, a postdoctoral research scientist at the Data Science Institute (DSI) at Columbia University, develops new statistical models to reveal meaningful patterns embedded in high-dimensional datasets. These datasets, characterized by thousands of variables or dimensions, present unique challenges.

Lyu designs methods to identify the underlying structure in noisy, high-dimensional data, making vast datasets easier to interpret and reducing computational demand. His methods could have applications across genomics, neuroscience, political science, and beyond.

“I define myself as a statistician, using a background in real applications to develop methods and theories,” says Lyu, who is mentored by Professors Yuqi Gu (Statistics) and Kaizheng Wang (Industrial Engineering and Operations Research). “My work identifies patterns and latent structures in order to simplify and streamline the analysis of multidimensional data. In short, I build mathematical and statistical tools to peel back the layers of high-dimensional data and reveal the meaningful structure underneath.”

Theory-Driven Models, Application-Aware Solutions

Lyu earned his PhD in mathematics from the Hong Kong University of Science and Technology in 2023 before joining the Data Science Institute’s unique postdoctoral program, where scholars pursue interdisciplinary research under the mentorship of faculty at different Columbia schools. Lyu’s work sits at the intersection of statistical methodology, applied mathematics, and data science, fields he blends to address the fundamental challenges of analyzing complex, multidimensional datasets.   

Lyu focuses on methods that preserve the structure of large, complex datasets while reducing their computational burden.

“I study how to make sense of very large and complex datasets that can be organized like matrices (grids of numbers) or tensors (higher-dimensional versions of matrices),” Lyu says. “These datasets often have hidden patterns or structures—what we call ‘latent structures.’ My work focuses on developing mathematical tools and theories to uncover and use these hidden structures to better understand the data, especially when it’s high-dimensional or noisy.”

The aim is a set of statistical methods that simplify these datasets, cut computational costs, and make analysis more efficient across disciplines.

From Genomics to Recommendation Systems

A defining principle of Lyu’s work is methodological generality—building models that apply across domains, from biology to social science.

Insights derived from one field, such as genomics, may inform solutions in another, such as recommendation systems or political analysis.

“My goal is to formalize problems into statistical problems, and then develop methodology that is general enough to be applied to other contexts,” says Lyu.

New Models for Complex Clustering 

Lyu and his collaborators recently developed a new model and algorithm for detecting clusters in complex, heterogeneous datasets. They successfully tested their approach on three real-world datasets: U.S. congressional voting records, genetic variation data, and single-cell DNA sequencing data involving more than 30,000 samples and over 300,000 dimensions of genomic data.

Their paper on this work, “Degree-heterogeneous Latent Class Analysis for High-dimensional Discrete Data,” was recently published in the Journal of the American Statistical Association.

Adaptive Transfer Clustering Across Domains

Another project explores how information and patterns identified in one dataset can offer insights for another. Lyu’s team developed an adaptive transfer clustering (ATC) algorithm, an approach that applies statistical insights gained in one environment to another in the same domain. This work is described in the preprint “Adaptive Transfer Clustering: A Unified Framework,” which includes simulations and real-data experiments that confirm the method’s effectiveness.

Looking Ahead

After Lyu concludes his research at Columbia in summer 2025, he will join the University of Sydney Business School as Assistant Professor of Business Analytics. “I’m grateful to DSI for the chance to collaborate across disciplines,” says Lyu. “And I also look forward to bringing that spirit of teamwork and innovation to Sydney, where I can continue working with students and colleagues to tackle new data challenges.”