Data Science Institute (DSI) and Irving Institute for Cancer Dynamics postdoctoral research scientist Mingzhang Yin focuses on problems related to machine learning, Bayesian statistics, and causal inference. As a future professor, he also excels at explaining such complex subject matter clearly.

“Causal inference is one of the most important ideas in statistics. It explores the difference between the observed world and the counterfactual world,” Yin explained. “The observed world is the one we live in now, where action leads to outcome. The counterfactual world considers: what if this action hadn’t happened? What if, for example, I hadn’t taken the vaccine? Would I have gotten sick? Causal inference investigates the difference between the two worlds.”

Machine learning excels at capturing complex associations in data, and it is moving beyond tasks like image recognition into more complex decision-making tasks in medicine, finance, hiring, autonomous vehicles, policy, and beyond. With that shift, new issues involving transparency, interpretability, and robustness are emerging, and these hinge on causal relationships, not just associations.

Yin’s current research involves developing new methods and building new tools at the intersection of statistical machine learning and causal inference. He collaborated with David Blei, who is a professor of statistics and computer science at Columbia University and a DSI member, and other researchers from the Blei lab to publish two recent papers that demonstrate this approach.

Optimization-based Causal Estimation from Heterogenous Environments provides a new optimization approach to causal estimation, a way to distinguish between spurious association and genuine causation. The work aims to determine which covariates cause an outcome and how strong each causal effect is.

“Spurious association is like the connection between chocolate consumption and Nobel prize winners: one is predictive of the other, but it is unlikely that one causes the other,” Yin said. “Our approach is to see how these relationships play out in different environments or contexts to try to identify a common model that underlies these heterogeneous environments.”

The authors also describe the theoretical foundations of their approach and demonstrate its effectiveness on both simulated and real datasets.  
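
As a toy illustration of the intuition in Yin's quote, and not the method from the paper, the Python sketch below simulates two hypothetical environments. Everything in it is an assumption made for the example: the variable names, the causal coefficient of 2.0, the environment-specific strengths 0.5 and 3.0, and the use of a simple one-variable least-squares slope. The outcome depends on the cause in the same way in both environments, while the spurious variable is a downstream effect of the outcome whose strength changes with the environment.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope from regressing ys on xs (single predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def simulate_environment(spurious_strength, n=5000, seed=0):
    """Draw one hypothetical environment.

    The outcome is always 2.0 * cause + noise (the shared causal model).
    The spurious variable is driven by the outcome, with a strength that
    is allowed to differ from one environment to the next.
    """
    rng = random.Random(seed)
    cause, spurious, outcome = [], [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = 2.0 * x + rng.gauss(0, 1)
        s = spurious_strength * y + rng.gauss(0, 1)
        cause.append(x)
        spurious.append(s)
        outcome.append(y)
    return cause, spurious, outcome


# Fit the same two one-variable regressions in each environment.
slopes = {}
for name, strength, seed in [("env_a", 0.5, 1), ("env_b", 3.0, 2)]:
    cause, spurious, outcome = simulate_environment(strength, seed=seed)
    slopes[name] = (ols_slope(cause, outcome), ols_slope(spurious, outcome))
    print(name,
          "slope on cause:", round(slopes[name][0], 2),
          "slope on spurious variable:", round(slopes[name][1], 2))
```

In this simulation the slope on the cause stays close to 2 in both environments, while the slope on the spurious variable differs markedly between them. A relationship that holds steady across environments is a candidate for the common model Yin describes; one that shifts is more likely a spurious association.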

The other paper, Conformal Sensitivity Analysis for Individual Treatment Effects, explores ways to better understand the effect of treatment on an individual.

“Common causal effects are the ones averaged over a population, but people may want to know what their own, individual results could be,” Yin said.

This research proposes a sensitivity analysis that provides a possible range for an individual treatment effect. The authors evaluate the method on synthetic data and illustrate its application in an observational study.
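
The distinction Yin draws between population-level and individual effects can be written in standard potential-outcomes notation. The symbols below are the usual textbook convention and are an assumption here, since the article itself does not give notation: $Y_i(1)$ is the outcome person $i$ would have with the treatment and $Y_i(0)$ the outcome without it.

```latex
% Individual treatment effect for person i
\tau_i = Y_i(1) - Y_i(0)

% Average treatment effect over the population
\mathrm{ATE} = \mathbb{E}\left[\, Y(1) - Y(0) \,\right]
```

Only one of $Y_i(1)$ and $Y_i(0)$ is ever observed for a given person, so $\tau_i$ cannot be read directly from the data. That is one reason a range of possible values, rather than a single number, is a natural way to report an individual effect.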

Yin completed his undergraduate degree in mathematics at Fudan University in Shanghai and his Ph.D. in statistics at the University of Texas at Austin. He came to Columbia for postdoctoral research in 2020 and will join the faculty of the University of Florida’s Warrington College of Business this fall.

When asked about a counterfactual personal narrative—how his life would have been different if he had not come to the U.S.—Yin replied: “If we use a synthetic control method for the intervention of me not moving to the U.S. for graduate school at the age of 22, we might take a weighted average of the life paths of my cohorts in Shanghai. Likely, I would be working in a technology or financial company, or as a math professor in China. Quite a different counterfactual path.” 

— Karina Alexanyan, Ph.D.