Yinqiu He was drawn to statistics by the discipline’s balance of theoretical mathematics and applied work.

“Statisticians help scientists analyze, understand, and get useful information from their data,” she said. “I enjoy building tools that can help other scientists and practitioners solve their problems…the best thing about being a statistician is that you get to play in everyone’s backyard.”

He completed her doctorate in statistics at the University of Michigan after receiving a bachelor’s degree in statistics from the University of Science and Technology of China. Today, as a postdoctoral research scientist at Columbia University’s Data Science Institute (DSI), she develops theory and methodology for analyzing complex-structured data to address interdisciplinary scientific problems. Her research interests include high-dimensional and large-scale statistical inference, rare event simulation, mediation pathway analysis, network analysis, and statistical machine learning, with applications in statistical genetics, genomics, and metabolomics.

With high-dimensional data, each subject has many characteristics or variables associated with it, according to He. “The question becomes, ‘How can we make sense of the high dimensionality of data to get information that is useful? How do you pull from all these features and characteristics to get insights and knowledge?’”

He chose to spend a year at DSI to expand her research profile and make new connections. “I’ve had the opportunity to open my mind and broaden my horizons to see how broadly data science is playing a role in different areas outside statistics,” she said. “I’ve learned about many different interesting problems in new and interesting areas, and I appreciate the exposure to the diversity of data science applications and to many different people.”

For example, He collaborates with Columbia statistics professor Zhiliang Ying to develop methods for analyzing complex student performance data, including National Center for Education Statistics data, and for evaluating the fairness of test questions by looking for evidence of differential item functioning (DIF). A test question is labeled as having DIF when people from different groups who have the same latent ability have unequal probabilities of giving a correct response.

“When tests are designed, we hope that given a student’s abilities, the probability that they answer the question correctly should be the same. We don’t want other factors like gender, race, or their age to influence that probability,” He explained. “Modern educational assessment data is high dimensional because each student answers a large set of test questions. This requires high dimensional data analysis to gain insights into a student’s ability levels and the difficulty of the questions. That’s where I want to put my knowledge to use.”
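
The DIF idea described above can be illustrated with a small sketch. It assumes a simple logistic (Rasch-style) response model, which is a common textbook choice and not necessarily the model He and Ying use; the function names, the `group_shift` offset, and the example groups are all hypothetical.

```python
import math


def prob_correct(ability, difficulty, group_shift=0.0):
    """Probability of a correct answer under an assumed logistic model.

    group_shift is a hypothetical group-specific offset; 0.0 means the
    item behaves the same way for that group as for the reference group.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty + group_shift)))


def has_dif(difficulty, shift_by_group, abilities, tol=1e-9):
    """Return True if groups with equal ability differ in P(correct)."""
    for ability in abilities:
        probs = [
            prob_correct(ability, difficulty, shift)
            for shift in shift_by_group.values()
        ]
        if max(probs) - min(probs) > tol:
            return True
    return False


# Illustrative items: one treats both groups alike, one penalizes group_b.
fair_item = {"group_a": 0.0, "group_b": 0.0}
biased_item = {"group_a": 0.0, "group_b": -0.5}
abilities = [-1.0, 0.0, 1.0]

print(has_dif(0.0, fair_item, abilities))    # False
print(has_dif(0.0, biased_item, abilities))  # True
```

In real assessment data the ability levels and any group offsets are unobserved and have to be estimated from the responses, which is where the high-dimensional analysis comes in; this sketch only shows what the definition says once those quantities are known.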

He’s interest in large, dynamic networks also led her to analyze data from Citi Bike, including clusters of users across city locations and how active users are at different times and days. “Our goal for the project is to extract both the underlying geographic structure and the time-evolving effect from the dynamic network simultaneously. This could help us understand user behavior and guide future allocation of bike stations.”

After Columbia and DSI, He will join the faculty of the University of Wisconsin–Madison as an assistant professor of statistics.

“Life is complicated, but math is simple,” she said. “When I focus on math, my spirit soars beyond this complicated world, and I feel happy. I’m lucky to be able to focus on work that is pure and not complicated in that way.”

—Karina Alexanyan, Ph.D.