Master’s students work on cutting-edge university research

Before they graduate, DSI master’s students must complete Capstone projects in which they apply the data-science techniques they’ve learned in their classes to real-world problems. The semester-long projects challenge students to work in teams and use data science to solve pressing societal problems.

Until now, the Capstone teams have been advised by DSI’s industry affiliates, who assign projects and data sets to the students and mentor them through their research. The Data Science Institute sponsors the Industry Affiliates Program to develop mutually beneficial interactions with its industry partners. Building on the success of the industry-sponsored Capstones, DSI has built a parallel program that pairs student teams with Columbia faculty advisers. The faculty-advised Capstones offer students a chance to work on “cutting-edge academic research across the university,” says Tian Zheng, associate director for education at DSI and professor of statistics at Columbia.

This semester, eight Capstone teams are working on fundamental academic research with professors, adds Zheng, who originated the idea for the faculty-sponsored Capstones. Each of the eight projects was approved by a DSI committee and all of the projects support the Institute’s overarching goal to use “data for good.” Whereas industry-sponsored projects seek solutions to problems in the private sector, the faculty-sponsored projects focus on solving challenges within fundamental academic research.

“I’m really excited by the breadth of the research and caliber of the faculty advisers,” says Zheng. “With direction from faculty, the teams will use data science to enhance several fields of research at Columbia, such as biology, nursing, neuroscience and space studies. It’s a way of expanding our Capstone program to offer students a chance to conduct either industry or academic research, depending on their interests and inclinations.”

The eight academic projects are as follows:

  • Using Climate Data to Predict Heavy Snowfalls (2 Teams)
  • Broadband Internet Availability
  • Studying the Formation of Biofilms to Diagnose and Treat Infection
  • Supporting the Caregivers of People with Alzheimer’s and Dementia
  • Understanding the Genetic Basis of Exploratory Behavior in Animals
  • Turbulence in Environmental Flows
  • Monitoring Carbon Sink to Predict Climate Change

Using Climate Data to Predict Heavy Snowfalls

As the earth gets warmer, two predicted consequences are more water vapor in the atmosphere (about a 7 percent increase in water vapor for every degree Celsius of warming) and more intense precipitation. Observations have confirmed these increases in recent decades. But what effect will these increases have on winter snowstorms? Will snowfall increase, or decrease because it’s warmer, and will the storms become more or less damaging? This team will use a large data set (700 gigabytes) of 50 regional climate simulations covering the North American East Coast region from Georgia to Newfoundland. The climate data come from Ouranos, a consortium in Montreal that conducts regional climate simulations in North America. The team will calculate the frequency of large snowstorms and track how the storm statistics will change regionally every decade through 2100. Observations suggest that there has been an uptick in the biggest snowstorms in parts of the region over the last ten years, even as the climate of the earth has warmed. The team will analyze the model statistics and determine whether this is indeed a signal of climate change or a function of natural variability. Being able to predict big storms is important, since heavy snowfalls along the East Coast are expensive to contend with and can be highly disruptive. More accurate predictions of what to expect could thus help emergency officials plan for and mitigate the harm from big snowstorms.
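To make the water-vapor figure concrete, here is a minimal Python sketch of the arithmetic. It assumes the roughly 7 percent increase compounds with each additional degree of warming; the article does not say whether the rate compounds, and the function name `vapor_increase` is purely illustrative.

```python
def vapor_increase(warming_c, rate_per_degree=0.07):
    """Fractional increase in atmospheric water vapor for a given warming.

    Assumption (not stated in the article): the ~7% per degree C compounds.
    """
    return (1 + rate_per_degree) ** warming_c - 1


for degrees in (1, 2, 3):
    print(f"{degrees} C of warming: about {vapor_increase(degrees):.1%} more water vapor")
```

Under this assumption, two degrees of warming would correspond to about 14.5 percent more water vapor, slightly more than a simple doubling of 7 percent.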

Faculty Adviser: Gavin Schmidt, a climate scientist and director of the NASA Goddard Institute for Space Studies. He is also an adjunct researcher at the Columbia Earth Institute and a member of the Data Science Institute.

Broadband Internet Availability

The Federal Communications Commission (FCC), the federal agency that regulates interstate communications, provides extensive data sets on broadband internet availability, some of which reach back to 2010. Known as Form 477 data sets, they contain data on carriers serving each U.S. census block, the smallest geographic unit used by the Census Bureau for tabulating data from all houses. The number of census blocks in the U.S., including Puerto Rico, for the 2010 Census was 11,155,486. This team will use those data combined with census and funding data from the Universal Service Administrative Company, a group designated by the FCC to help bring broadband service to people in rural, underserved, and difficult-to-reach areas. The team will seek to answer key questions such as: What demographic factors predict deployment of broadband over time, particularly in rural areas? How do carriers of different types expand? Does public funding, through the Universal Service Fund subsidy, measurably affect deployment speed or areas, and has the role of non-traditional broadband providers changed? The Form 477 data are known to overstate the availability of broadband and to contain errors such as internet access that disappears and then reappears a year later. Can the team estimate those errors, for example, by using panel-data models – models of multidimensional data involving measurements over time? And can the team supplement the data with other data sources? The goal is to help policy makers at the state and federal level understand where broadband is likely to be deployed “naturally,” without subsidy, how soon this is likely to happen, and where policy interventions such as encouraging alternative providers or granting subsidies are necessary.
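One simple way to flag the “disappears and then reappears” pattern is sketched below in Python. This is an illustration under stated assumptions, not the team’s method: the panel layout, the block and carrier identifiers, and the function name `find_flickers` are all hypothetical, and the real Form 477 files are structured differently.

```python
def find_flickers(coverage_by_year):
    """Return years where service is reported missing between two years that report it.

    coverage_by_year: dict mapping year -> bool (True if service is reported).
    Only single-year gaps are flagged; longer gaps are left alone.
    """
    years = sorted(coverage_by_year)
    flickers = []
    # Slide a three-year window over the series: present, absent, present.
    for prev, cur, nxt in zip(years, years[1:], years[2:]):
        if coverage_by_year[prev] and not coverage_by_year[cur] and coverage_by_year[nxt]:
            flickers.append(cur)
    return flickers


# Hypothetical panel: (census block, carrier) -> reported availability by year.
panel = {
    ("block_A", "carrier_1"): {2014: True, 2015: False, 2016: True, 2017: True},
    ("block_B", "carrier_1"): {2014: False, 2015: True, 2016: True, 2017: True},
}

# Keep only the block-carrier pairs with at least one suspicious gap.
suspect = {}
for key, series in panel.items():
    gaps = find_flickers(series)
    if gaps:
        suspect[key] = gaps

print(suspect)
```

A rule like this only surfaces candidate reporting errors. Estimating how often they occur, as the project proposes, would still call for a statistical model of the panel.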

Two Capstone teams will work on this project.

Faculty Adviser: Henning Schulzrinne, Levi professor of computer science at Columbia and member of DSI. He served as the Chief Technology Officer for the U.S. Federal Communications Commission from 2011 to 2014.

Studying the Formation of Biofilms to Diagnose and Treat Infection

A biofilm is a community of bacteria encased in a scaffold that protects the microorganisms, which makes it harder to treat them with antibiotics. Biofilm structures vary: the bacterial species and their genetic makeup, as well as external conditions such as temperature and oxygen levels, help determine biofilm structure. If researchers could understand biofilm formation, they might be able to disrupt the aggregating process. This Capstone team will study images of biofilms and attempt to quantify their structures and distinct morphologies. Images of certain biofilms reveal complex, wrinkled surfaces. Scientists have studied the qualitative nature of these surfaces, but little research has been done to quantify the structures. To do that, this team will grow biofilms under specific conditions, changing variables such as temperature and humidity to study the morphological responses of the bacterial communities and understand how genes contribute to their structures. By determining which genes influence biofilm structure, researchers hope to identify pathways that could be targeted by new drugs. The project could also help develop tools that hospitals could use to diagnose bacterial conditions. Furthermore, if the team can learn how to prevent the formation of biofilms, they might be able to suggest ways to help make conventional antibiotics more effective in treating infections.

Faculty Adviser: Lars Dietrich, associate professor of biological sciences at Columbia.

Supporting the Caregivers of People with Alzheimer’s and Dementia

Family caregivers of persons with Alzheimer’s disease and other dementias have it doubly hard. They must manage their own symptoms – depression, anxiety or ailments relating to aging (caregivers are often older and face chronic diseases themselves) – while caring for the person living with Alzheimer’s or dementia. In previous research, faculty adviser Suzanne Bakken created a data set from tweets relating to the family caregivers. She examined the social network to understand the mechanisms of social support available to caregivers, including informational, instrumental and emotional support. This Capstone team will apply data-science techniques such as topic modeling, sentiment analysis, and social-network analysis to understand caregivers’ symptoms of mental and social health and how they manage them. The team will use topic modeling to understand which symptoms are relevant; sentiment analysis to understand which symptoms are especially worrisome; and social-network analysis to understand sources of support. The goal of the project is to design a Twitter-based method to elucidate symptoms and to support the design of interventions that will help caregivers manage symptoms of the person with Alzheimer’s or dementia as well as their own symptoms, which in turn will help them better provide for their loved ones.

Faculty Adviser: Suzanne Bakken, alumni professor of nursing and professor of biomedical informatics at Columbia and member of DSI, where she co-chairs the Health Analytics Center.

Understanding the Genetic Basis of Exploratory Behavior in Animals

Animals use different strategies to explore their environment. Some prey species are more willing to explore open areas while others avoid these open spaces, where predators can lurk. Faculty adviser Andres Bendesky has tracked the exploratory behavior of 1,600 hybrids of two species of deer mice that show extreme differences in how they explore open areas. He has genotyped each hybrid throughout the genome to map the regions responsible for behavioral differences between the species. That information, in combination with data from video recordings of the animals in the laboratory, will help the team determine the genetic basis of the exploratory behaviors. The team’s analysis will consist of searching for places in the genome that are statistically associated with differences in behavior among the hybrids.

Faculty Adviser: Andres Bendesky, assistant professor of ecology, evolution and environmental biology at Columbia and principal investigator at the Zuckerman Institute.

Turbulence in Environmental Flows

The primary objective of this project will be to demonstrate that researchers can use high-resolution, turbulence-resolving simulations to develop machine-learning techniques that represent turbulence in environmental flows, an approach known as subgrid parameterization (i.e., the numerical representation of turbulence in coarse-grain models). The team will also use deep learning and convolutional neural network techniques to inform the development of new models of turbulence, which could become the next generation of turbulence models at a fraction of the computational cost of high-resolution models. The project can help to solve a variety of environmental problems, such as pollution, heat transport and climate change.

Faculty Adviser: Pierre Gentine, associate professor in the department of Earth and Environmental Engineering and a member of DSI, where he belongs to the Smart Cities Center.

Monitoring Carbon Sink to Predict Climate Change

The climate is changing due to human emissions of carbon into the atmosphere. But not all the carbon emitted remains in the atmosphere. In fact, over the course of the industrial era the ocean has absorbed the equivalent of 41 percent of all human fossil-fuel-derived carbon dioxide emissions, a phenomenon known as a carbon “sink.” Studying the ocean carbon cycle is thus critical to understanding and predicting climate change. It’s also essential for efforts that seek to limit climate change by reducing the growth rate of atmospheric CO2 concentrations. Ocean data are quite sparse, and CO2 in water cannot be directly measured from space. This Capstone team will use machine-learning methods to develop mapped estimates of surface ocean CO2 concentrations from the limited data that are available.

Faculty Adviser: Galen McKinley, professor of earth and environmental science at Columbia and a faculty member at Lamont-Doherty Earth Observatory.

At the end of the semester, the teams will present their findings to their fellow students and advisers.

You can click here to learn more about the Capstone program.

Faculty click here to submit projects for Spring 2018.