Data Science Day at Columbia 2016

March 17, 2016

Across Columbia, researchers are using big data tools to make breakthroughs in their fields. Our second annual Data on a Mission event will feature lightning talks by professors applying new techniques to problems in artificial intelligence, marketing, health care, security and the urban and natural environment. The second part of the day will feature live demos and opportunities to speak with researchers about current projects. Dan Doctoroff, CEO of Sidewalk Labs, will give the opening keynote, “The Coming Technological Revolution in Cities.” The event will conclude with a networking reception. Summaries of the lightning talks are listed below.

*Sidewalk Labs CEO Dan Doctoroff will give the opening keynote.*

DATA SCIENCE DAY | Columbia University | Apr. 6, 2016 | 9 am to 6 pm

Mining Images, Speech, Text and Social Ties for Insights and Important Events

Shih-Fu Chang Exploring Multimedia Recognition Tools in Big Data Applications

Advances in computer vision and the growth of digital photos and videos have created new opportunities to integrate content-recognition tools with mobile apps and large-scale systems. If you want more information about a building, product or bottle of wine, it’s now possible to search the Web with an image on your phone. New 3D sensors and search tools allow users to scan real-world objects and find matching models to make new products. Emerging multimedia-recognition tools, like Columbia’s NewsRover system, are making it possible to track and summarize breaking news from streaming video and social media. Similar technologies are embedded in smart search engines that can mine video footage from sporting events, roads and security cameras to flag key events, from touchdowns to traffic accidents to criminal activity. I will give an overview of the novel technologies we are developing and discuss open issues.

*Columbia’s NewsRover system searches for and summarizes key events in streaming video. (Courtesy of Shih-Fu Chang)*

Julia Hirschberg Applications for Detecting Emotion in Text and Speech

Identifying the emotional content of written and spoken language is increasingly useful in business, medicine and security. Large data sets of text and speech, including social media, interviews and phone conversations, can be used to train systems to detect consumer reactions to products and services (and to flag ‘fake’ reviews), to diagnose medical conditions such as depression, and to identify deception in a wide variety of government, business and social service settings. Each application picks up subtle cues that may indicate whether a speaker is angry, happy, disgusted, afraid, sad or surprised. Similar approaches have been used to distinguish among personality traits, and to infer how tired, drunk or bored someone might be.

Kathy McKeown Tracking Events Through Time: Objective and Personal Views

*Streaming news feeds from Egypt’s 2011 Tahrir Square protests were among the inputs Kathy McKeown and her colleagues used to build their disaster-update software. (Jonathan Rashad)*

The chaos following Hurricane Sandy in 2012 brought home the need for a faster, more accurate way to filter the oceans of text streaming over social media and news sites during and after a crisis. We have been working on an automated method for monitoring and summarizing news as events unfold. Our method can flag new information as it becomes available, and generate updates. This can be extremely useful during emergencies as well as for tracking a wide variety of everyday events. In a related project, we’ve come up with a way to automatically identify the most compelling part of a personal narrative, what we call the “most reportable event.” I will discuss the natural language processing techniques that underlie this work, and future research directions.

Tian Zheng Mapping Subpopulations within Big Networks

Estimating the size of stigmatized groups such as the homeless, people with HIV and commercial sex workers remains difficult, even in the digital age. Those belonging to marginalized subpopulations may be difficult to reach by phone, or in online surveys, or may simply prefer to keep sensitive personal information to themselves. Advances in network science are now allowing researchers to move past these obstacles to learn more about hard-to-reach demographic groups. My colleagues and I have developed a modeling framework to infer the size, and other hidden features, of subpopulations within a large study sample. Our method produces inferential results that are easy to interpret and relevant for visualizing, monitoring and understanding structures underlying large, complex networks.

*In a survey, respondents were asked how many people they knew with the names Robert, Jaycee and Christina. Tian Zheng and her colleagues mapped those names to a hypothetical social network, where they inferred that groups of people who are homeless, in prison or have AIDS were mostly male. (Courtesy of Tian Zheng)*
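To make the idea of sizing a hidden group from "how many people do you know" questions concrete, here is a minimal sketch of the basic network scale-up estimator. It is a simpler textbook stand-in, not the modeling framework described above, and every number in it (survey answers, group sizes, total population) is invented for illustration. Each respondent's personal network size is first estimated from groups of known size, such as people with a given first name. The hidden group's size is then scaled up from the share of acquaintances who belong to it.

```python
def scale_up_estimate(known_counts, known_sizes, hidden_counts, total_population):
    """Basic network scale-up estimate of a hidden group's size.

    known_counts[i][k]: how many people respondent i knows in known group k
    known_sizes[k]:     true size of known group k (e.g. from census data)
    hidden_counts[i]:   how many people respondent i knows in the hidden group
    """
    total_known = sum(known_sizes)
    # Step 1: estimate each respondent's personal network size (degree)
    # from the fraction of the known groups they report knowing.
    degrees = [total_population * sum(row) / total_known for row in known_counts]
    if sum(degrees) == 0:
        raise ValueError("respondents reported no acquaintances in known groups")
    # Step 2: scale the share of hidden-group acquaintances up to the population.
    return total_population * sum(hidden_counts) / sum(degrees)


# Hypothetical survey: three respondents, two known name groups.
known_counts = [[3, 1], [5, 2], [2, 1]]
known_sizes = [5000, 2000]
hidden_counts = [1, 0, 1]
estimate = scale_up_estimate(known_counts, known_sizes, hidden_counts, 1_000_000)
print(round(estimate))
```

The simple estimator assumes that everyone is equally likely to know members of every group, which rarely holds for stigmatized populations. That limitation is one reason richer models are needed.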

Developing Algorithms that Know Your Likes and Dislikes Better Than You

Shipra Agrawal Explore and Exploit: Because You May Not Know What You’re Missing

To improve its movie recommendations to subscribers, Netflix looks at what subscribers liked in the past to predict future preferences. But that method leaves out movies subscribers might like even better but don’t know about. Amazon faces a similar problem in recommending products to its customers. Discovering the full range of possibilities involves a trade-off between exploration and exploitation of data. Many sequential decision-making problems are rooted in this problem, including recommendation systems, online advertising, content optimization, revenue and inventory management, and even teaching computers to play games like Pong and Go. I will discuss how machine learning and optimization techniques can be combined to achieve near-optimal trade-offs between exploration and exploitation.
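The trade-off can be illustrated with a small simulation. The sketch below uses epsilon-greedy, a common textbook heuristic chosen here only for brevity; it is not presented as the method discussed in the talk. The three "titles" and their like rates are hypothetical. Most of the time the recommender exploits the title with the best observed like rate, but with a small probability `epsilon` it explores a random title instead, which is how it can discover a better option it would otherwise never try.

```python
import random


def epsilon_greedy(true_rates, steps=5000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy recommender over a fixed set of titles.

    true_rates[a] is the (hidden) probability that a viewer likes title a.
    Returns how often each title was shown and its observed like rate.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    counts = [0] * n_arms   # times each title was recommended
    means = [0.0] * n_arms  # running average of observed likes
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: try a random title
        else:
            arm = max(range(n_arms), key=lambda a: means[a])  # exploit
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
    return counts, means


# Hypothetical like rates for three titles; the third is the best.
counts, means = epsilon_greedy([0.2, 0.5, 0.8])
print(counts, [round(m, 2) for m in means])
```

Setting `epsilon` to zero removes exploration entirely, and the recommender can then lock onto whichever title happened to do well first.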

Olivier Toubia Recommending Movies by Featured Character Traits

Most movie recommendation systems rely on viewers’ past preferences. We propose an alternative that taps into viewer preferences for stories featuring positive character traits such as kindness, fairness and humility, a finding documented in the media psychology literature. Borrowing from the positive psychology literature, we have developed a character-based classification system that is easy to interpret, communicate and act on. We have also developed a companion natural language processing tool that can infer character traits from movie summaries. In two online studies, we show that character traits are a strong predictor of what movies people like. Our results apply to films that achieve critical acclaim as well as box-office success. We show that character-based classification works for models that use content alone, and content with collaborative filtering, to predict viewer behavior.

The Moneyball Approach to Healthier Living

*Instead of comparing two data streams manually, as in figure A, one data stream is collided with the inverse of the other, as in B, to produce a combined stream that reflects the similarity of the original streams. This hybrid stream can be used in classification, decision and optimization problems. (Courtesy of Hod Lipson)*

Hod Lipson Data Smashing: Uncovering Order in Data Streams

From speech recognition to the discovery of new stars, almost all automated tasks involve comparing streams of data for similarities and outliers. Automated discovery methods, however, have not kept pace with the exponential growth in data. One reason is that most algorithms depend on humans to define what features to compare. Here, we propose a new way to match multiple sources of data streams without any prior learning. We show how this principle can be applied to challenging problems, including the interpretation of EEG patterns in epileptic seizures, the detection of abnormal heartbeats in ECG data and the classification of astronomical objects from light measurements. Our data smashing principles produce results as accurate as algorithms developed by domain experts, and could open the door to understanding increasingly complex observations that experts don’t yet know how to interpret.

David Madigan Observational Studies: Promise and Peril

Randomized experiments are the gold standard in measuring the effects of interventions in medicine, education, social science and other areas. In reality, researchers often rely on observational studies, leading to vast numbers of contradictory findings published in scholarly journals and widely disseminated through the media. Decision makers and the public assume that a rigorous peer-review process guarantees that these results are valid. This is not always so. Well-intentioned analysts make design choices, run analyses and publish their results, overlooking the possibility that different choices might have produced entirely different results. I will provide an overview of the current state of the art in observational studies in healthcare and describe some promising research directions.

Lena Mamykina Predicting Blood-Glucose Levels to Manage Diabetes

*The above tool allows diabetics to plan each meal based on their individual physiology and real-time blood-sugar levels. (Courtesy of Lena Mamykina)*

Advances in personal health tracking promise to help individuals gain deep insights into their health and behavior. Yet, most health apps still rely on humans to identify trends, make discoveries and take action. In this research, we are building computational models and interactive decision-support tools to help type 2 diabetics improve their nutritional choices. Our decision-support tool forecasts how a planned meal will influence blood-glucose levels based on an individual’s physiology and past data. Early results suggest that this automated prediction tool may produce more accurate assessments than individuals or their healthcare providers can.

Adler Perotte Predicting Kidney Disease Progression with Large-Scale Patient Data

Columbia University coordinates a global network of health databases known as the Observational Health Data Science and Informatics (OHDSI) collaborative. With hundreds of millions of patient records, OHDSI allows researchers to look for large-scale patterns that can reveal new ways to identify and treat disease. In a recent study, my colleagues and I used observational health data to build a model to predict how likely a patient with stage 3 kidney disease, in which the kidney has lost half of its function, is to progress to stage 4, with up to 90 percent loss. Our model, which incorporated patient lab test results and clinical records, outperformed models that did not include this information. Identifying patients at high risk for disease progression allows doctors to customize treatment that can stall or prevent its progression.

Measuring and Addressing Social and Environmental Problems in Cities

*In an ongoing experiment, Fred Jiang is measuring individual energy use in Columbia’s Northwest Corner building. Above, one student (in red) has consumed more energy than another (in blue). (Courtesy of Fred Jiang)*

Donald Davis Mining Yelp Reviews to Measure Segregation in New York City

Until they were dismantled in the mid-1960s, the segregationist Jim Crow laws in the southern United States severely limited social interactions among ethnic groups. Despite the Civil Rights Act and later reforms, the U.S. remains deeply segregated, even in northern cities like New York. While standard measures of segregation exist for residences, jobs and schools, we currently have no way of measuring how segregated common public activities like going to restaurants are. By studying five years of Yelp reviews in New York City, my colleagues and I provide the first estimate of diversity in city restaurants. Early results suggest that dining patterns are also segregated, though not as markedly as in housing.

Fred Jiang Smart Systems for Monitoring Air Pollution and Personal Energy Use

Analyzing observations of the physical world can be a messy process. But the rise of sensors to measure air quality, ocean temperatures and any number of other changes is allowing us to study our environment and actions like never before. I will discuss two projects that use intelligent sensor systems to map the environment. In one, my colleagues and I combined inexpensive, custom-built Internet-connected sensors with cloud-based data analysis to measure and infer air quality at city scale. In a second project, here at Columbia, my lab is combining building energy-use monitoring with location data to estimate an individual’s energy footprint and provide real-time feedback that can help cut energy use.

Desmond Patton Preventing Gang Violence through Social Media Analysis

*In the above tweet, gang banger Gakirah Barnes vows to avenge the death of her friend, Rasaan “Lil B” Patterson, who was allegedly killed by police. The “100” emoji signifies that she means business. (Courtesy of Desmond Patton)*

Social media is often an extension of the street for gang-involved youth. They may taunt rival gang members, downplay shootings and brag about fights and drug deals. Sometimes the tough talk turns into real violence. To be able to intervene, social workers need to understand how likely a specific post on Twitter is to lead to violence. Doing so requires deciphering the coded language and culture of gang-involved youth. I have recently collaborated with social science researchers and data scientists to analyze Twitter posts by Chicago gang members. Our goal is to combine observations with natural language processing tools to detect and decode high-risk language. I will discuss our process and early results.

Innovations to Keep Data Secure

Jason Healey Building a Defensible Cyberspace

Cyber attacks top the list of national security threats and also pose a threat to our personal finances, as recent data breaches at banks, credit card companies and businesses have shown. One of the main reasons that cyber threats are escalating is that for decades it has been far easier to attack than defend. Columbia’s School of International and Public Affairs (SIPA) has convened a New York Cyber Task Force to bring together policymakers and technologists in academia, banks and other industries to determine how to reverse the problem so that a dollar of defense buys more than a dollar of attack. A defensible cyberspace has checks and balances and a broad set of stakeholders acting as stewards. It can adapt to changing conditions, recover quickly after failure and scale up solutions. I will discuss what technologies and policies have been most successful to date and what more is needed.

Tal Malkin Secure Computation: Encrypted Search and Beyond

Secure computation is one of the most exciting achievements in cryptographic research in the last few decades. It allows mutually distrustful parties to jointly perform computations on private data without revealing any extraneous information. Once a theoretical field, secure computation is becoming increasingly practical and relevant to real-world applications. I will discuss a private database management system that we have developed, Blind Seer. This system allows clients to perform a rich set of queries over an encrypted database while keeping the data and query hidden. Blind Seer runs efficiently on a 100-million-record, 10-terabyte database, running two to 10 times slower than insecure MySQL queries on a non-encrypted database.
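For readers new to the idea, the toy sketch below shows one of the simplest forms of secure computation: several parties compute the sum of their private numbers using additive secret sharing. It is an illustration only and makes no claim about how Blind Seer works; the function names, the modulus and the example values are all assumptions made for this sketch. Each party splits its value into random-looking shares, hands one share to every participant, and only the combined total is ever reconstructed.

```python
import secrets

MODULUS = 2**61 - 1  # assumed to be larger than any sum we expect to compute


def share(value, n_parties):
    """Split value into n_parties additive shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_values):
    """Simulate parties jointly summing their inputs without revealing them."""
    n = len(private_values)
    # Party i splits its input into n shares and sends share j to party j.
    all_shares = [share(v, n) for v in private_values]
    # Party j adds up the shares it received; each partial sum looks random.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % MODULUS
                    for j in range(n)]
    # Publishing only the partial sums reveals the total and nothing else.
    return sum(partial_sums) % MODULUS


# Hypothetical private inputs held by three different parties.
total = secure_sum([52000, 61000, 48000])
print(total)
```

A single process plays all the parties here, so the sketch shows the arithmetic rather than a deployable protocol. A real system would also need secure channels between parties and protection against participants who deviate from the rules.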

— Kim Martineau