Capstones Mine Big Data Sets for Novel Insights

Working with the Institute's industry partners, 14 student teams presented their capstone projects last month, tackling problems in fields ranging from finance and energy to aviation and public health. "We had another outstanding set of data science projects this semester," said Eleni Drinea, a lecturer in computer science who led the capstone course with her colleague Owen Rambow. "The students worked on all aspects of data analysis, from acquiring and cleaning the data to building efficient data-storage systems to analyze and visualize it. Experimenting with diverse approaches, they provided useful conclusions to their industry mentors." Below is a summary of their work.

The team found that MicroStrategy Inc. dramatically changed its annual report language in 2013, replacing words like “business,” “result” and “customer” with finance terms like “loan,” “bank,” “mortgage” and “dividend.”

Measuring Shifts in How Firms Report Financial Performance

Automated text-analysis tools are increasingly being applied to financial reports to allow investors to quickly evaluate firm performance. In a project for Goldman Sachs, the team developed a visual tool to measure changes in word choice and topics over nine years of 10-Q and 10-K reports for 9,000 companies. They measured word changes by computing the cosine distance between the TF-IDF representations of the documents, and topic changes by computing the distance between topic vectors generated by LDA and word-embedding models. They found that topics changed by about 10 percent in any given year, except for 2008, when language shifted sharply to reflect the added risks brought on by the financial crisis.
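A minimal sketch of the word-change measurement, using scikit-learn and a few invented filing excerpts (not the actual 10-K data): the TF-IDF vectors of consecutive reports are compared with cosine distance, where a value near 1 signals a sharp shift in language.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

# Hypothetical filing excerpts standing in for one company's annual reports.
filings = {
    2012: "business results improved as customer demand grew",
    2013: "loan and mortgage portfolios drove bank dividend income",
    2014: "loan growth and dividend policy supported bank results",
}

years = sorted(filings)
tfidf = TfidfVectorizer().fit_transform(filings[y] for y in years)

# Cosine distance between consecutive years: values near 1 signal
# a sharp shift in the language of the report.
dist = {}
for i, (prev, curr) in enumerate(zip(years, years[1:])):
    dist[(prev, curr)] = cosine_distances(tfidf[i], tfidf[i + 1])[0, 0]
    print(f"{prev} -> {curr}: {dist[(prev, curr)]:.2f}")
```

In this toy example the 2012-to-2013 distance is the largest, because the two excerpts share no vocabulary, mirroring the kind of language shift the team flagged.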

Students: Juan Borgnino, Manuel Rueda, Hiroaki Suzuki, Xuyan Xiao, Shenghan Yu

Of the three methods the team used to predict battery lifespan, the LSTM Neural Nets method came closest to reproducing real-world results, as shown above. 

Predicting a Lithium Battery’s Lifespan

Lithium-ion batteries power everything from cellphones to electric cars, and predicting how long each one will last is critical for companies trying to optimize their battery-replacement schedules. In a project for General Electric, the team used NASA battery-performance data to predict the capacity and lifespan of an industrial lithium-ion battery. They experimented with three methods for training their model—Bayesian regression, Extreme Gradient Boosting (XGBoost), and the Long Short-Term Memory (LSTM) neural network—and found that Bayesian regression worked best with historical data and a physical model; XGBoost balanced efficiency and accuracy to predict a battery's end-of-life; and LSTM Neural Nets offered the most accurate look at the battery’s continuous behavior. With unlimited computing power, LSTM was the best method for predicting battery behavior down to the millisecond, they found. 
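A minimal sketch of the Bayesian-regression approach, fit to an invented capacity-fade curve rather than the NASA data: scikit-learn's BayesianRidge returns an uncertainty estimate alongside each prediction, which is what makes the method attractive when a physical model and historical data are available.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)

# Hypothetical capacity-fade data: a 2.0 Ah cell losing roughly
# 0.001 Ah per charge cycle, with measurement noise.
cycles = np.arange(1, 201).reshape(-1, 1)
capacity = 2.0 - 0.001 * cycles.ravel() + rng.normal(0, 0.01, 200)

model = BayesianRidge().fit(cycles, capacity)

# Predict capacity at a future cycle, with an uncertainty estimate --
# the main advantage of the Bayesian approach over a plain fit.
pred, std = model.predict([[400]], return_std=True)
print(f"cycle 400: {pred[0]:.3f} Ah +/- {std[0]:.3f}")
```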

Students: Michael William Bisaha, Jiabin Chen, Rajesh Babu Madala
Jierui Song

Toward Increased Profitability By Trimming Expenses

In a project for the German software company SAP, the team analyzed employee travel expenses to identify ways the company could increase profitability by cutting costs. Capping hotel stays at $250 per night would have reduced expenses by 16 percent over a three-year period, they found. Instituting an $85 daily cap on meals would have cut expenses by 30 percent in 2015 alone. They discovered that SAP employees could save significantly on air travel by flying during the summer months and booking round-trip flights on budget airlines such as Southwest Airlines and Ryanair. (November flights, they found, cost nearly twice as much as flights in July.) From the subset of data they analyzed, the team calculated that SAP could have saved $112,000 per year if a moderate spending plan were instituted. Savings would be substantially greater if the policy were adopted company-wide.
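The effect of an expense cap can be estimated by clipping each line item and comparing totals. A sketch with invented hotel charges (the real analysis used SAP's expense ledger):

```python
import pandas as pd

# Hypothetical nightly hotel rates pulled from an expense ledger.
hotel = pd.DataFrame({"nightly_rate": [180, 220, 310, 450, 260, 240, 520]})

CAP = 250  # proposed per-night cap, as in the team's analysis
capped = hotel["nightly_rate"].clip(upper=CAP)

savings = hotel["nightly_rate"].sum() - capped.sum()
pct = 100 * savings / hotel["nightly_rate"].sum()
print(f"capping saves ${savings} ({pct:.1f}% of hotel spend)")
```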

Students: Arushi Arora, Jiannan Zhang, Longdai Zhang, Yitong Zhong

Viewers that liked Star Wars: The Phantom Menace might also like Pathfinder and Suicide Squad, the team’s recommendation engine found.

A Movie-Based Recommendation System

Recommendation systems depend heavily on user reviews and other proprietary data to personalize their suggestions for news, movies and music. In a project for Synergic Partners, the team showed that a recommendation system using only public data can work surprisingly well at predicting what a user might like based on a previously viewed movie. Drawing from the 30,000-movie Internet Movie Database, or IMDb, the team built two recommendation models: one based on movie plot, and the second on features including actors, movie genre and the size of the production staff. A related app provides two recommendation lists based on the user’s pick. In a demo for their classmates, the team showed that if you liked Toy Story you might also like Tiny Furniture or Duck Tales.
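A minimal sketch of the feature-based model: each movie becomes a vector of indicator features (genre, cast) plus a scaled production-staff size, and nearest neighbors in that space become the recommendations. The movies and feature values below are invented for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Invented movies and features: binary genre indicators
# (animation, family, horror) plus a scaled production-staff size.
movies = ["Toy Story", "Tiny Furniture", "Duck Tales", "Alien"]
features = np.array([
    [1, 1, 0, 0.3],
    [0, 1, 0, 0.1],
    [1, 1, 0, 0.2],
    [0, 0, 1, 0.8],
])

nn = NearestNeighbors(n_neighbors=3).fit(features)
_, idx = nn.kneighbors(features[:1])       # neighbors of Toy Story
recs = [movies[i] for i in idx[0][1:]]     # drop the query itself
print("Because you liked Toy Story:", recs)
```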

Students: Carlos Espino Garcia, Arum Kwon, Qitong Liu, Tianhao Lu

Predicting Brexit from Comments Posted on Twitter

In a project for Synergic Partners, the team analyzed nearly 433,000 tweets posted just before Britain’s historic vote in June 2016 to leave the European Union. They wanted to understand whether views expressed on Twitter in the three days before the Brexit election could have predicted Britain’s surprising decision to leave the E.U. After classifying the tweets and analyzing them with several models, the team found that twice as many tweets expressed support for leaving the E.U. as for remaining, in contrast to most polls, which predicted Britain would vote to stay. The margin on social media was also significantly greater than the actual vote, in which 52 percent of Britons voted to leave.
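A toy version of the classify-then-count approach, with invented tweets standing in for the team's data: train a classifier on a few hand-labeled examples, apply it to the unlabeled stream, and compare the size of each camp.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical hand-labeled tweets (stand-ins for the team's training set).
train = ["take back control vote leave", "leave the eu now",
         "britain stronger in europe", "vote remain for jobs"]
stance = ["leave", "leave", "remain", "remain"]

clf = make_pipeline(CountVectorizer(), MultinomialNB()).fit(train, stance)

# Classify unlabeled tweets, then compare the two camps' volumes.
unlabeled = ["time to leave the eu", "vote leave take control",
             "stronger in europe together"]
pred = list(clf.predict(unlabeled))
print(pred.count("leave"), "leave vs", pred.count("remain"), "remain")
```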

Students: Nikita Nataraj, Patrick Sean Rogan, Hui Su, Yaoru Yi

The above heat-map shows how pockets of crime in Chicago shifted to different neighborhoods over a six-year period.

Visualizing Gun Violence and Poverty in American Cities 

The Chicago-based nonprofit, Cure Violence, uses intervention programs and community outreach to stop the spread of violence. In a project for Cure Violence and the consulting firm Booz Allen Hamilton, the team combined gun-crime and census data for New York, Chicago and Baltimore, to understand the interplay between crime, poverty and other factors. An interactive tool they built lets users explore the data and put several trends in sharp focus: Chicago gun-crime peaks annually in July and tapers off in February; New York had the lowest gun-crime rates of the three cities, possibly due to its well-funded gun-prevention program; and gun crime in Baltimore has been steadily rising.

Students: Allison Fenichel, Alimu Mijiti, Gary Sztajnman 

Detecting Signs of Depression on Social Media 

About half a million New Yorkers suffer from depression at any given time, and many go untreated. In a project for New York City’s Department of Health and Mental Hygiene, the team scraped public social media posts from Twitter and Reddit to build a classifier able to flag posts associated with depression. Using a model trained on the scraped data, they built a web app in R Shiny that gives the probability that the author of any given social media post is depressed.   
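A minimal sketch of such a classifier, with invented posts (and in Python rather than the team's R Shiny app): a text model trained on labeled examples that reports a probability rather than a hard label.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled posts (1 = language associated with depression).
posts = ["i feel hopeless and empty every day",
         "cant sleep everything feels pointless",
         "great hike with friends this weekend",
         "excited to start my new job"]
labels = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(posts, labels)

# Like the team's web app, output a probability rather than a hard label.
prob = clf.predict_proba(["feel so empty and hopeless lately"])[0, 1]
print(f"probability of depression-associated language: {prob:.2f}")
```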

Students: Jordan Chazin, Maura Fitzgerald, Aleksandr Makarov, Kuo Zhang   

In an analysis of 700,000 flights, the team found that long flights, bad weather and heavy air-traffic contributed most to airspace violations.

Predicting Airspace Violations

If an airplane flies into another plane’s airspace, its operator can be cited for an air-safety violation called “loss of separation.” In a project for Synergic Partners, the team analyzed 700,000 U.S. flights over a three-week period this past fall. They found that planes were more likely to violate each other’s airspace on long flights, in bad weather, and when more planes were flying at once. They also found that some airlines violated the five-mile by 1,000-foot vertical separation rule more than others. With more work, their model could be applied to real-time flight data to reduce the risk of airspace violations, they said.
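The separation rule itself is easy to state in code. A sketch, assuming the lateral minimum is 5 nautical miles and using the haversine formula for great-circle distance (the team's actual features and thresholds may differ):

```python
import math

def loss_of_separation(lat1, lon1, alt1_ft, lat2, lon2, alt2_ft,
                       lateral_nm=5.0, vertical_ft=1000.0):
    """True if two aircraft breach both the lateral (5 nm) and
    vertical (1,000 ft) separation minima at the same time."""
    R_NM = 3440.065  # mean Earth radius in nautical miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    # Haversine great-circle distance between the two positions.
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    lateral = 2 * R_NM * math.asin(math.sqrt(a))
    vertical = abs(alt1_ft - alt2_ft)
    return lateral < lateral_nm and vertical < vertical_ft

# Two aircraft about 3 nm apart laterally and 500 ft apart vertically.
print(loss_of_separation(40.0, -74.0, 30000, 40.05, -74.0, 30500))
```

Separation is lost only when both minima are breached at once, hence the `and`.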

Students:  Ignacio Aranguren, Sam Guleff, Phoebe Yijing Sun, Sherry Zhang

Summarizing Beauty Product Reviews to Infer Customer Satisfaction

Text data can provide valuable insights into how consumers feel about a product. In a project for Unilever, the team analyzed nearly 200,000 beauty product reviews on Amazon, and 450 Unilever focus-group responses, to evaluate techniques for inferring customer sentiment from massive data sets. They implemented two unsupervised methods for extracting summaries from multiple documents and found that a method called semantic volume maximization worked best. In the team’s visualization, each blue dot is a 140-word sentence; the closer it came to summarizing a focus-group response, the closer it appears to that response along the three axes.
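A simplified take on the idea behind semantic volume maximization (not the team's exact algorithm): greedily pick the sentence vectors that most expand the volume spanned by the summary so far, measured as the residual norm after projecting out the sentences already chosen.

```python
import numpy as np

def greedy_volume_summary(vectors, k):
    """Greedily pick k sentence vectors that approximately maximize the
    volume they span: at each step, take the sentence with the largest
    residual after projecting out the sentences already chosen."""
    V = np.asarray(vectors, dtype=float)
    chosen, basis = [], []
    for _ in range(k):
        best, best_gain = None, -1.0
        for i in range(len(V)):
            if i in chosen:
                continue
            r = V[i].copy()
            for b in basis:
                r -= (r @ b) * b
            gain = np.linalg.norm(r)
            if gain > best_gain:
                best, best_gain = i, gain
        chosen.append(best)
        b = V[best].copy()
        for q in basis:
            b -= (b @ q) * q
        basis.append(b / (np.linalg.norm(b) + 1e-12))
    return chosen

# Sentences 0 and 1 are near-duplicates; the summary keeps only one.
picked = greedy_volume_summary([[1, 0, 0], [0.9, 0.1, 0],
                                [0, 1, 0], [0, 0, 1]], 3)
print("chosen sentence indices:", picked)
```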

Students: Shuni Fang, Joshua Daniel Safyan, Junhui Lia Shen, Shengzhong Yin

A Lyric-based Music Recommendation System

When Spotify and Pandora recommend new music, lyrics are rarely part of the equation. In this project for Synergic Partners, the team used data science techniques to analyze nearly 13,000 songs from an 80-year period ending in 2008. They built their recommendation engine by representing the lyrics in vector space and arranging them into clusters, then creating a KNN classifier. They used the classifier to generate recommendations, first by selecting a cluster among the initial sample, and then selecting the closest songs in vector space based on Euclidean distance. Of the nearly 4,400 artists in their sample, they found that Frank Sinatra was the most common artist and White Christmas the most common song.
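A minimal sketch of the cluster-then-recommend pipeline, using invented 2-D lyric embeddings (real lyric vectors would be high-dimensional): cluster the songs, then recommend the Euclidean-nearest songs within the query's cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented 2-D lyric embeddings; songs 0-2 and 3-5 form two styles.
songs = ["Song A", "Song B", "Song C", "Song D", "Song E", "Song F"]
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.15, 0.25],
              [5.0, 5.1], [5.2, 4.9], [5.1, 5.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

def recommend(query_idx, k=2):
    """The k nearest songs (Euclidean) within the query's cluster."""
    label = km.labels_[query_idx]
    pool = [i for i in range(len(X))
            if km.labels_[i] == label and i != query_idx]
    pool.sort(key=lambda i: np.linalg.norm(X[i] - X[query_idx]))
    return [songs[i] for i in pool[:k]]

print("Because you liked Song A:", recommend(0))
```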

Students: Cheng Chen, Rohit Bharadwaj, Roberto Martin, Tian Tan

Ad campaign models generally performed better when contextual features were introduced. Though results varied by campaign, most saw a small improvement.

Predicting Ad-Response Rates

Online advertisers commonly use logistic regression models to match ads to customers, estimating the value of an ad opportunity by the likelihood that the ad will lead the viewer to click through to the client's website or complete some other desired outcome. In a project for the digital ad-tech firm MediaMath, the team built on the company’s existing ad-response model to include the interplay between information about the user's history, the client's ad campaign, and the context of the ad.

The team then applied a machine learning technique called regularization to MediaMath’s model to try to improve its ability to generalize to new, unseen data and make it faster to evaluate. MediaMath currently trains a separate model for each advertising campaign; the team also experimented with training models across campaigns, at different levels of granularity, and with using hybrid tree-based models as inputs for a logistic regression. All of these adjustments led to a 2 to 3 percent improvement in estimating the value of an ad opportunity, they found. MediaMath processes 4 million requests each second, so even a small change can impact the bottom line.
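How L2 regularization trades coefficient size for generalization can be seen in a small sketch with invented impression data (not MediaMath's model or features); in scikit-learn, `C` is the inverse penalty strength:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical ad impressions: 20 features, of which only the first 3
# actually drive clicks; the rest are noise.
X = rng.normal(size=(500, 20))
y = (X[:, :3].sum(axis=1) + rng.normal(0, 0.5, 500) > 0).astype(int)

# C is the inverse regularization strength: smaller C = heavier L2 penalty.
strong = LogisticRegression(C=0.1, max_iter=1000).fit(X, y)
weak = LogisticRegression(C=100.0, max_iter=1000).fit(X, y)

# The penalty shrinks coefficients, especially on the noise features,
# which is what helps a model generalize to unseen data.
print("sum |coef|, heavy penalty:", float(abs(strong.coef_).sum()))
print("sum |coef|, light penalty:", float(abs(weak.coef_).sum()))
```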

Students: Jose Ramirez Soto, Xizewen Han, Ryan Walsh, Xiyue Yu

Predicting the Next President from Opinions Expressed on Twitter 

The failure of political polls to accurately capture public sentiment was glaringly evident in the recent U.S. presidential race. In a project for Synergic Partners, the team investigated how well Twitter and traditional methods did at assessing candidate popularity. From a sample of 4 million tweets posted during the third debate between Hillary Clinton and Donald Trump, the team labeled 7,000 as positive, negative, or neutral. Building a model with their classified data, they predicted that Trump would win the election with 275 electoral votes. (He ended up winning 306 votes.) Their model called 40 of 51 states correctly, and missed on states that had small margins of victory or minimal data. This approach, they said, could help improve traditional polling methods.

Students: Yaran Fan, Daniel Kost-Stephenson, Qing Xu,
Tao Yu, Yuhao Zhou

A Smarter Way to Search Bloomberg's News Archives

In a project for Bloomberg, the team used the company’s vast news archive to build a system that speeds up the process of finding a relevant article by building an auto-complete system that anticipates what the user wants as she types. To do this, they had to overcome two challenges. The first was to build a system from the news database alone, without the benefit of aggregating past searches. The second was to build a system capable of suggesting more than two words at a time, useful in phrase completion. They based their system on an n-gram model that uses machine learning to analyze the prefixes and suffixes in a query to estimate the probability of a match. Testing additional models, they included the word2vec and co-occurrence models in their final system. 
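The core of an n-gram completer can be sketched in a few lines. The headlines below are invented, and the real system also folded in word2vec and co-occurrence signals; this toy version uses only bigram counts to greedily extend a typed prefix by several words:

```python
from collections import Counter, defaultdict

# Invented headlines standing in for the Bloomberg news archive.
headlines = [
    "fed raises interest rates again",
    "fed raises interest rate outlook",
    "fed holds interest rates steady",
    "oil prices fall on supply fears",
]

# Bigram counts: for each word, which words tend to follow it.
follow = defaultdict(Counter)
for h in headlines:
    words = h.split()
    for a, b in zip(words, words[1:]):
        follow[a][b] += 1

def complete(prefix, depth=3):
    """Greedily extend a typed prefix with the most likely next words."""
    out = prefix.split()
    for _ in range(depth):
        nxt = follow.get(out[-1])
        if not nxt:
            break
        out.append(nxt.most_common(1)[0][0])
    return " ".join(out)

print(complete("fed"))
```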

Students: Nitish Ratan Appanasamy, Vishal Juneja, Woojin Kim, Bo Zhang

Predicting When Phone Customers Are Getting Ready to Close Their Accounts

A large body of research suggests that it costs companies more to attract new customers than to keep the customers they already have. In a project for Synergic Partners, the group worked with Brazil’s largest phone company, Telefonica, to understand why some customers cancel service. They used mobile phone data, among other sources, to try to identify factors tied to customer dissatisfaction. Among the sampling strategies and models they tried, the Random Forest method seemed to work best.
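A minimal Random Forest churn sketch on invented usage data (not Telefonica's): when churn is driven by dropped calls and short tenure, the fitted forest's feature importances surface exactly those drivers.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000

# Invented usage features: monthly minutes, dropped-call rate, tenure.
X = np.column_stack([
    rng.normal(300, 80, n),
    rng.uniform(0, 0.1, n),
    rng.integers(1, 60, n),
])
# Invented ground truth: churn driven by dropped calls and short tenure.
churn = ((X[:, 1] > 0.06) & (X[:, 2] < 24)).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, churn)

for name, imp in zip(["minutes", "dropped_rate", "tenure"],
                     rf.feature_importances_):
    print(f"{name}: {imp:.2f}")
```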

Students:  Haiyuan Cao, Eloi Morlaas, Mengqi Wang, Yanyan Zhang

Previous capstone projects:

Capstones Explore Problems in Medical Imaging, Finance, Public Health and More, Spring 2016
Capstones Map Patent Applications, Case Law and Twitter Influencers, Fall 2015

— Kim Martineau

550 W. 120th St., Northwest Corner 1401, New York, NY 10027    212-854-5660
©2017 Columbia University