DSI Students Conduct Data-Driven Research with Industry Partners

April 19, 2018

In their final semester of the master’s program, DSI students take a Capstone class, where they divide into teams and work on data-science projects that give them invaluable hands-on experience in the field.

The teams work under the direction of industry affiliates, who guide them through the entire cycle of how to use data science to solve real-world problems. The DSI affiliates, a mix of corporations, government entities and non-profit agencies, help the teams define their projects, establish their goals and set deadlines. Sometimes a Capstone project will prove so successful that the affiliate will use the team’s findings to help improve data analytics at their company or organization. The students work on the projects for the entire semester, at the end of which they present their findings to their professors, affiliates and fellow students. And given the number of talented students at DSI, it’s no surprise that the final projects are impressive.

“I was delighted with the results of their projects and know that this experience will serve them well when they graduate from DSI,” said DSI Professor Andreas Mueller, who taught one of the Capstone classes.

For their part, the students say they gain experience in demonstrating and sharpening their data-science skills – experience that serves them well when it comes time for them to interview for jobs. As it is, students who study data science are in high demand, but acquiring hands-on experience enhances their marketability, since they can discuss their projects with recruiters.

“Working on the Capstone project gave us a complete understanding of the responsibilities of a data scientist,” said Sanjmeet Abrol, who belonged to a team in Professor Eleni Drinea’s class that built a recommendation system for Instacart, the online grocery firm. The system had two aims: to recommend products to customers and to give Instacart a predictive tool that shows when customer orders would arrive and which products those orders would contain.

The overall architecture for an internal analytics dashboard

The team used data from The Instacart Online Grocery Shopping Dataset 2017, which contained a sample of more than three million orders from 200,000 Instacart users. They found that products from the fresh fruits and fresh vegetables aisles appeared most frequently in orders, and that bananas were the most popular product. They also observed that organic products were highly popular among Instacart customers.
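The kind of frequency analysis described above can be sketched in a few lines of Python. This is only an illustration, not the team's code: the sample rows and the `top_products` helper are hypothetical, and the toy data merely stands in for the Instacart dataset.

```python
from collections import Counter

# Hypothetical toy rows of (order_id, product_name); not taken from the real dataset.
ORDER_ROWS = [
    (1, "Banana"), (1, "Organic Strawberries"),
    (2, "Banana"), (2, "Organic Baby Spinach"),
    (3, "Banana"), (3, "Organic Strawberries"),
]


def top_products(rows, n=3):
    """Return the n products that appear in the most distinct orders."""
    # Deduplicate so that a product counts at most once per order.
    unique_pairs = set(rows)
    counts = Counter(product for _, product in unique_pairs)
    return counts.most_common(n)


if __name__ == "__main__":
    for product, order_count in top_products(ORDER_ROWS, n=2):
        print(f"{product}: in {order_count} orders")
```

Counting distinct orders rather than raw rows keeps a product that is listed twice in one order from being over-weighted.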

The team’s findings will help Instacart build a user-product recommendation system and provide customer segmentation that will help the company plan marketing campaigns.

Team: Sanjmeet Abrol, Alexander Castleton, Kaavya Chinniah, Tarek Elelimy, Gregory Johnsen

Industry affiliate: Synergic Partners

Using Data Science to Help Police Combat Crime

Can data scientists use crime statistics to unearth patterns that will help law enforcement officials reduce crime in New York City? Could the crime data, moreover, be used to predict where and when crimes are most likely to occur?

For their project, this team sought to answer these questions by compiling statistics from the New York City Police Department’s historic complaint data, which includes felonies, misdemeanors, and violations reported to police from 2006 to 2016.

Crime prediction model and visualization application for New York City

The team focused on seven felonies (homicide, rape, robbery, burglary, grand larceny, grand larceny auto and felonious assault) as well as six offenses: weapon possession, arson, fraud, forgery, drug offenses and criminal mischief. It was a large dataset – nearly two million felonies – and to make it more manageable the team created an interactive crime map of New York City. The map shows the occurrence of crime over the period covered in the police data (2006-2016). They also designed an interface so that users could retrieve data on the type of crime, the location, and the time it occurred.
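A retrieval interface of the kind described, filtering by type of crime, location and time, might look roughly like the sketch below. Everything here is an assumption made for illustration: the `Complaint` record, its field names, the sample rows and the `query` function are hypothetical and do not reflect the team's application or the police dataset's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Complaint:
    """Hypothetical, simplified complaint record."""
    offense: str
    borough: str
    occurred_at: datetime


# Made-up sample records, for illustration only.
RECORDS = [
    Complaint("ROBBERY", "BROOKLYN", datetime(2010, 5, 1, 23, 15)),
    Complaint("BURGLARY", "QUEENS", datetime(2012, 8, 9, 3, 40)),
    Complaint("ROBBERY", "BROOKLYN", datetime(2015, 1, 20, 14, 5)),
]


def query(records, offense=None, borough=None, start=None, end=None):
    """Return the records that match every filter that is not None.

    The time window is half-open: start <= occurred_at < end.
    """
    matches = []
    for record in records:
        if offense is not None and record.offense != offense:
            continue
        if borough is not None and record.borough != borough:
            continue
        if start is not None and record.occurred_at < start:
            continue
        if end is not None and record.occurred_at >= end:
            continue
        matches.append(record)
    return matches


if __name__ == "__main__":
    for hit in query(RECORDS, offense="ROBBERY", borough="BROOKLYN"):
        print(hit.occurred_at.isoformat(), hit.offense, hit.borough)
```

Leaving a filter as `None` means "do not filter on this field," so the same function serves both broad and narrow lookups.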

They were not able to predict where crime might occur, but their crime map, along with their interface, could help police gain a historical understanding of crime in the city. And police could then use that understanding to decide where to deploy anti-crime units. Citizens could also use the team’s system to see where crimes are most likely to occur and what neighborhoods have the most crime.

Team: Panpan Cheng, Vibhuti Mahajan, Franck Ngamkan, Jared Samet

Industry Affiliate: Synergic Partners

Impact of Amazon Warehouses on Surrounding Areas

This team partnered with Capital One to study the impact Amazon facilities have on surrounding areas. The group analyzed two datasets: The first was publicly available data on the location of Amazon warehouses, and the other was economic and demographic indicators from the U.S. Census Bureau.

The impact of Amazon warehouses on surrounding areas

Using specific indicators such as population, income and age of residents for a three-year period (between 2012 and 2015), the team compared how counties with Amazon warehouses differed from counties without Amazon warehouses. They found that the presence of the warehouses had little to no effect on counties.

Counties with an Amazon warehouse progressed in a similar way economically and demographically as those counties without a warehouse. Additionally, they found it’s possible to create a predictive model that could determine where Amazon was likely to open a new warehouse – a matter very much in the news these days.
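A comparison of this kind can be illustrated with a very simple calculation: the average percentage change in an indicator for counties with a warehouse versus counties without one. The sketch below is an assumption-laden toy, not the team's method; the county rows, the income figures and the `mean_pct_change` helper are all hypothetical, and a naive difference like this says nothing about causation on its own.

```python
from statistics import mean

# Hypothetical rows: (has_warehouse, median_income_2012, median_income_2015).
COUNTIES = [
    (True, 50000, 53000),
    (True, 60000, 63000),
    (False, 48000, 51000),
    (False, 55000, 58500),
]


def mean_pct_change(rows, has_warehouse):
    """Average percentage change in the indicator for one group of counties."""
    changes = [
        (after - before) / before * 100
        for flag, before, after in rows
        if flag == has_warehouse
    ]
    if not changes:
        raise ValueError("no counties in the requested group")
    return mean(changes)


if __name__ == "__main__":
    with_wh = mean_pct_change(COUNTIES, True)
    without_wh = mean_pct_change(COUNTIES, False)
    print(f"with warehouse:    {with_wh:.2f}%")
    print(f"without warehouse: {without_wh:.2f}%")
    print(f"gap:               {with_wh - without_wh:.2f} points")
```

A gap close to zero would be consistent with a finding of little to no effect, though a real analysis would also need to account for how the two groups of counties differ to begin with.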

Team members: Alex Wainger, Sam Somuah, Jessica Raab and Qiong Hu

Industry affiliate: Capital One

Estimating demand for taxis at airports

This team evaluated the supply and demand of taxi service at LaGuardia Airport, and what they learned gave them pause: most taxis end up waiting three hours at the airport to pick up a passenger. The long wait makes drivers reluctant to come to the airport, which in turn makes it even harder, and slower, for passengers to get a cab.

Using data on incoming flights, weather patterns and departure times, the team trained a neural network to predict the demand for taxi pickups at LaGuardia Airport. And the bigger question the team addressed was this: In terms of the passengers arriving at the airport, what percentage choose a taxi over other modes of transportation? Based on input from their partners at the Taxi and Limousine Commission, the team estimated that 15 to 20 percent of passengers arriving on flights at the airport choose taxis as their mode of transportation.

The team’s model can be used by the Taxi and Limousine Commission to accurately predict taxi pickup demand at LaGuardia Airport, thereby improving service for both drivers and passengers. The model can also be used as a case study to justify better data-sharing practices between the Taxi and Limousine Commission and the Port Authority.
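The 15 to 20 percent estimate translates into a demand range by simple arithmetic, sketched below. This is not the team's neural network; the `taxi_demand_range` function and the passenger count in the example are hypothetical, and the result counts passengers rather than cab trips, since travelers often share a taxi.

```python
def taxi_demand_range(arriving_passengers, low_share=0.15, high_share=0.20):
    """Return (low, high) estimates of arriving passengers who choose a taxi.

    The default shares follow the 15 to 20 percent estimate cited in the
    article; the passenger count must be supplied by the caller.
    """
    if arriving_passengers < 0:
        raise ValueError("arriving_passengers must be non-negative")
    if not 0 <= low_share <= high_share <= 1:
        raise ValueError("shares must satisfy 0 <= low_share <= high_share <= 1")
    return arriving_passengers * low_share, arriving_passengers * high_share


if __name__ == "__main__":
    # Hypothetical hour in which 1,000 passengers arrive.
    low, high = taxi_demand_range(1000)
    print(f"Estimated taxi passengers: {low:.0f} to {high:.0f}")
```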

Team: Adam Coviensky, Anuj Katiyal, Keerti Agrawal and Will Geary

Adviser: NYC Taxi and Limousine Commission

Understanding water quality and composition in South Africa

This team partnered with Unilever, a consumer goods company, in identifying datasets that characterize the composition of water in South Africa, a high-business region for the company.

The aim of the project was to create a dashboard showing water-monitoring stations and metrics for pH, hardness of water and electrical conductivity. The team analyzed datasets from the Department of Water and Sanitation of South Africa. The group also evaluated data from Gemstat, a dataset with details on water quality at both global and local levels. They found that the department has 3,385 water-quality monitoring stations positioned throughout the country and that the stations monitor concentration levels of 16 elements found in the waterways.

The successful project is ongoing, and the team is working with Unilever on how it can best use the dashboard to get clean water for use in the production of its products.

Team: Conrad De Peuter, Baran Akyol, Haiqi Zhu and Xikai Chen

Industry Affiliate: Unilever

Analyzing H-1B Visas and the Demand for Foreign Workers

This team studied how the strength of the American economy affects the demand for H-1B visas. The visa program enables American companies to temporarily employ foreign workers who have specialized skills, resulting in a large number of applications each year.

To analyze the fluctuating demand, the team used datasets from Labor Certification Applications for the years 2001 to 2017. They also developed time-series models that predicted the number of applications filed for three years, and indicated which industries filed them.

The team found that visa caps and the availability of alternative visa classifications affected the demand for H-1B visas. The Optional Practical Training (OPT) extension for STEM degrees, for instance, gives considerable flexibility to employers and job seekers, thereby altering the scale of demand in the H-1B visa category.

They also determined that there was a dip in visa demand due to the recession in the years 2009-2012. In 2009, moreover, there was a drop in the number of workers requested due to a new system of visa caps.

The team successfully evaluated the demand for H-1B visas and even created an interactive application with visualizations to illustrate how the demand for the visas fluctuates.

Team: Pawel Buczak, Tejas Dharamsi, Daryl Kang, Hung Shi Lin and Marika Lohmus

Industry Affiliate: Goldman Sachs

Passenger behavioral patterns in Madrid’s Metro

In this project, the team profiled passenger behaviors in Madrid’s underground metro stations and studied the commuting patterns of metro riders in different stations.

Patterns in passenger ridership and station traffic in Madrid, Spain

For their main dataset, they used the metro ticket validation data for May 2016. One challenge the team faced was the absence of exit data from passengers, which would have helped them to discern commuting patterns. Additionally, it was difficult to identify transfer passengers without their destination data, and these passengers often constituted a group with interesting characteristics. The team, however, came up with solutions for this missing data by using turnstile data to supplement ticket-validation data.

The team performed segmentation analyses to understand how passengers select and prefer certain stations. The final outcome for them was clustering behavioral patterns in the form of maps and charts. The team believes that its work can help metro riders better plan their trips as well as help marketing companies to target specific audiences.