Working with corporate clients in New York and Madrid, students in our master’s in data science program presented their capstone projects last month. Two teams developed systems for Synergic Partners, a big data consulting firm based in Madrid: one for analyzing international case law and one for tracking the spread of information on Twitter. A third team developed a system for the investment bank Goldman Sachs that classifies U.S. patent applications by their underlying technologies. “The students produced results that should prove useful to their clients and possibly others,” said computer scientist Eleni Drinea, a Columbia lecturer who taught the capstone class. The projects are summarized below.
Mapping U.S. Patent Applications
Patent applications in the United States are classified by technology in a large database, but search results often leave out related technologies. In a project for an investment banking firm in New York City, this team used a topic modeling technique called Latent Dirichlet Allocation (LDA) to analyze the text of all utility patents filed in 2014 and infer their underlying themes. They found that their technique complemented the U.S. Patent and Trademark Office’s classification system, providing a fuller picture of overlapping technologies. The team used Python, Kibana, Elasticsearch and Shiny to explore, validate and visualize their data and results.
Students: Gabrielle Agrocostea, Francisco Arceo, Abdus Khan, Justin Law and Tony Paek.
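For readers curious about how LDA infers themes from text, the sketch below fits a two-topic model to a handful of made-up, patent-like word lists using collapsed Gibbs sampling. It is an illustration only, not the team's code: the toy documents, the function name `lda_gibbs`, and the settings for `alpha`, `beta` and the number of iterations are all assumptions chosen for the example.

```python
import random
from collections import Counter

def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling (toy scale)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]          # words in doc d assigned to topic k
    topic_word = [Counter() for _ in range(n_topics)]   # times word w is assigned to topic k
    topic_total = [0] * n_topics                        # total words assigned to topic k
    assignments = []

    # Start from a random topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample: p(topic t) is proportional to
                # (n_dt + alpha) * (n_tw + beta) / (n_t + V * beta).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions; each row sums to 1.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    # Most frequent words currently assigned to each topic.
    top_words = [
        [w for w, c in topic_word[k].most_common(3) if c > 0]
        for k in range(n_topics)
    ]
    return theta, top_words

# Hypothetical toy "abstracts", already tokenized.
docs = [
    "battery cell electrode charge battery".split(),
    "electrode lithium battery cell".split(),
    "antenna signal wireless receiver antenna".split(),
    "wireless signal receiver network".split(),
    "battery charge wireless receiver".split(),
]

theta, top_words = lda_gibbs(docs, n_topics=2)
for k, words in enumerate(top_words):
    print(f"topic {k}: {words}")
for d, row in enumerate(theta):
    print(f"doc {d}: {[round(p, 2) for p in row]}")
```

A document whose words split across topics, like the last toy entry, is the kind of overlap that a single classification code can miss. At the scale of a full year of patent filings, a library implementation would be the practical choice over a hand-written sampler.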
Automating Case Law Analysis
Understanding legal precedents is critical to prosecutors and defense lawyers in plotting their strategy at trial and predicting the trial’s outcome. In a project for Synergic Partners, a consulting firm in Spain, this team used case law from the United Nations Office on Drugs and Crime (UNODC) to develop an interactive system for analyzing the data. They demonstrated that the dashboard they developed could be connected to an external database to pull in additional information to complement the UNODC’s data. The team used Python, Kibana, Elasticsearch and Shiny to explore, validate and visualize their data and results.
Students: Lin He, Mandeep Singh, Bella Wang and Barbara Welsh.
Mapping Data Science Influencers on Twitter
Twitter is full of posts tagged #BigData and #DataScience. Which ones do people pay the most attention to? In a project for Synergic Partners, this team used network science and text-mining techniques to identify Twitter influencers in data science. They built a projected network by combining “retweet” and “mention” layers into a single layer and discovered communities using the K-Clique, Modularity, Random Walk and Mixed Membership Blockmodel community detection algorithms. They identified community influencers using centrality metrics and characterized users and communities using LDA. With a limited dataset of fewer than 200,000 tweets, they found that the modularity and random walk techniques produced the most coherent communities based on user demographics and influencers. An interactive visualization showed each community’s network and user demographics.
Students: Casey Huang, Claire Liu, Jordan Rosenblum and Steven Royce.
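To make the network step concrete, the sketch below merges a “retweet” layer and a “mention” layer into one weighted network and ranks users by a simple centrality score, the total weight of incoming edges. It is a minimal illustration under stated assumptions, not the team's method: the usernames and edges are invented, the function names are hypothetical, and the article does not say which centrality metrics the team used. Community detection is left out.

```python
from collections import Counter

# Hypothetical toy data: each pair is (acting user, user being retweeted or mentioned).
retweets = [("ana", "bo"), ("cy", "bo"), ("dee", "bo"), ("bo", "ana")]
mentions = [("ana", "bo"), ("cy", "ana"), ("dee", "cy")]

def project_layers(*layers):
    """Merge several edge lists into one weighted, directed layer."""
    weights = Counter()
    for layer in layers:
        for src, dst in layer:
            if src != dst:  # ignore self-retweets and self-mentions
                weights[(src, dst)] += 1
    return weights

def in_strength(weights):
    """Sum the incoming edge weight for each user (a weighted in-degree)."""
    scores = Counter()
    for (_, dst), w in weights.items():
        scores[dst] += w
    return scores

combined = project_layers(retweets, mentions)
ranking = in_strength(combined).most_common()
print(ranking)  # [('bo', 4), ('ana', 2), ('cy', 1)]
```

In this toy network "bo" ranks first because both layers point to that account. Users who receive no retweets or mentions, like "dee", do not appear in the ranking at all.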
— Kim Martineau