Working with corporate clients in New York and Madrid, students in our master’s in data science program presented their capstone projects last month. Two teams developed systems for analyzing international case law and the spread of information on Twitter for Synergic Partners, a big data consulting firm based in Madrid. A third team developed a system for classifying U.S. patent applications by their underlying technologies for the investment bank Goldman Sachs. “The students produced results that should prove useful to their clients and possibly others,” said computer scientist Eleni Drinea, a Columbia lecturer who taught the capstone class. The projects are summarized below.

Mapping U.S. Patent Applications

Patent applications in the United States are classified by technology in a large database, but search results often leave out related technologies. In a project for an investment banking firm in New York City, this team used a topic modeling technique called Latent Dirichlet Allocation (LDA) to analyze the text of all utility patents filed in 2014 and infer their underlying themes. They found that their technique complemented the U.S. Patent and Trademark Office’s classification system, providing a fuller picture of overlapping technologies. The team used Python, Kibana, Elasticsearch and Shiny to explore, validate and visualize their data and results.

Students: Gabrielle Agrocostea, Francisco Arceo, Abdus Khan, Justin Law and Tony Paek.

The team’s algorithm classified patents into multiple categories, complementing the USPTO’s classification system. It picked up some novel categories, including “computer systems,” represented by Topic 1 above, which the team discovered had underlying ties to patents related to medicine/cancer and hardware patents (frames, rails, brackets), Topics 4 and 10 respectively.
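For readers curious how LDA can place one document in several categories at once, the following is a minimal sketch in Python, not the team's actual code. It fits LDA with collapsed Gibbs sampling on a tiny made-up corpus. The function name `lda_gibbs`, the toy documents, the number of topics and the values of `alpha` and `beta` are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]          # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(n_topics)]   # tokens per (topic, word)
    topic_total = [0] * n_topics                        # tokens per topic
    assign = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


# Hypothetical toy "abstracts"; real patent text would need tokenizing first.
docs = [
    "processor memory cache processor network".split(),
    "memory network processor server cache".split(),
    "tumor cell therapy antibody cell".split(),
    "antibody therapy tumor dose cell".split(),
    "server memory tumor therapy processor cell".split(),
]
alpha = 0.1
doc_topic, topic_word = lda_gibbs(docs, n_topics=2, alpha=alpha)

# Smoothed topic mixture for each document (each row sums to 1).
mixtures = [
    [(c + alpha) / (sum(counts) + len(counts) * alpha) for c in counts]
    for counts in doc_topic
]
for d, mix in enumerate(mixtures):
    print(d, [round(p, 2) for p in mix])
```

Each row printed is a document's share of each topic, which is what allows a single patent to count toward more than one category instead of receiving one fixed label.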

Automating Case Law Analysis

Understanding legal precedents is critical to prosecutors and defense lawyers in plotting their strategy at trial and predicting the trial’s outcome. In a project for Synergic Partners, a consulting firm in Spain, this team used case law from the United Nations Office on Drugs and Crime (UNODC) to develop an interactive system for analyzing the data. They demonstrated that the dashboard they developed could be connected to an external database to pull in additional information to complement the UNODC’s data. The team used Python, Kibana, Elasticsearch and Shiny to explore, validate and visualize their data and results.

Students: Lin He, Mandeep Singh, Bella Wang and Barbara Welsh.

The interactive dashboard above provides an overview of the 1,940 criminal cases in the UNODC’s database and includes the country in which charges were filed and the highest court in which the accused faced trial. The filtered results above summarize cases involving criminal intent.

Mapping Data Science Influencers on Twitter

Twitter is full of posts tagged #BigData and #DataScience. Which are the ones that people pay attention to most? In a project for Synergic Partners, this team used network science and text-mining techniques to identify Twitter influencers in data science. They built a projected network by combining “retweet” and “mention” layers into a single layer and discovered communities using the K-Clique, Modularity, Random Walk and Mixed Membership Blockmodel community detection algorithms. They identified community influencers using centrality metrics and characterized users and communities using LDA. With a limited dataset of fewer than 200,000 tweets, they found that the modularity and random walk techniques produced the most coherent communities based on user demographics and influencers. An interactive visualization showed each community’s network and user demographics.

Students: Casey Huang, Claire Liu, Jordan Rosenblum and Steven Royce.

The random walk and modularity algorithms independently picked out the above community of 5,115 Twitter users, largely concentrated in France. These techniques, along with others the team analyzed, can help identify influencers within communities, as well as those who bridge communities and are thus effective targets for a marketing campaign.
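The pipeline described in this section (merge the layers, find communities, rank users) can be illustrated with a small Python sketch. It is not the team's implementation and rests on several assumptions: the six users and their edges are invented, retweets and mentions are treated as undirected links, communities come from a naive greedy merge that maximizes modularity, and weighted degree stands in for the unspecified centrality metrics.

```python
from collections import Counter
from itertools import combinations


def combine_layers(*layers):
    """Project several edge lists onto one undirected weighted layer."""
    weights = Counter()
    for layer in layers:
        for u, v in layer:
            weights[tuple(sorted((u, v)))] += 1
    return weights


def weighted_degree(weights):
    """Sum of edge weights touching each node (a simple centrality)."""
    degree = Counter()
    for (u, v), w in weights.items():
        degree[u] += w
        degree[v] += w
    return degree


def modularity(weights, parts):
    """Modularity Q of a partition given as a list of node sets."""
    m = sum(weights.values())
    degree = weighted_degree(weights)
    q = 0.0
    for part in parts:
        inside = sum(w for (u, v), w in weights.items()
                     if u in part and v in part)
        total = sum(degree[n] for n in part)
        q += inside / m - (total / (2 * m)) ** 2
    return q


def greedy_communities(weights):
    """Merge the pair of communities that most raises Q until none does."""
    nodes = sorted({n for edge in weights for n in edge})
    parts = [{n} for n in nodes]
    best = modularity(weights, parts)
    while len(parts) > 1:
        candidates = []
        for i, j in combinations(range(len(parts)), 2):
            merged = [p for k, p in enumerate(parts) if k not in (i, j)]
            merged.append(parts[i] | parts[j])
            candidates.append((modularity(weights, merged), merged))
        q, merged = max(candidates, key=lambda c: c[0])
        if q <= best:
            break
        best, parts = q, merged
    return parts, best


# Hypothetical users a-f; each pair is one retweet or one mention.
retweets = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e"), ("e", "f")]
mentions = [("b", "a"), ("d", "f"), ("c", "d")]

weights = combine_layers(retweets, mentions)
communities, q = greedy_communities(weights)
degree = weighted_degree(weights)

print(sorted(sorted(c) for c in communities), round(q, 4))
for community in communities:
    # The highest weighted degree marks a candidate influencer.
    print(sorted(community), max(sorted(community), key=lambda n: degree[n]))
```

On this toy network the greedy merge separates users a, b and c from users d, e and f. The single link between c and d is the kind of bridge between communities that the caption above refers to.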

— Kim Martineau