Explore cross-disciplinary data science research projects from 10 Columbia student team finalists as they compete for awards.


March 5, 2021 (11:00 AM – 1:00 PM ET) – Online Event

Hosted By

DSI Education Working Group

About the Competition

The student teams presenting during this inaugural competition event were selected as finalists by the DSI Education Working Group committee from among many impressive nominations. Following the presentations, a panel of esteemed faculty from across Columbia’s campuses will serve as judges. Finalists will compete across several judging criteria, such as creative use of data science; potential societal impact; and alignment with DSI’s “Data for Good” mission.

This event will provide a snapshot of how Columbia students are applying data science across many disciplines including computer science, electrical engineering, statistics, public health, comparative literature, and more. Successful student teamwork, problem solving, and innovation in data science methods will be highlighted.

Join this event to hear student research presentations. Following the presentations, while judges deliberate, there will be time for our audience to engage in discussion with the students.

This event will be chaired and moderated by Tian Zheng, Professor and Department Chair of Statistics, Columbia University; and Co-Chair of the DSI Education Working Group.

Project Descriptions

Attacks on Aid: Quantifying Risk of Violence Toward Aid Workers in Global Humanitarian Settings

Attacks on aid workers in complex crises have been steadily increasing, with a record number of workers killed, kidnapped, or wounded in recent years. Through interactive data visualization and a case study on Afghanistan, our website illustrates the risks of being an aid worker in conflict settings. In characterizing violence against aid workers, our project is a spotlight that equips humanitarian organizations to better protect their staff – because saving lives should never cost lives. 

2020 US Presidential Election Exploration

The COVID-19 pandemic was the most consequential global event of 2020, and it shaped every facet of the 2020 U.S. Presidential Election. The surge in mail-in ballots made voting fraud and election integrity a serious concern, although many case studies have shown these concerns to be baseless. While other analyses discuss how U.S. economic conditions, racial unrest, and Trump’s response to COVID-19 affected his odds of being reelected, we focus on the public funding and spending of election campaigns and examine whether those factors have any power to indicate the winner of the 2020 U.S. Presidential Election.

R Story – Empowering Communities for Resilience and Sustainable Growth, in partnership with the Opportunity Project (a Census Bureau initiative)

Your community has a story – feel supported when telling it. Through R Story’s interactive dashboard, rural leaders can easily access their community’s data in a format that is ready to present to entrepreneurs, developers, and future residents looking for their next location. With R Story, community leaders can leverage their data to obtain resources for their community and sustainably build their economy.

Intelligent Forecasting for COVID-19 in Collaboration with KPMG

COVID-19 has impacted us, but we can move forward with greater confidence if we know what has happened and what is likely to happen. This project entails prescriptive and descriptive understanding of COVID-19 for the 50 biggest cities and 50 states across the U.S., a tailored COVID-19 forecasting model for each region, a mobility forecasting model, and an interactive website with visualizations. Businesses can easily understand the trajectory of this pandemic for their specific city or state by utilizing the website.

The Columbia Language Justice Perspectives Project

The act of translation in all forms is one of beauty and tension, but machine translation is a true double-edged sword that can either protect or endanger digital multilingual experiences in the fight for language justice. The Columbia Language Justice Perspectives Project presents these nuanced global stakes of equity in translation through an interactive map that contextualizes multilingual reflections and indicates the discrepancies in Google Translate technology. Considerations of language justice and translation should be engaging and accessible; however, they should also hold machine translation technologies accountable for meeting the needs of language communities.

Characterizing Walt Whitman’s Stylistic Changes in Leaves of Grass 

Walt Whitman is an iconic American poet famous for his poetry collection Leaves of Grass. It is well established by literary scholars that his style changed significantly throughout his 40-year writing career. Applying appropriate quantitative tools (natural language processing) to various editions of Leaves of Grass, we obtain impressive results regarding Whitman’s stylistic changes in many respects. These tools extend easily to a large number of literary works and enable people to read books more efficiently and accurately.

Perovskite Stability Prediction

The discovery and deployment of novel materials are critical to scaling any sustainable energy endeavor, such as converting sunlight to energy or making batteries to store such energy. For example, most solar panels are currently made of silicon, which is not efficient enough to realistically meet the energy demands of large cities. Additionally, huge batteries would be needed to store enough energy for large-scale solar energy to be practical. Materials also need to be cheap, light, and nontoxic. While machine learning approaches have been criticized as being too much of a “black box” (i.e., their use does not lead to a better understanding of the science), the surge in papers using machine learning to predict materials has provided evidence of their effectiveness. Our project, Perovskite Stability Prediction, mentored by Dr. Billinge of the Applied Physics, Applied Mathematics, and Materials Science Department, aims to use machine learning techniques to find better materials for sustainable energy applications.

Network Characterization of Phishing Attacks

Phishing attacks are one of the most widespread and persistent threats to cybersecurity. By impersonating trustworthy entities, attackers trick victims into disclosing sensitive information such as passwords and credit card information. Current methods to deal with phishing are well known to most attackers and can be easily bypassed. Identifying and understanding network-level characteristics of phishing emails is key to modernized cybersecurity, providing a novel and more robust method of preventing phishing emails from reaching vulnerable targets.

Estimating the Incidence of Sexual Assault on College Campuses

Each year, US colleges and universities are required to disclose the number of reported sexual assaults on their campuses. However, sexual assault is widely believed to be underreported, and the number of reported assaults could arise from any combination of reporting rate and true total number of assaults. This project aims to disentangle those two values, allowing plausible estimates of the total number of assaults occurring in a given year. Such estimates improve the interpretability of campus crime statistics, and may help inform policy decisions regarding campus initiatives for sexual assault prevention and awareness.  

NYC Employment Analysis 

Description coming soon