The student teams that presented during the second annual competition event were selected as finalists by the DSI Education Working Group committee from among many impressive nominations. Following the presentations, a panel of researchers from across Columbia’s campuses served as judges. Finalists competed across several judging criteria that emphasized cross-disciplinary data science research and innovation in data science methods.
The event provided a snapshot of how Columbia students are applying data science across many practices, including engineering, economics, health, public policy and more. Successful student teamwork and problem solving were highlighted by each of the finalists.
The competition was chaired and moderated by Jeff Goldsmith, Associate Professor of Biostatistics at the Mailman School of Public Health and Co-Chair of the DSI Education Working Group. Committee member Isabelle A. Zaugg led a group discussion during the event.
The 5 Winning Teams
The teams below were recognized for excellence across our judging criteria: creative use of data for an important problem; data science innovations; new cross-disciplinary insights; potential societal impact; and alignment with DSI’s “Data for Good” mission. They are listed in no particular order:
Understanding How Readers Comprehend Visualizations With Captions of Different Semantic Level Content
Trend R: Digital Tool To Enable Decision-making Using Census Data
Script Key: An Image-Based Keyboard for Non-Encoded Alphabetic Scripts
COVID and CitiBike Usage
Don’t Go Far Off: An Empirical Study on Neural Poetry Translation
Short Description: Affordable housing is a core basic need for families and it is often one of the most important financial decisions. Following the 2008 mortgage crisis, it is important for society to better understand how the housing market changed and the economic variables that play a part in these shifts. We published a report to educate and inform buyers, builders, and policymakers on our real estate market.
COVID and CitiBike Usage
Team Members: Jessica Rosenberg, Rose Killian, Matthew Russell, Zoe Verzani, Christina Zettler
Instructor: Jeff Goldsmith
Short Description: COVID-19 had an extreme impact on transportation patterns in NYC. In our project, we used R to better understand how transportation use and travel patterns shifted with the onset of the pandemic by examining CitiBike and MTA data. In our work, we also conducted exploratory analyses investigating spatial and demographic trends in transportation use across NYC.
Trend R: Digital Tool To Enable Decision-making Using Census Data
Team Members: Bo Crauwels, Hanzhang Hu, Shreyans Kothari, Dan Li, Pengyun Li, Ru Lu
Instructor: Aracelis Torres
Short Description: TrendR is a multi-functional user dashboard that enables community organizations to visualize changes and trends in data through maps and graphs. It aims to empower community organizations to utilize the potential of publicly available resources to facilitate data-driven decision making.
Don’t Go Far Off: An Empirical Study on Neural Poetry Translation
Team Members: Tuhin Chakrabarty, Arkadiy Saakyan
Instructor: Smaranda Muresan
Short Description: Despite constant improvements in machine translation quality, automatic poetry translation remains a challenging problem due to the lack of open-sourced parallel poetic corpora, and to the intrinsic complexities involved in preserving the semantics, style and figurative nature of poetry. We present an empirical investigation for poetry translation along several dimensions: 1) size and style of training data (poetic vs. non-poetic), including a zero-shot setup; 2) bilingual vs. multilingual learning; and 3) language-family-specific models vs. mixed-language-family models. To accomplish this, we contribute a parallel dataset of poetry translations for several language pairs. Our results show that multilingual fine-tuning on poetic text significantly outperforms multilingual fine-tuning on non-poetic text that is 35X larger in size, both in terms of automatic metrics (BLEU, BERTScore, COMET) and human evaluation metrics such as faithfulness (meaning and poetic style). Moreover, multilingual fine-tuning on poetic data outperforms bilingual fine-tuning on poetic data.
Identifying Patterns of Two Types of Principal Turnover in the USA and Singapore Using Decision Trees
Team Member: Jasmine Shi
Instructor: Alex Bowers
Short Description: Principal turnover has become a severe concern in the United States because of its adverse impacts on teachers and students. However, few studies have analyzed the relative importance of predictors of principal turnover, and cross-cultural comparisons of principals’ turnover patterns are even scarcer. The project takes a machine learning approach to describe two types of principals’ turnover patterns in the United States (N = 166) and Singapore (N = 169) using the Teaching and Learning International Survey 2018 (TALIS 2018) data.
Measuring the Integration and Network Effect of the SDGs
Team Members: Peishan Li, Qinyue Hao, Jasmine Hwang, Dan Li, Rina Shin, Ye Xu, Hanyu Zhang, Lizabeth Singh
Instructor: Charles Riemann
Short Description: This project measured the success of, and linkages between, the United Nations’ Sustainable Development Goals (SDGs) as interactive networks. To do this, we developed (1) a text-based network model, and (2) a coefficient-based model using the SDG Indicator Database in order to identify connections and interdependencies between goals.
Mental Health in Athletes
Team Members: Katharina Fijan, Erin Donnelly
Instructor: Joyce Robbins
Short Description: This project is an exploration of mental health among athletes of various sports with different levels of competitiveness, stigmas, and teamwork. Using an open-source dataset, we sought to quantify aspects of mental health per sport and sport groupings to identify potential at-risk populations and inspire healthy change in their communities.
Script Key: An Image-Based Keyboard for Non-Encoded Alphabetic Scripts
Team Members: Eve Suane Loomis Washington, Gabriela Arredondo, Megan Fleurine St. Hilaire, David Rosado
Instructor: Isabelle Zaugg
Short Description: This project sought to develop a tool that would enable users to send messages using unencoded scripts. Script Key allows for generating a library of symbol images, not only for use in messages, but also for crowdsourcing examples of a script’s use that are required for Unicode standardization.
Understanding How Readers Comprehend Visualizations With Captions of Different Semantic Level Content
Team Members: Hazel Zhu, Shelly Cheng
Instructor: Eugene Wu
Short Description: This project investigates how readers gather takeaways from a visualization when captions are expressed at different semantic levels. To do this, we first asked one group of participants to identify visually prominent regions in a set of single- and multi-line charts. Then we generated captions based on the visually prominent features and a four-level model of semantic content. Lastly, we asked another group of participants to report, through questionnaires, their takeaways after observing chart-caption pairs. With this project, we hope to provide insights into effective caption writing so that the information in a visualization can be better delivered to readers.
One of last year’s projects, R Story, recently won two $10,000 prizes from the US Census Bureau: Best Tool for Equity and Inclusion, and the Student Prize in the category of Climate, Resilience, and the Natural Environment.