This spring, the Data Science Institute (DSI) at Columbia University hosted its inaugural Best Data Science Student Course Project Competition to highlight how students apply data science across a range of disciplines.
Hosted by DSI’s Education Working Group, the event showcased the work of 10 finalists chosen from among 26 faculty-nominated projects. Winners were selected based on creative use of data for an important problem, data science innovation, new cross-disciplinary insights, potential societal impact, and alignment with the Institute’s “Data for Good” mission.
One of the five winning projects, The Columbia Language Justice Perspectives Project, was led by two humanities majors, Nikita Desir, a current student, and alumna Kyra Ann Dawkins ’20. Desir and Dawkins presented work completed during the course entitled “Multilingual Language Technologies and Language Diversity,” which was co-taught by Isabelle Zaugg, a DSI postdoctoral research scholar, and Smaranda Muresan, a DSI research scientist and adjunct associate professor of computer science, and supported by the Collaboratory at Columbia program.
Both students entered the course with minimal computational skills, yet incorporated natural language processing and interactive mapping into their work, according to Zaugg. “The project captured a lot of nuance, especially in terms of the variety of perspectives on multilingualism and the challenges of translation,” she said.
Desir and Dawkins explored multilingual language experiences within the context of language justice. In phase one, they reached out to the Columbia community to crowdsource 15 “multilingual reflections” based on an original text. The reflections were written in 12 languages—Russian, Brazilian Portuguese, Mandarin, Bengali, Punjabi, Tagalog, Haitian Creole, Japanese, German, Amharic, Tamil, and Dutch. The team compared the authors’ translations of the original text with results from Google Translate.
The team compared the texts using the BLEU (Bilingual Evaluation Understudy) algorithm, which measures how closely a machine-translated text matches human-translated references. BLEU scores range from 0 to 1—the higher the score, the closer the machine output is to the human translation—with 0.15 taken as the minimum acceptable score for Google Translate. Desir and Dawkins found that five of the 12 languages (42%) received BLEU scores below 0.15, meaning Google Translate did an unacceptable job of capturing the authors’ intended meaning.
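For readers curious about the mechanics, the core of BLEU is modified n-gram precision combined with a brevity penalty. The following is a minimal, stdlib-only sketch of that idea—not the tooling Desir and Dawkins actually used (a library such as NLTK or sacreBLEU would be typical in practice), and it supports only a single reference translation:

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Toy BLEU score for one candidate against one reference.

    Computes modified n-gram precision for n = 1..max_n (each candidate
    n-gram is credited at most as often as it appears in the reference),
    takes their geometric mean, and applies a brevity penalty that
    discounts candidates shorter than the reference.
    """
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # any zero precision zeroes the geometric mean
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * geo_mean
```

A perfect match scores 1.0, and a translation sharing no 4-grams with the reference scores 0.0—which is why the 0.15 threshold mentioned above sits so close to the bottom of the scale.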
For phase two, the team visualized their results through an interactive global map by using the open source tool MapHub.net. The map shows the locations that the speakers chose for their languages, and allows users to see the original text, the authors’ translations, the Google translation, the BLEU score, and the participants’ thoughts on the translation process.
Desir and Dawkins concluded that “machine translation is a ‘double-edged sword’ in the fight for language justice in the digital sphere. Considerations of language justice in translation should be engaging and accessible, while also holding machine translation technologies accountable to meeting the needs of language communities.”
Desir says her experience in the course motivated her to pursue a computer science concentration, in addition to comparative literature and society, and she considers The Columbia Language Justice Perspectives Project her first step into the field of data science. “The course demonstrated to me how quantitative and qualitative methods can be innovatively combined to answer some of society’s most pressing questions, and I realized that I wanted to be a part of this effort.”
Similarly, Dawkins indicated that she will not shy away from embracing data science in the future. “This experience has shown me how crucial interdisciplinary collaboration is, and how transformative it can be in gaining new insights on pervasive issues,” she said. “In certain spheres, quantitative metrics can depict the stakes of an issue in ways that words simply cannot. By working on The Columbia Language Justice Perspectives Project, I developed a new appreciation for the leverage data can garner in mobilizing discourse.”
— Karina Alexanyan, Ph.D.