Each spring and fall, the DSI Scholars program connects Columbia University students with select faculty-led projects that are seeking to apply data science methods to novel research problems. Scroll down for an overview of the program’s selected projects for Fall 2026. Detailed project descriptions, including Scholar responsibilities and qualifications, can be found below in the Project Details section.


Mapping Toxic Entanglements

School: Arts and Sciences
Department: Anthropology
Campus: Morningside

Mapping Toxic Entanglements (MTE) is a digital platform that traces the global itineraries of kepone (chlordecone), a now-banned organochlorine pesticide, from its manufacture in Hopewell, Virginia, through banana plantations in Martinique and Guadeloupe, to sites across France, Cameroon, Brazil, Poland, and beyond. Unlike existing environmental mapping tools that treat contamination as a localized hazard, MTE visualizes toxic exposure as a relational and transnational phenomenon—the product of interconnected commodity chains, imperial circuits, and regulatory asymmetries that distributed chemical harm unevenly across the globe.

The DSI Scholar will build the platform’s spatial infrastructure: georeferencing contamination data, demographic records, and regulatory boundaries for each node; implementing the graph database architecture; developing the interactive web interface with timeline and pathway navigation functionality; and integrating the GIS layer with the Omeka S archive so that spatial and archival evidence communicate seamlessly. The Scholar will work closely with the PI, whose fifteen years of ethnographic research on chlordecone provides the archival materials, theoretical framework, and community relationships that the platform digitizes. The project offers the Scholar hands-on experience building a novel relational GIS platform at the intersection of environmental justice, digital humanities, and critical data studies.


Quantifying Mitochondrial Defects

School: VP&S
Department: Neurology
Campus: CUIMC

Mitochondria produce most of the energy required for cellular function, and the organization of the inner mitochondrial membrane is a critical determinant of the efficiency and adaptability of this energy-generating machinery. Disruption of this architecture can impair oxidative phosphorylation and has been linked to severe human neurologic and neuromuscular disorders, including conditions caused by defects in ATP synthase dimerization. Despite its importance, assessment of inner mitochondrial membrane organization remains technically challenging. Electron microscopy is the gold standard for resolving mitochondrial sub-compartments at very high spatial resolution, but it is not suited to measuring dynamic functional parameters such as membrane potential and proton distribution in living cells.

This project will develop automated approaches for the analysis of super-resolution mitochondrial imaging data. The central objective is to establish a reproducible computational framework for image segmentation, spatial feature extraction, compartment-level quantification, and comparison across experimental conditions. These data will provide a foundation for identifying previously unrecognized mechanisms of mitochondrial dysfunction and may help define quantitative cellular phenotypes that support future mechanistic studies and therapeutic strategies aimed at restoring inner mitochondrial membrane organization and function.


Mapping Children’s Vulnerability

School: Climate School
Department: Climate School
Campus: Morningside

This project launches an exploration of two separate indices, developed by two unrelated entities, for adjacent purposes. The first index is the Children’s Climate Risk Index (CCRI) from the United Nations Children’s Fund (UNICEF), scheduled for updated release in 2026. That index maps where children are exposed to multiple climate hazards and then overlays child-specific vulnerabilities. The result is a visual display of climate-informed priority areas for UNICEF’s interventions, including enhanced social protection efforts and focused child welfare programming. The second index is the Climate Finance Vulnerability Index (CliF-VI), developed by Columbia University’s National Center for Disaster Preparedness (NCDP) in 2025. This index identifies countries facing significant barriers to accessing international climate finance, factoring in economic constraints such as debt sustainability.

Combining the two indices will expand the impact of UNICEF’s CCRI from identifying vulnerability to practically directing development action. Simultaneously, it will create an opportunity for the CliF-VI to be applied in a tangible setting, with actionable outcomes. The findings will highlight where high child vulnerability overlaps with deep financial constraints, drawing on available data from both indices.
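As a rough illustration of the overlap analysis described above, the sketch below flags countries that score high on both a child climate risk measure and a climate finance barrier measure. It assumes both scores have been rescaled to 0-1 with higher meaning worse. The cutoffs, field names, and sample values are hypothetical placeholders, not real CCRI or CliF-VI data or methodology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CountryScores:
    name: str
    child_risk: float       # hypothetical CCRI-style score, rescaled to 0-1
    finance_barrier: float  # hypothetical CliF-VI-style score, rescaled to 0-1


def overlap(countries, risk_cutoff=0.7, finance_cutoff=0.7):
    """Return names of countries at or above both cutoffs, worst combined score first."""
    flagged = [
        c for c in countries
        if c.child_risk >= risk_cutoff and c.finance_barrier >= finance_cutoff
    ]
    flagged.sort(key=lambda c: c.child_risk + c.finance_barrier, reverse=True)
    return [c.name for c in flagged]


# Placeholder values for illustration only; not real index data.
sample = [
    CountryScores("Country A", 0.82, 0.91),
    CountryScores("Country B", 0.75, 0.40),
    CountryScores("Country C", 0.30, 0.88),
    CountryScores("Country D", 0.90, 0.95),
]

print(overlap(sample))  # ['Country D', 'Country A']
```

A simple two-threshold rule is only one possible way to define overlap; ranking on a combined score or mapping both indices side by side would be equally reasonable starting points.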


How the Green Revolution Transformed Rural Energy

School: SEAS
Department: Mechanical Engineering
Campus: Morningside

The Green Revolution and agricultural investments from 1965 to 1985 allowed India to increase productivity faster than population growth. This also led to a wider structural transformation of livelihoods, incomes, and associated infrastructure. Using a historical lens (1960–2000), we will examine what drove this longer-term structural change.

The answer to this question is crucial for much of Sub-Saharan Africa, where Green Revolution-type adoption of high-yield varieties, fertilizer, irrigation, and market support did not occur. The study will use data science-driven techniques to develop a Shift-Share Instrumental Variable (IV) design, together with advanced optical character recognition and handwritten text recognition (OCR/HTR) methods, to answer it.

More critically, we will develop, implement, and validate tools to digitize handwritten archives; these tools have much wider applicability to historical archival documents. By analyzing priors (e.g., district-specific conditions), adoption of high-yield varieties, and outcomes such as household assets, income levels, and shifts toward urban employment, we aim to clarify to what extent electricity is a driver of transformation or a follower of agricultural practices.
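For readers unfamiliar with the shift-share design mentioned above, the instrument for each unit is typically built as a weighted sum of common shocks, z_i = Σ_k s_ik · g_k, where the shares s_ik are fixed at a baseline date. The sketch below shows that construction only. The choice of baseline crop shares as weights and national yield growth as shocks is an assumption made for illustration, as are all names and numbers; the project's actual shares and shocks are not specified here.

```python
def shift_share_instrument(shares, shocks):
    """Compute z_i = sum_k shares[i][k] * shocks[k] for every unit i.

    shares: {unit: {category: baseline share}}  (fixed before the study period)
    shocks: {category: aggregate shock}         (common to all units)
    """
    instrument = {}
    for unit, unit_shares in shares.items():
        instrument[unit] = sum(
            share * shocks[category] for category, share in unit_shares.items()
        )
    return instrument


# Placeholder numbers for illustration only; not project data.
baseline_crop_shares = {
    "district_1": {"wheat": 0.6, "rice": 0.3, "millet": 0.1},
    "district_2": {"wheat": 0.1, "rice": 0.2, "millet": 0.7},
}
national_yield_growth = {"wheat": 0.5, "rice": 0.25, "millet": 0.0}

z = shift_share_instrument(baseline_crop_shares, national_yield_growth)
for district, value in z.items():
    print(district, round(value, 3))
```

In an IV analysis, a value like `z` would then serve as the instrument for local adoption in a two-stage regression; that estimation step is omitted here.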


Measuring Ideological Change in American Universities

School: Arts and Sciences
Department: Political Science
Campus: Morningside

Despite intense public debate about ideological trends in higher education, researchers lack systematic data on how student political expression has evolved across institutions. This project aims to fill that gap by building a comprehensive, searchable corpus of college newspaper articles spanning multiple decades and hundreds of universities. 

Student newspapers are among the few publicly accessible, student-produced records found at nearly every American college and university. Many maintain digital archives reaching back decades, offering an extraordinarily rich but underutilized source for studying campus public opinion over time. A pilot study using 25,028 articles from the Columbia Daily Spectator (2000–2025) demonstrated that large-scale scraping and LLM-based classification can identify meaningful patterns in ideological orientation and genre composition. This project scales that approach nationally. 

This project sits at the intersection of computational social science, natural language processing, and the study of higher education. It offers the DSI Scholar hands-on experience with large-scale web scraping, data pipeline design, database architecture, and text-as-data methods, while contributing to a resource with broad scholarly value well beyond any single research question.


Predicting Biological Activity from Mass Spectrometry Data

School: Arts and Sciences
Department: Chemistry
Campus: Morningside

Natural products from plants remain one of the richest sources of drug leads, yet discovering bioactive compounds within complex extracts is slow, expensive, and largely guided by trial and error. Our laboratory has assembled a unique dataset of over 300 cannabis cultivars, each profiled by untargeted LC–MS/MS (liquid chromatography–tandem mass spectrometry), with several cultivars also screened in neuronal cell-based assays measuring neuroprotective activity relevant to ALS (amyotrophic lateral sclerosis). This paired chemical-biological dataset, linking mass spectral features to cell viability, neurite outgrowth, and oxidative stress, is rare in natural product research and well-suited for machine learning.

The DSI Scholar will design and implement a supervised learning pipeline that maps thousands of LC–MS/MS spectral features per sample to quantitative bioactivity scores across multiple assay readouts, a regression problem on real, labeled, high-dimensional data. This includes implementing and comparing approaches such as regularized regression, random forest, and gradient boosting, as well as building interpretable visualizations that allow chemists to understand and act on model predictions.
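To give a sense of the regression setting described above, the sketch below fits one of the listed approaches, regularized (ridge) regression, and scores it with k-fold cross-validation. It uses synthetic stand-in data with far more features than samples, mimicking the shape of the problem rather than real LC–MS/MS measurements. The function names, data dimensions, and alpha values are assumptions for illustration, and NumPy is assumed to be available.

```python
import numpy as np


def ridge_fit_predict(X_train, y_train, X_test, alpha):
    """Ridge regression in dual form, which is cheap when features >> samples."""
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc = X_train - x_mean
    # Solve (Xc Xc^T + alpha I) a = (y - y_mean); the matrix is only n x n.
    gram = Xc @ Xc.T
    dual = np.linalg.solve(gram + alpha * np.eye(len(y_train)), y_train - y_mean)
    return (X_test - x_mean) @ (Xc.T @ dual) + y_mean


def cv_rmse(X, y, alpha, n_folds=5, seed=0):
    """Mean root-mean-squared error over shuffled k-fold splits."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        pred = ridge_fit_predict(X[train_mask], y[train_mask], X[test_idx], alpha)
        errors.append(np.sqrt(np.mean((pred - y[test_idx]) ** 2)))
    return float(np.mean(errors))


# Synthetic stand-in data: 60 "samples", 500 "spectral features", 5 informative.
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 500))
true_weights = np.zeros(500)
true_weights[:5] = 2.0
y = X @ true_weights + rng.normal(scale=0.5, size=60)

for alpha in (0.1, 10.0, 1000.0):
    print(f"alpha={alpha:g}  CV RMSE={cv_rmse(X, y, alpha):.3f}")
```

The same cross-validation loop could wrap tree-based models such as random forest or gradient boosting, so that all candidates are compared on identical folds.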


An AI Chatbot for Cardiovascular Risk Reduction

School: School of Nursing
Department: Office of Scholarship and Research
Campus: CUIMC

Cardiovascular disease (CVD) is the leading cause of death worldwide, with over 70% of risk attributable to modifiable factors. Sexual minority adults bear a disproportionate burden of CVD risk factors compared to heterosexual adults, driven largely by chronic exposure to minority stress—psychosocial stressors arising from marginalization, including discrimination and internalized homophobia.

Despite this evidence, no existing interventions simultaneously address minority stress and CVD risk reduction in this population. *My Heart, My Pride* is a 12-week, theory-guided, nurse-led behavioral intervention adapted from an evidence-based CVD risk reduction program and tailored to middle-aged sexual minority adults (ages 40–64). While nurse-led interventions are effective, they are costly and resource-intensive, which constrains scalability. This pilot study aims to inform the design, safety protocols, and implementation strategy for a RAG-based AI chatbot to augment *My Heart, My Pride*.


Accelerating Public Health Research with Inferential AI

School: Mailman School of Public Health
Department: Environmental Health Sciences
Campus: CUIMC

Traditionally, high-resolution spatiotemporal mapping and inference of diseases and deaths, which are critically important for public health surveillance and intervention strategies, have relied heavily on statistical frameworks such as Gaussian Processes (GPs). Despite their flexibility, off-the-shelf GPs present serious computational challenges that limit their scalability and practical usefulness in applied settings (e.g., analyzing life expectancy or disease prevalence over time and space by census tract in the United States, or surveilling estimated daily infectious disease spread). Further, where efficient data collection is critical, such as in daily disease prevalence modeling, high-resolution inference at adequate speed would enable Bayesian Optimization (BO) and Active Learning (AL) techniques: areal units with large uncertainty could be targeted to update the model for the next iteration via a recently proposed Active Learning Sampling Design (A-LSD) method.

There is therefore a critical gap in tools that public health researchers can use to perform accurate and detailed spatiotemporal inference at the speed required for continued relevance in public health and policy. Our proposal will substantially contribute to filling this gap.
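The idea of targeting units with large uncertainty can be illustrated with plain uncertainty sampling on a small Gaussian Process. The sketch below is not the A-LSD method itself; it only computes the GP posterior variance at candidate locations and picks the largest. The one-dimensional locations, kernel choice, noise level, and function names are all assumptions for illustration, and NumPy is assumed to be available.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two sets of 1-D locations."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def posterior_variance(observed, candidates, noise=1e-2, length_scale=1.0):
    """GP posterior variance at each candidate, given where data were observed."""
    K = rbf_kernel(observed, observed, length_scale) + noise * np.eye(len(observed))
    K_star = rbf_kernel(observed, candidates, length_scale)
    # var(x*) = k(x*, x*) - k*^T K^{-1} k*, with k(x*, x*) = 1 for this kernel.
    return 1.0 - np.sum(K_star * np.linalg.solve(K, K_star), axis=0)


def next_unit_to_sample(observed, candidates, **kwargs):
    """Index of the candidate with the largest posterior variance."""
    return int(np.argmax(posterior_variance(observed, candidates, **kwargs)))


# Toy 1-D "locations"; real areal units would need a spatial (and temporal) kernel.
observed = np.array([0.0, 1.0, 2.0])
candidates = np.array([0.5, 1.5, 5.0])
print(next_unit_to_sample(observed, candidates))  # 2, the location farthest from the data
```

The linear solve against the full kernel matrix is the step whose cost grows cubically with the number of observations, which is the scalability bottleneck the text refers to.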


Cognitive Impairment in Solid Organ Transplant Recipients

School: VP&S
Department: Neurology
Campus: CUIMC

Over 48,000 organ transplants were performed in the United States in 2024, with 415 of those transplants occurring at NewYork-Presbyterian Hospital (NYP). Both nationally and at NYP, an increasing share of those transplant recipients are over 65 years of age. Cognitive impairment after transplant has been documented in kidney, liver, heart, and lung transplant recipients. Cognitive impairment has been shown to be associated with worse outcomes and poor quality of life in any clinical context, including transplant, and is more common in the elderly.

Despite this robust body of work, there is no scientific consensus on how prevalent cognitive impairment is among organ transplant recipients, what the clinical syndrome involves, whether there are additional risk factors besides age, which cognitive domains are affected, or whether there is any relationship between post-transplant cognitive impairment and dementia.