DSI Scholars Projects

Each spring and fall, the DSI Scholars program connects Columbia University students with select faculty-led projects that are seeking to apply data science methods to novel research problems. Scroll down for an overview of the program’s selected projects for Fall 2026. Detailed project descriptions, including Scholar responsibilities and qualifications, can be found below in the Project Details section.

Mapping Toxic Entanglements

School: Arts and Sciences
Department: Anthropology
Campus: Morningside

Mapping Toxic Entanglements (MTE) is a digital platform that traces the global itineraries of kepone (chlordecone), a now-banned organochlorine pesticide, from its manufacture in Hopewell, Virginia, through banana plantations in Martinique and Guadeloupe, to sites across France, Cameroon, Brazil, Poland, and beyond. Unlike existing environmental mapping tools that treat contamination as a localized hazard, MTE visualizes toxic exposure as a relational and transnational phenomenon—the product of interconnected commodity chains, imperial circuits, and regulatory asymmetries that distributed chemical harm unevenly across the globe.

The DSI Scholar will build the platform’s spatial infrastructure: georeferencing contamination data, demographic records, and regulatory boundaries for each node; implementing the graph database architecture; developing the interactive web interface with timeline and pathway navigation functionality; and integrating the GIS layer with the Omeka S archive so that spatial and archival evidence communicate seamlessly. The Scholar will work closely with the PI, whose fifteen years of ethnographic research on chlordecone provides the archival materials, theoretical framework, and community relationships that the platform digitizes. The project offers the Scholar hands-on experience building a novel relational GIS platform at the intersection of environmental justice, digital humanities, and critical data studies.

Quantifying Mitochondrial Defects

School: VP&S
Department: Neurology
Campus: CUIMC

Mitochondria produce most of the energy required for cellular function, and the organization of the inner mitochondrial membrane is a critical determinant of the efficiency and adaptability of this energy-generating machinery. Disruption of this architecture can impair oxidative phosphorylation and has been linked to severe human neurologic and neuromuscular disorders, including conditions caused by defects in ATP synthase dimerization. Despite its importance, assessment of inner mitochondrial membrane organization remains technically challenging. Electron microscopy remains the gold standard for resolving mitochondrial sub-compartments at very high spatial resolution, but it is not suited to measuring dynamic functional parameters such as membrane potential and proton distribution in living cells.

This project will develop automated approaches for the analysis of super-resolution mitochondrial imaging data. The central objective is to establish a reproducible computational framework for image segmentation, spatial feature extraction, compartment-level quantification, and comparison across experimental conditions. These data will provide a foundation for identifying previously unrecognized mechanisms of mitochondrial dysfunction and may help define quantitative cellular phenotypes that support future mechanistic studies and therapeutic strategies aimed at restoring inner mitochondrial membrane organization and function.

Mapping Children’s Vulnerability

School: Climate School
Department: Climate School
Campus: Morningside

This project launches an exploration of two separate indices, developed by two unrelated entities, for adjacent purposes. The first index is the Children’s Climate Risk Index (CCRI) from the United Nations Children’s Fund (UNICEF), scheduled for updated release in 2026. That index maps where children are exposed to multiple climate hazards and then overlays child-specific vulnerabilities. The result is a visual display of climate-informed priority areas for UNICEF’s interventions, including enhanced social protection efforts and focused child welfare programming. The second index is the Climate Finance Vulnerability Index (CliF-VI), developed by Columbia University’s National Center for Disaster Preparedness (NCDP) in 2025. This index identifies countries facing significant barriers to accessing international climate finance, factoring in economic constraints such as debt sustainability.

This approach will expand the impact of UNICEF’s CCRI from identifying vulnerability to practically directing development action. Simultaneously, it will create an opportunity for the CliF-VI to be applied in a tangible setting, with actionable outcomes. The findings will highlight where high child vulnerability overlaps with deep financial constraints, drawing on available data from both indices.

How the Green Revolution Transformed Rural Energy

School: SEAS
Department: Mechanical Engineering
Campus: Morningside

The Green Revolution and agricultural investments from 1965–1985 allowed India to increase productivity faster than population growth. This also led to a wider structural transformation of livelihoods, incomes, and associated infrastructure. Using a historical lens (1960–2000), we will examine the underlying drivers of this longer-term structural change.

The answer to this question is crucial for much of Sub-Saharan Africa, where Green Revolution-type adoption of high-yield varieties, fertilizer, irrigation, and market support did not occur. The study will use data-science-driven techniques to develop a Shift-Share Instrumental Variable (IV) design, along with advanced optical character recognition (OCR) and handwritten text recognition (HTR) methods, to answer this question.

More critically, we will develop, implement, and validate tools to digitize handwritten archives, tools that have much wider applicability to archival historical documents. By analyzing priors (e.g., district-specific conditions), adoption of high-yield varieties, and outcomes such as household assets, income levels, and shifts toward urban employment, we aim to clarify to what extent electricity is a driver of transformation or a follower of agricultural practices.

Measuring Ideological Change in American Universities

School: Arts and Sciences
Department: Political Science
Campus: Morningside

Despite intense public debate about ideological trends in higher education, researchers lack systematic data on how student political expression has evolved across institutions. This project aims to fill that gap by building a comprehensive, searchable corpus of college newspaper articles spanning multiple decades and hundreds of universities.

Student newspapers are among the few publicly accessible, student-produced records found at nearly every American college and university. Many maintain digital archives reaching back decades, offering an extraordinarily rich but underutilized source for studying campus public opinion over time. A pilot study using 25,028 articles from the Columbia Daily Spectator (2000–2025) demonstrated that large-scale scraping and LLM-based classification can identify meaningful patterns in ideological orientation and genre composition. This project scales that approach nationally.

This project sits at the intersection of computational social science, natural language processing, and the study of higher education. It offers the DSI Scholar hands-on experience with large-scale web scraping, data pipeline design, database architecture, and text-as-data methods, while contributing to a resource with broad scholarly value well beyond any single research question.

Predicting Biological Activity from Mass Spectrometry Data

School: Arts and Sciences
Department: Chemistry
Campus: Morningside

Natural products from plants remain one of the richest sources of drug leads, yet discovering bioactive compounds within complex extracts is slow, expensive, and largely guided by trial and error. Our laboratory has assembled a unique dataset of over 300 cannabis cultivars, each profiled by untargeted LC–MS/MS (liquid chromatography–tandem mass spectrometry), with several cultivars also screened in neuronal cell-based assays measuring neuroprotective activity relevant to ALS (amyotrophic lateral sclerosis). This paired chemical-biological dataset, linking mass spectral features to cell viability, neurite outgrowth, and oxidative stress, is rare in natural product research and well-suited for machine learning.

The DSI Scholar will design and implement a supervised learning pipeline that maps thousands of LC–MS/MS spectral features per sample to quantitative bioactivity scores across multiple assay readouts, a regression problem on real, labeled, high-dimensional data. This includes implementing and comparing approaches such as regularized regression, random forest, and gradient boosting, as well as building interpretable visualizations that allow chemists to understand and act on model predictions.

An AI Chatbot for Cardiovascular Risk Reduction

School: School of Nursing
Department: Office of Scholarship and Research
Campus: CUIMC

Cardiovascular disease (CVD) is the leading cause of death worldwide, with over 70% of risk attributable to modifiable factors. Sexual minority adults bear a disproportionate burden of CVD risk factors compared to heterosexual adults, driven largely by chronic exposure to minority stress—psychosocial stressors arising from marginalization, including discrimination and internalized homophobia.

Despite this evidence, no existing interventions simultaneously address minority stress and CVD risk reduction in this population. *My Heart, My Pride* is a 12-week, theory-guided, nurse-led behavioral intervention adapted from an evidence-based CVD risk reduction program and tailored to middle-aged sexual minority adults (ages 40–64). While nurse-led interventions are effective, they are costly and resource-limited, constraining scalability. This pilot study aims to inform the design, safety protocols, and implementation strategy for a RAG-based AI chatbot to augment *My Heart, My Pride*.

Accelerating Public Health Research with Inferential AI

School: Mailman School of Public Health
Department: Environmental Health Sciences
Campus: CUIMC

Traditionally, high-resolution spatiotemporal mapping and inference of diseases and deaths, critically important for public health surveillance and intervention strategies, have relied heavily on statistical frameworks such as Gaussian Processes (GPs). Despite their flexibility, off-the-shelf GPs present serious computational challenges, which limit their scalability and practical usefulness in applied settings (e.g., analyzing life expectancy or disease prevalence over time and space by census tract in the United States, or surveilling estimated daily infectious disease spread). Further, where efficient data collection is critical, such as in daily disease prevalence modeling, making such high-resolution inferences at adequate speed would enable Bayesian optimization (BO) and active learning (AL) techniques: areal units with large uncertainty could be targeted to update the model for the next iteration via a recently proposed Active Learning Sampling Design (A-LSD) method.

There is therefore currently a critical gap in tools which can be used by public health researchers to perform accurate and detailed spatiotemporal inference at the speed required for continued relevance in public health and policy. Our proposal will substantially contribute to filling this gap.

Cognitive Impairment in Solid Organ Transplant Recipients

School: VP&S
Department: Neurology
Campus: CUIMC

Over 48,000 organ transplants were performed in the United States in 2024, with 415 of those transplants occurring at New York Presbyterian Hospital (NYP). Both nationally and at NYP, an increasing share of those transplant recipients are over 65 years of age. Cognitive impairment after transplant has been documented in kidney, liver, heart, and lung transplant recipients. Cognitive impairment has been shown to be associated with worse outcomes and poor quality of life in any clinical context, including transplant, and is more common in the elderly.

Despite this robust body of work, there is no scientific consensus on how prevalent cognitive impairment in organ transplant recipients is, what the clinical syndrome involves, whether there are additional risk factors besides age, what cognitive domains are affected, and if there is any relationship between post-transplant cognitive impairment and dementia.

Project Details

Mapping Toxic Entanglements

School: Arts and Sciences
Department: Anthropology
Campus: Morningside

Project Overview:
The platform’s organizing principle is itinerary rather than location. Each site is connected by the routes through which kepone traveled: commodity chains, shipping routes, regulatory circuits, migration pathways, and the bioaccumulative movement of the chemical through soil, water, food, and bodies. This requires a graph database architecture (Neo4j) that models kepone’s itinerary as a network of nodes (sites) and edges (routes of movement, regulatory relationships, ownership chains), integrated with an interactive GIS front end (Leaflet/MapLibre) and a digital archive (Omeka S).

DSI Scholar Responsibilities:

• Build and configure the Neo4j graph database architecture, modeling kepone’s itinerary as a network of nodes (Hopewell, Martinique, Cameroon, Brazil, metropolitan France, Poland) and edges (commodity chains, regulatory relationships, migration pathways, ownership chains)

• Georeference spatial data for each node: contamination maps, demographic data, industrial land use records, and regulatory boundaries using QGIS

• Develop the interactive web-based GIS front end using Leaflet or MapLibre, including timeline functionality and pathway navigation

• Integrate the GIS layer with the Omeka S digital archive so users can move between map and archival evidence

• Implement the multi-register toggle allowing users to switch between views organized by spatial contamination data, commodity chain documentation, regulatory records, epidemiological studies, and oral testimonies

• Document code and architecture decisions for handoff and future development
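
As a rough illustration of the node-and-edge model described above, the following is a minimal in-memory sketch in Python, not the Neo4j implementation itself. The site names come from the project description; the specific links, the route labels, and the helper names `build_graph` and `find_itinerary` are illustrative assumptions rather than archival findings.

```python
from collections import defaultdict, deque

# Hypothetical edge list: (source site, target site, route type).
# The links and labels below are placeholders for illustration only.
EDGES = [
    ("Hopewell", "Martinique", "commodity_chain"),
    ("Hopewell", "Guadeloupe", "commodity_chain"),
    ("Martinique", "Metropolitan France", "migration_pathway"),
    ("Metropolitan France", "Martinique", "regulatory_relationship"),
]


def build_graph(edges):
    """Return an adjacency map: site -> list of (neighbor, route_type)."""
    graph = defaultdict(list)
    for source, target, route_type in edges:
        graph[source].append((target, route_type))
    return graph


def find_itinerary(graph, start, goal):
    """Breadth-first search for a shortest directed path from start to goal."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        # .get avoids creating empty entries for sites with no outgoing edges.
        for neighbor, _route_type in graph.get(path[-1], []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no directed route between the two sites


graph = build_graph(EDGES)
itinerary = find_itinerary(graph, "Hopewell", "Metropolitan France")
print(itinerary)
```

In a graph database the same structure would be stored as labeled nodes and typed relationships, and the path search would be expressed as a query rather than hand-written traversal code.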

CANDIDATE REQUIREMENTS

Required Skills:

• Proficiency in Python or JavaScript

• Experience with GIS tools (QGIS, Leaflet, or MapLibre)

• Familiarity with graph databases (Neo4j preferred) or willingness to learn

• Web development skills (HTML/CSS/JS, ideally a framework like React or Vue)

Preferred Skills:

• Experience with Omeka S or similar digital humanities platforms

• Familiarity with relational/network data modeling

• Interest in environmental justice, digital humanities, or critical data studies

Student Level:

• Master’s

• Senior

• Junior

Quantifying Mitochondrial Defects

School: VP&S
Department: Neurology
Campus: CUIMC

Project Overview:
My research focuses on the application of modern super-resolution microscopy to visualize the relationship between mitochondrial architecture and energetic state in living cells. Using control and disease-model cell lines with ATP synthase dimerization defects, as well as acute metabolic stress paradigms, we have generated imaging datasets that capture both structural features of the inner mitochondrial membrane and functional readouts related to membrane potential. Preliminary findings suggest that dimerization defects are associated with reproducible abnormalities in the spatial organization of these signals, but extracting the full biological significance of these datasets requires robust and scalable computational analysis.

DSI Scholar Responsibilities:
The DSI Scholar will contribute to the development of a computational workflow for the analysis of super-resolution mitochondrial imaging datasets. At the beginning of the project, the student will become familiar with the structure of the image files, associated metadata, and the biological context of the experimental conditions, including control cell lines, disease models with ATP synthase dimerization defects, and datasets generated under acute metabolic stress.

A primary responsibility will be to help develop and refine automated feature extraction pipelines using Python or similar computational tools. This will include working with image-derived data to quantify relevant structural and functional features, organize outputs into analyzable datasets, and improve reproducibility and scalability of the analysis workflow. The student will also assist in data cleaning, quality control, and integration of imaging features with experimental metadata.

In addition, the scholar will perform comparative statistical analyses across experimental groups to identify patterns associated with distinct energetic states and disease conditions. Depending on progress and dataset maturity, the project may also include implementation of predictive modeling or machine learning approaches to classify mitochondrial energy states and distinguish disease versus control phenotypes based on quantitative image-derived features.
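
To make the segmentation and feature extraction steps concrete, here is a minimal sketch in plain Python, assuming a simple intensity threshold and 4-connected regions. The project's actual methods are not specified here; the threshold, the toy image, and the function names `segment` and `region_features` are hypothetical, and real super-resolution data would call for dedicated image analysis libraries.

```python
def segment(image, threshold):
    """Label 4-connected regions of pixels whose intensity exceeds threshold."""
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] <= threshold or labels[r][c]:
                continue  # background pixel, or already assigned to a region
            next_label += 1
            labels[r][c] = next_label
            stack = [(r, c)]
            while stack:  # flood fill from the seed pixel
                y, x = stack.pop()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] > threshold
                            and not labels[ny][nx]):
                        labels[ny][nx] = next_label
                        stack.append((ny, nx))
    return labels, next_label


def region_features(image, labels, n_regions):
    """Return area (pixel count) and mean intensity for each labeled region."""
    area = [0] * (n_regions + 1)
    total = [0.0] * (n_regions + 1)
    for image_row, label_row in zip(image, labels):
        for value, label in zip(image_row, label_row):
            if label:
                area[label] += 1
                total[label] += value
    return [
        {"label": k, "area": area[k], "mean_intensity": total[k] / area[k]}
        for k in range(1, n_regions + 1)
    ]


# Toy 3x5 "image" with two bright structures; values are made up.
IMAGE = [
    [0, 9, 9, 0, 0],
    [0, 9, 0, 0, 4],
    [0, 0, 0, 4, 4],
]
labels, n_regions = segment(IMAGE, threshold=2)
features = region_features(IMAGE, labels, n_regions)
print(features)
```

Each feature row could then be joined to experimental metadata (cell line, stress condition) before the comparative statistics described above.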

CANDIDATE REQUIREMENTS

Required Skills:

• Applicants should have basic coding experience in Python or a similar language, along with a working knowledge of basic statistics and quantitative data analysis.

• Experience with data visualization, data cleaning, and handling structured datasets is preferred.

• Familiarity with image analysis tools such as ilastik or FIJI, as well as dimensionality reduction or exploratory analytic approaches such as PCA, UMAP, or related methods, would be advantageous but is not required.

• The ideal student will be detail-oriented, organized, and interested in applying data science approaches to translational biomedical research.

Student Level:

• Master’s

• Senior

Mapping Children’s Vulnerability

School: Climate School
Department: Climate School
Campus: Morningside

Project Overview:
By conducting a cross-index correlation analysis, this project will be able to identify geographic locations facing significant gaps in adaptive capacity. It seeks to determine whether the countries with the most acute child-specific risks have the capacity to access the development and adaptation funds required to mitigate them.

This analysis will be the first step toward creating a decision-support tool that identifies countries where children are most at risk from multiple hazards and where climate finance is most urgently needed. We aim to translate this initial research into a framework that will inform development partners and/or governments with an integrated (economic, climate, and ethical) argument to ensure investments are pragmatic and targeted where they are most urgently needed.

DSI Scholar Responsibilities:
The student will begin with data discovery and exploratory data analysis (EDA) to gain familiarity with both datasets, collaborating directly with leading scholars in NCDP at Columbia University and the leading CCRI expert at UNICEF. From there, the student will conduct an interoperability assessment to evaluate structural barriers and identify synergies for merging the two distinct datasets. To achieve this integration, the student will use advanced analytics: executing data harmonization pipelines, followed by cross-dataset correlation analysis and statistical modeling. Finally, the student will ensure methodological validation by running sensitivity analyses to systematically evaluate how different harmonization strategies impact the downstream statistical models.
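
The harmonization and cross-dataset correlation steps might look roughly like the sketch below, which assumes both indices can be keyed by a shared country code and compared with a Spearman rank correlation. The country codes and scores are made-up placeholders, the direction and scale of each index are assumptions, and the real CCRI and CliF-VI data structures may differ.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Placeholder scores keyed by placeholder country codes (not real data).
ccri = {"AAA": 8.1, "BBB": 6.4, "CCC": 3.2, "DDD": 7.0}
clifvi = {"AAA": 0.9, "BBB": 0.5, "CCC": 0.2, "EEE": 0.7}

# Harmonization step: keep countries present in both indices, report the rest.
shared = sorted(ccri.keys() & clifvi.keys())
unmatched = sorted(ccri.keys() ^ clifvi.keys())

rho = spearman([ccri[c] for c in shared], [clifvi[c] for c in shared])
print(shared, unmatched, rho)
```

Tracking the unmatched countries explicitly is one simple way to surface interoperability gaps, and rerunning the correlation under different matching rules is a basic form of the sensitivity analysis mentioned above.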

CANDIDATE REQUIREMENTS

Required Skills:

• Python: Proficiency for data pipeline development and automation.

• Geospatial Analysis: Advanced experience in spatial statistics and hazard modeling.

• Open-Source GIS Libraries: Proficiency in the Python geospatial stack, specifically GeoPandas and Shapely for vector operations; Rasterio and Xarray for raster processing; and PySAL for spatial econometric analysis.

• Cloud Computing: Familiarity with Google Earth Engine/GCP and Azure for handling large-scale environmental datasets.

Student Level:

• Master’s

How the Green Revolution Transformed Rural Energy

School: SEAS
Department: Mechanical Engineering
Campus: Morningside

Project Overview:
The Green Revolution and agricultural investments from 1965–1985 allowed India to increase productivity faster than population growth. This also led to a wider structural transformation of livelihoods, incomes, and associated infrastructure. Using a historical lens (1960–2000), we will examine the underlying drivers of this longer-term structural change. The answer to this question is crucial for much of Sub-Saharan Africa, where Green Revolution-type adoption of high-yield varieties, fertilizer, irrigation, and market support did not occur. The study will use data-science-driven techniques to develop a Shift-Share Instrumental Variable (IV) design, along with advanced optical character recognition (OCR) and handwritten text recognition (HTR) methods, to answer this question.

More critically, we will develop, implement, and validate tools to digitize handwritten archives, tools that have much wider applicability to archival historical documents. By analyzing priors (e.g., district-specific conditions), adoption of high-yield varieties, and outcomes such as household assets, income levels, and shifts toward urban employment, we aim to clarify to what extent electricity is a driver of transformation or a follower of agricultural practices.
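
For readers unfamiliar with shift-share designs, the instrument for each district is a weighted sum of common shocks, with weights given by that district's baseline exposure. The sketch below shows only this construction step. The choice of baseline crop shares as exposure and national adoption growth as the shock is an assumption made for illustration, as are all numbers and names; the project's actual specification is not described here.

```python
def shift_share_instrument(shares, shifts):
    """Return {district: sum over k of shares[district][k] * shifts[k]}."""
    instrument = {}
    for district, district_shares in shares.items():
        instrument[district] = sum(
            share * shifts[crop] for crop, share in district_shares.items()
        )
    return instrument


# Hypothetical baseline crop-area shares per district (each row sums to 1).
BASELINE_SHARES = {
    "district_a": {"wheat": 0.5, "rice": 0.25, "millet": 0.25},
    "district_b": {"wheat": 0.0, "rice": 0.5, "millet": 0.5},
}

# Hypothetical national shocks, e.g. growth in high-yield variety adoption.
NATIONAL_SHIFTS = {"wheat": 0.8, "rice": 0.4, "millet": 0.0}

instrument = shift_share_instrument(BASELINE_SHARES, NATIONAL_SHIFTS)
print(instrument)
```

In this toy example the wheat-heavy district receives a larger predicted exposure than the district with no wheat, which is the kind of variation a shift-share design relies on.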

CANDIDATE REQUIREMENTS

Required Skills and Experience:

• Proficiency in Python (OCR pipelines/CV) and R or Stata (econometrics).

• Experience with unstructured data or OCR libraries (Tesseract, OpenCV).

• Understanding of Instrumental Variables (IV) and causal inference.

• Strong attention to detail for data validation of historical records.

• Experience in geo-spatial analysis

• Familiarity with the subject domain (expertise not required)

Student Level:

• Master’s

Measuring Ideological Change in American Universities

School: Arts and Sciences
Department: Political Science
Campus: Morningside

Project Overview:
The DSI Scholar will help develop a robust, generalizable web-scraping pipeline capable of extracting article text and metadata from diverse college newspaper websites and archive platforms. The resulting corpus will be stored in a structured, searchable database that allows filtering by institution, keyword, time period, and other metadata fields. The long-term goal is to produce a public research infrastructure – a single repository where scholars across disciplines can access historical college newspaper text for research on campus discourse, student activism, media framing, and political change.

DSI Scholar Responsibilities:
The DSI Scholar will be responsible for the following tasks over the course of the semester:

• Landscape audit of college newspaper archives. The Scholar will survey and catalog digital archives across a target list of universities, documenting URL structures, archive platforms (e.g., WordPress, custom CMS, library-hosted databases), access restrictions, and available metadata fields. This audit will inform the design of a generalizable scraping strategy.

• Development of a scalable web-scraping pipeline. Building on the pilot scraper used for the Columbia Daily Spectator, the Scholar will develop a modular Python-based pipeline (using tools such as Playwright, BeautifulSoup, and Scrapy) capable of handling diverse site architectures. The pipeline should extract article text, publication date, author, section/genre tags, and URL, with built-in error handling, checkpointing, and logging for long-running scrapes.

• Database design and construction. The Scholar will design and implement a structured database (e.g., PostgreSQL or SQLite) to store the scraped corpus with consistent metadata schema across institutions. The database should support efficient querying by institution, keyword, date range, and other fields.

• Initial corpus collection. The Scholar will execute the pipeline on a first wave of target newspapers, prioritizing institutions that vary by type (e.g., elite private, public flagship, liberal arts, religious), region, and archive depth. Quality checks will be conducted to ensure text integrity and metadata consistency.

• Documentation and reproducibility. The Scholar will produce clear documentation of all code, data collection procedures, and database schema to ensure the pipeline can be maintained, extended to additional institutions, and used by other researchers in future stages of the project.

Depending on progress, the Scholar may also assist with preliminary text classification (e.g., genre tagging using LLMs) or the development of a simple search interface for the corpus.
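
One possible shape for the corpus database described above is sketched here with SQLite from the Python standard library. The table layout, column names, example record, and helper functions are all hypothetical. A unique constraint on the URL lets a resumed scrape skip articles that are already stored, which is one simple form of checkpointing.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    id          INTEGER PRIMARY KEY,
    institution TEXT NOT NULL,
    url         TEXT NOT NULL UNIQUE,  -- prevents duplicate rows on re-scrape
    published   TEXT,                  -- ISO 8601 date, e.g. 2001-03-14
    author      TEXT,
    section     TEXT,
    body        TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_articles_institution_date
    ON articles (institution, published);
"""


def open_corpus(path=":memory:"):
    """Open (or create) the corpus database and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def store_article(conn, article):
    """Insert one article; return False if its URL is already stored."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO articles "
        "(institution, url, published, author, section, body) "
        "VALUES (:institution, :url, :published, :author, :section, :body)",
        article,
    )
    conn.commit()
    return cursor.rowcount == 1


def search(conn, institution, start, end, keyword):
    """Return URLs of matching articles, filtered by date range and keyword."""
    rows = conn.execute(
        "SELECT url FROM articles "
        "WHERE institution = ? AND published BETWEEN ? AND ? AND body LIKE ? "
        "ORDER BY published",
        (institution, start, end, f"%{keyword}%"),
    )
    return [row[0] for row in rows]


# Made-up record for demonstration only.
ARTICLE = {
    "institution": "Example University",
    "url": "https://example.edu/news/2001/03/budget-vote",
    "published": "2001-03-14",
    "author": "A. Student",
    "section": "News",
    "body": "The student council voted on the budget on Tuesday.",
}

conn = open_corpus()
first_insert = store_article(conn, ARTICLE)
second_insert = store_article(conn, ARTICLE)  # same URL, so it is skipped
hits = search(conn, "Example University", "2000-01-01", "2005-12-31", "budget")
print(first_insert, second_insert, hits)
```

Storing dates as ISO 8601 text keeps range queries simple, since lexical order matches chronological order. A production corpus would likely add full-text indexing rather than relying on `LIKE`.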

CANDIDATE REQUIREMENTS

Required Skills:

• Proficiency in Python, including experience with web scraping (e.g., BeautifulSoup, Scrapy, Playwright, or Selenium). Familiarity with database design (SQL or equivalent).

• Experience working with APIs, particularly LLM APIs (e.g., OpenAI), is a plus but not required.

• The ideal candidate is comfortable with messy, real-world data collection challenges (broken links, inconsistent HTML structures, rate limits) and can work independently to troubleshoot and iterate.

• Interest in political science, media, or text-as-data methods is welcome but not necessary.

Student Level:

• Master’s

• Senior

• Junior

Predicting Biological Activity from Mass Spectrometry Data

School: Arts and Sciences
Department: Chemistry
Campus: Morningside

Project Overview:

The central challenge of this work is that 75–85% of molecular features detected by LC–MS/MS cannot be matched to known compounds in public databases. Rather than treating this as a barrier, we treat it as the central question: can a model learn which patterns in the spectral data predict biological activity, even without knowing the underlying molecular structures?

All experimental data have already been generated, enabling the student to focus entirely on modeling and evaluation. Model outputs will directly inform which HPLC fractions are prioritized for follow-up bioassays, connecting computational predictions to experimental decisions.

This approach could be extended to other systems where biological activity can be measured, but the underlying chemistry is not fully characterized. All code and data will be made publicly available.

DSI Scholar Responsibilities:
The DSI Scholar will work directly with Dr. Fereshteh Zandkarimi (Co-PI, mass spectrometry expert) and will be responsible for the following:

• Data preprocessing: Develop Python scripts to load, clean, and normalize LC–MS/MS spectral data from 300+ cannabis extract files (mzML/mzXML format), including handling missing values and normalizing intensities across samples.

• Spectral feature encoding: Implement and compare approaches to convert raw mass spectra into numerical vectors for modeling (e.g., binned intensity profiles or peak-based feature matrices).

• Bioactivity prediction: Train and evaluate supervised regression models (regularized regression, random forest, gradient boosting) using spectral features to predict neuroprotective activity scores from cell-based assays, with performance assessed on held-out samples.

• Visualization: Build interpretable visualizations of model outputs, including dimensionality-reduced chemical space maps and feature importance plots, to help the experimental team understand which spectral features drive predictions.

• Fraction prioritization: Build a simple ranking script that scores new extracts by predicted bioactivity, giving experimentalists a prioritized list of HPLC fractions to test next in the lab.

• Final report: Prepare a short written summary of methods, results, and recommendations for next steps, suitable for inclusion in a manuscript in preparation, and for the next phase of the project.
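
As an example of the spectral feature encoding step, a binned intensity profile can be built by summing peak intensities within fixed-width m/z windows and normalizing by the total. The sketch below assumes peaks have already been read into `(m/z, intensity)` pairs. The m/z range, bin width, peak values, and the function name `bin_spectrum` are arbitrary illustrations, not the laboratory's settings.

```python
import math


def bin_spectrum(peaks, mz_min, mz_max, bin_width):
    """Sum intensities into fixed-width m/z bins and normalize to sum to 1.

    Peaks outside [mz_min, mz_max) are ignored. An empty spectrum yields
    an all-zero vector.
    """
    n_bins = math.ceil((mz_max - mz_min) / bin_width)
    vector = [0.0] * n_bins
    for mz, intensity in peaks:
        if mz_min <= mz < mz_max:
            # min() guards against floating-point edge cases at the top bin.
            index = min(int((mz - mz_min) / bin_width), n_bins - 1)
            vector[index] += intensity
    total = sum(vector)
    return [value / total for value in vector] if total else vector


# Made-up peak list: (m/z, intensity).
PEAKS = [(100.2, 50.0), (100.7, 50.0), (250.0, 100.0), (900.0, 30.0)]

profile = bin_spectrum(PEAKS, mz_min=100.0, mz_max=300.0, bin_width=50.0)
print(profile)
```

Stacking one such vector per sample gives a feature matrix that standard regression models can consume. Narrower bins preserve more chemical detail at the cost of higher dimensionality.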

CANDIDATE REQUIREMENTS

Required Skills:

• Experience with Python for data analysis and manipulation

• Basic understanding of machine learning concepts and ability to apply standard methods

• Experience working with structured datasets, including data cleaning, handling missing values, and normalization

• Familiarity with data visualization in Python

• Ability to write clear, organized, and reproducible code

The scholar will work closely with experts in mass spectrometry and chemistry. No prior background in mass spectrometry or natural product chemistry is required; we will provide the necessary scientific context.

Student Level:

• Master’s

• Senior

An AI Chatbot for Cardiovascular Risk Reduction

School: School of Nursing
Department: Office of Scholarship and Research
Campus: CUIMC

Project Overview:
Advances in large language models (LLMs), particularly retrieval-augmented generation (RAG)-based systems, offer opportunities to enhance the accessibility of behavioral interventions by grounding chatbot responses in verified, domain-specific knowledge.
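
The grounding idea behind RAG can be sketched in a few lines: split vetted material into chunks, retrieve the chunk most similar to the user's question, and place it in the prompt. The toy example below uses bag-of-words cosine similarity in place of the learned embeddings a real system would use. The passages are placeholders rather than intervention content, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def bag_of_words(text):
    """Lowercased word counts, ignoring punctuation and digits."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, top_k=1):
    """Return the top_k chunks most similar to the question."""
    query = bag_of_words(question)
    ranked = sorted(chunks, key=lambda c: cosine(query, bag_of_words(c)),
                    reverse=True)
    return ranked[:top_k]


def build_prompt(question, chunks):
    """Assemble a prompt that restricts the model to retrieved context."""
    context = "\n".join(f"- {chunk}" for chunk in retrieve(question, chunks))
    return ("Answer using only the context below. "
            "If the context is insufficient, say so.\n"
            f"Context:\n{context}\n\nQuestion: {question}")


# Placeholder passages standing in for verified program materials.
KNOWLEDGE_BASE = [
    "Placeholder passage about weekly physical activity goals.",
    "Placeholder passage about coping with discrimination related stress.",
    "Placeholder passage about blood pressure monitoring.",
]
QUESTION = "How do I set physical activity goals?"

prompt = build_prompt(QUESTION, KNOWLEDGE_BASE)
print(prompt)
```

The resulting prompt would then be sent to a language model. Safety guardrails and evaluation, which this sketch omits, are central concerns for a health-related deployment.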

DSI Scholar Responsibilities:
In Aim 1, we will conduct semi-structured qualitative interviews with intervention participants (n=15) and primary care clinicians (n=10) to evaluate perceptions of chatbot use, elicit design preferences, and identify safety considerations. In Aim 2, we will transcribe and analyze nurse coaching session recordings using directed content analysis to develop a behavioral taxonomy of coaching strategies, minority stress coping techniques, and cultural tailoring approaches. This taxonomy, combined with stakeholder input, clinical data, and evidence-based resources, will be used to train and preliminarily test a RAG-based chatbot prototype. Usability testing with 5–10 participants will assess feasibility, acceptability, and refinement needs. This study will be among the first to integrate AI augmentation into a minority stress-informed CVD risk reduction intervention, with findings informing a future efficacy trial and advancing scalable strategies to reduce cardiovascular health disparities.

CANDIDATE REQUIREMENTS

Required Skills:

• Proficiency in Python and experience with LLM frameworks such as RAG orchestration tools

• Experience with prompt engineering, including system prompt design, few-shot prompting, and response optimization for domain-specific applications

• Understanding of natural language processing fundamentals including text preprocessing, chunking strategies, and tokenization

• Exposure to LLM evaluation methods, including both automated metrics and human-in-the-loop assessment of response quality, safety, and fidelity

• Familiarity with responsible AI practices, including bias mitigation, content filtering, and safety guardrail implementation

• Experience working with sensitive or health-related data in secure computing environments is preferred but not required

• Strong written and verbal communication skills, with the ability to document technical workflows and contribute to research deliverables

Student Level:

• Master’s

• Senior

Accelerating Public Health Research with Inferential AI

School: Mailman School of Public Health
Department: Environmental Health Sciences
Campus: CUIMC

Project Overview:
Our team has developed a new computational framework, inferential artificial intelligence (iAI), which combines the variational autoencoder (VAE) model from deep learning with Bayesian inference to accelerate spatiotemporal public health research. The method, called PriorVAE, and its most recent extension, PriorCVAE, achieve many orders of magnitude greater sampling efficiency than commonly used Markov Chain Monte Carlo (MCMC) methods.

DSI Scholar Responsibilities:
We will: (i) pretrain and disseminate a PriorVAE model for the United States to demonstrate the usability and scalability of spatiotemporal analysis; (ii) apply these tools to critical use cases, including estimating patterns and trends in census tract-level life expectancy in the United States over recent decades; and (iii) explore how the A-LSD method can provide efficient strategies for new data collection.
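
The intuition behind uncertainty-targeted data collection can be illustrated with a deliberately simple rule: given posterior draws for each areal unit, sample next from the units with the largest posterior variance. This is only a sketch of the general idea and not the A-LSD method itself. The tract identifiers, draws, and the function name `most_uncertain_units` are made up.

```python
from statistics import variance


def most_uncertain_units(posterior_draws, k):
    """Return the k areal units with the largest posterior sample variance."""
    spread = {unit: variance(draws) for unit, draws in posterior_draws.items()}
    return sorted(spread, key=spread.get, reverse=True)[:k]


# Made-up posterior draws of life expectancy (years) for four tracts.
POSTERIOR_DRAWS = {
    "tract_1": [78.0, 78.2, 77.8, 78.1],
    "tract_2": [74.0, 79.0, 71.5, 77.0],
    "tract_3": [80.1, 80.0, 79.9, 80.2],
    "tract_4": [76.0, 77.5, 75.0, 78.0],
}

targets = most_uncertain_units(POSTERIOR_DRAWS, k=2)
print(targets)
```

A loop of this kind is only practical when each model update is fast, which is why inference speed matters for active learning.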

CANDIDATE REQUIREMENTS

Required Skills:

We are looking for students with a high degree of competency in computer science and related methods, and an interest in public health applications.

Core technical skills

• Strong programming in R and/or Python (data manipulation, modeling, reproducible workflows)

• Experience with Bayesian statistics (hierarchical models, CAR/ICAR/BYM structures, uncertainty quantification)

• Familiarity with machine learning / deep learning, ideally including variational methods (e.g., VAEs)

Spatial & data science skills

• Experience working with spatial data (shapefiles, adjacency matrices, GIS concepts)

• Ability to handle large spatiotemporal datasets (panel data, merging across sources)

• Knowledge of data preprocessing and cleaning pipelines

Modeling & evaluation

• Experience implementing and comparing statistical models (e.g., Bayesian vs ML approaches)

• Understanding of model validation, benchmarking, and performance metrics

• Ability to interpret and visualize uncertainty in spatial estimates

Software & reproducibility

• Experience with version control (e.g., Git / GitHub)

• Ability to write clean, well-documented, and reproducible code

• Familiarity with packaging or sharing code (R packages or Python modules)

Desirable (not required)

• Experience with spatiotemporal epidemiology or public health data

• Familiarity with MCMC methods and tools (e.g., Stan, INLA)

• Exposure to active learning or adaptive sampling methods

• Experience working with U.S. Census or mortality datasets

Soft skills

• Ability to work independently and meet milestones (~10 hrs/week)

• Clear communication of methods and results to an interdisciplinary team

Student Level (Please select all that apply)

• Master’s

• Senior

School: VP&S
Department: Neurology
Campus: CUIMC

Project Overview:
To address our research questions, we have assembled a multidisciplinary team and plan to perform a retrospective chart review of patients who received a solid organ transplant at NYPP between 2012 and 2025. We will create two cohorts: those who had a clinical encounter at the Division of Aging and Dementia (i.e., clinical concern for cognitive impairment) after their transplant, and those who did not. We will create profiles for the patients with clinical concern for cognitive impairment, including type of organ transplant, age at transplant, gender, use of different immunosuppressive medications, comorbid psychiatric disease, and the presence of other medical comorbidities. We will determine the prevalence of clinical cognitive impairment in transplant recipients by type of organ transplant, and further analyze how that prevalence has changed over time. We will additionally determine whether there is any association between post-transplant cognitive impairment and poor health outcomes, including dementia diagnoses. If patients underwent formal neuropsychiatric testing, we will determine whether there are any patterns in cognitive performance.

DSI Scholar Responsibilities:

• Data analysis: The data discussed above have been obtained from EPIC using a TRAC request. The DSI scholar will write code in Stata and/or other programs to perform descriptive statistical evaluations of the patients with post-transplant cognitive impairment and those without. Additionally, the scholar will determine whether there are significant correlations between cognitive impairment and the clinical variables mentioned above, as well as between cognitive impairment and clinical outcomes (hospital admissions, ED visits, organ rejection, mortality rates). The scholar will have some discretion to choose appropriate analyses and, in conjunction with the faculty mentor and co-authors, can determine whether additional analyses are appropriate. If the scholar is interested in answering a specific question not addressed above using the available data, we will support a subgroup analysis or other sub-project.

• Manuscript preparation: The DSI scholar will draft the statistical analysis portion of the methods section and produce tables or other graphical representations of the above analysis for a manuscript. The manuscript will be submitted for publication, and the DSI scholar will be listed as an author on the final manuscript.
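
As a rough illustration of the descriptive step described above (prevalence of clinical cognitive impairment by type of organ transplant), the following minimal sketch tabulates counts per organ type. It is written in Python rather than Stata purely for illustration. The record fields `organ` and `impaired` are hypothetical names, and the toy records are made up; they are not study data or the project's actual variables.

```python
from collections import defaultdict


def prevalence_by_organ(records):
    """Return {organ: (n_impaired, n_total, prevalence)}.

    Each record is a dict with two hypothetical keys:
      'organ'    - type of organ transplant (str)
      'impaired' - True if the patient had a post-transplant
                   encounter for cognitive concern
    """
    counts = defaultdict(lambda: [0, 0])  # organ -> [impaired, total]
    for rec in records:
        counts[rec["organ"]][1] += 1
        if rec["impaired"]:
            counts[rec["organ"]][0] += 1
    return {organ: (imp, tot, imp / tot) for organ, (imp, tot) in counts.items()}


# Made-up toy records for illustration only (not study data).
toy = [
    {"organ": "kidney", "impaired": True},
    {"organ": "kidney", "impaired": False},
    {"organ": "kidney", "impaired": False},
    {"organ": "kidney", "impaired": False},
    {"organ": "liver", "impaired": True},
    {"organ": "liver", "impaired": False},
]

result = prevalence_by_organ(toy)
for organ, (imp, tot, prev) in sorted(result.items()):
    print(f"{organ}: {imp}/{tot} = {prev:.2f}")
```

The same tabulation could be repeated within calendar-year strata to look at change over time; which tests and adjustments are appropriate is left to the scholar and faculty mentor, as noted above.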

CANDIDATE REQUIREMENTS

Required Skills:

• Data cleaning

• Stata experience

• Coursework: Foundational data analysis courses, ideally including some courses in biostatistics (e.g., Intro to Health Data Science (14780)); coding experience (Stata; R would be a plus but is not strictly necessary)

Student Level (Please select all that apply)

• Master’s

• Senior

• Junior

• Sophomore