Available projects for Fall 2025 will be posted below in Spring 2025.
Please note that in order to be considered for Fall 2025 projects, students must be enrolled for the Fall 2025 semester.
Students are welcome to apply for up to 5 projects per term. You must submit a separate application for each project. If you submit more than 5 applications, we will randomly select 5 of your applications for submission.
For more information about the program, including the program benefits, application process and timeline, please visit the DSI Scholars Student Information Page.
Faculty interested in participating in Fall 2025 are encouraged to review the DSI Scholars Faculty Information page for details.
School: Vagelos College of Physicians & Surgeons
Department: Psychiatry
Project Overview: Decoding behavioral signifiers of brain states and decisions can have far-reaching implications for understanding the neural basis of action and for identifying disease. We are using high-resolution video recordings of mice as they navigate mazes but have access to very few pre-determined behavioral signifiers. Computer vision can be used to extract a variety of previously unreachable aspects of behavioral analysis, including animal pose estimation and distinguishable internal states. These descriptions allow for the identification and characterization of behavioral dynamics, which determine decision making. Applying such computational approaches to mice during exploration, and in the context of behaviors that have been validated to measure choice and memory, can reveal dimensions of behavior that predict or even determine psychological constructs like vigilance, arousal, and memory. We are also obtaining neural signal data, which can be aligned with the behavioral signifiers.
DSI scholars would use pose estimation analysis to evaluate behavioral signifiers for choice and memory and relate them to our real-time concurrent measures of neural activity and transmitter release. The students would also have the opportunity to examine how disease models known to impair performance on our tasks affect any identified signifiers.
CANDIDATE REQUIREMENTS
Required Skills: MATLAB, Python, familiarity with statistics
Student Eligibility: Master’s, Senior, Junior, Sophomore, Freshman
International Students on F1 or J1 Student Visa: Yes, eligible
School: Fu Foundation School of Engineering and Applied Science
Department: Earth and Environmental Engineering
Project Overview: The Western United States is facing intensifying regional droughts and escalating wildfire risks, both of which are projected to worsen under climate change. In response, cloud seeding has gained renewed interest as a potential tool for augmenting water supplies and mitigating wildfire risks. Currently, ten states and provinces in Western North America operate cloud seeding programs. However, the overall efficacy of cloud seeding remains contentious, largely due to the challenges of distinguishing its effects from natural meteorological variability and due to concerns about the effects of pollutants on human populations. Moreover, operational strategies for cloud seeding are hampered by limitations in our fundamental understanding of cloud microphysics and the difficulty of simulating these processes under realistic atmospheric conditions.
Since 1972, the Weather Modification Reporting Act has mandated the documentation of weather modification activities in the United States. This project aims to compile and analyze these historical records to provide a comprehensive overview of weather modification efforts over the past five decades. The study will utilize large language models (LLMs) to extract and synthesize key information from reports, creating a unique dataset that tracks the prevalence and context of weather modification technologies. This dataset will be cross-referenced with historical climate data to examine the meteorological conditions under which cloud seeding has been deployed, offering insights into its potential efficacy.
To further refine the analysis, an Invariant Causal Prediction Framework will be employed to identify consistent patterns in the use of weather modification technology in relation to climatic drivers. By integrating historical records, climate data, and causal inference methods, this project will provide a nuanced understanding of the role weather modification has played in managing water resources and mitigating climate risks in the Western United States.
DSI Scholar Responsibilities include:
1. Compile historical records on the usage of Weather Modification in the US since 1972.
2. Use LLMs to synthesize information from 1026 past historical records of weather modification usage to develop a dataset that includes the locations, dates, materials, and purpose of weather modification activities (see the sketch after this list).
3. Cross-reference locations and dates with historical climate data sets to understand the context under which Weather Modification has previously been used.
4. Investigate an Invariant Causal Prediction framework to identify consistent patterns in the use of Weather Modification Technology.
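As a rough illustration of the LLM-based extraction step, the sketch below pulls a few structured fields from one report. The OpenAI client, model name, file layout, and output schema are illustrative assumptions, not project specifications.

```python
# Minimal sketch: extract structured fields from one weather modification report
# using an LLM. The client, model name, and schema below are illustrative
# assumptions, not project specifications.
import json
from pathlib import Path

from openai import OpenAI  # assumes the official OpenAI Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Extract the following fields from this weather modification report and "
    "return JSON with keys: location, start_date, end_date, seeding_material, "
    "purpose. Use null for missing fields.\n\nReport:\n{report}"
)

def extract_record(report_text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": PROMPT.format(report=report_text)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# Hypothetical directory of plain-text report files
records = [extract_record(p.read_text()) for p in Path("reports").glob("*.txt")]
```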
Required Skills:
– Fluency in Python or R (preference for Python)
– Experience with LLMs
Student Eligibility: Master’s, Senior, Junior
Department: Department of Industrial Engineering and Operations Research
Project Overview: The objective of this project is to develop a comprehensive framework for preemptively assessing the safety of self-driving cars in a new urban environment prior to their deployment. The project will start from database construction, by processing heterogeneous raw data from police reports in natural language to street views in satellite images. Based upon that, we will develop innovative transfer learning methods for the evaluation of existing driving algorithms, and construct a traffic simulator to analyze future algorithms. Leveraging counterfactual analysis, we aim to inform the regulatory decisions surrounding the introduction of self-driving cars. Additionally, we will explore post-entry safety assessment mechanisms for ongoing monitoring and improvement.
The DSI Scholar will:
– Integrate multiple datasets from different sources into a database for convenient query.
– Preprocess data using large language models and image recognition techniques.
– Build risk models to estimate the accident rate of autonomous driving vehicles in several environments (sketched below).
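A minimal sketch of one form such a risk model could take: a Poisson regression of accident counts with an exposure offset. All file and column names are hypothetical placeholders.

```python
# Sketch of a simple accident-rate model: Poisson regression with an exposure
# offset (miles driven), using environment features as predictors.
# Column names are hypothetical placeholders.
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("crash_counts_by_environment.csv")  # hypothetical input
# df columns: accidents, miles_driven, intersection_density, avg_speed_limit

X = sm.add_constant(df[["intersection_density", "avg_speed_limit"]])
model = sm.GLM(
    df["accidents"],
    X,
    family=sm.families.Poisson(),
    offset=np.log(df["miles_driven"]),  # exposure term
).fit()
print(model.summary())
# Coefficients are log rate ratios; exp(coef) gives multiplicative effects on
# the accident rate per mile driven.
```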
Required Skills: Proficiency in Python, especially its common machine learning libraries. Experience with natural language processing.
School: Columbia Business School
Department: Finance
Project Overview: This project aims to explore the implications of U.S. labor mobility on employers’ decisions regarding health insurance plans for their employees. According to Census data, employers cover over 86% of insurance plans for private employees in the U.S. These employer-based insurance plans are vital for the well-being of employees and their families, yet they also represent a significant operational cost for companies. As labor mobility and job turnover increase, employers may face reduced incentives to offer robust risk-sharing in health plans, as the benefits of investing in healthier workers can be easily lost to competitor firms through poaching. This dynamic is contributing to the rise in high-deductible health plans, which shift more cost and risk onto employees.
The project leverages a comprehensive regulatory index, which I have already collected and constructed, detailing how each state regulates and enforces non-compete agreements in labor contracts. This index will help us understand how changes in state regulations, which influence labor mobility, ultimately affect the health insurance coverage provided by employers.
The project utilizes raw data from IRS Form 5500s, which provides insights into firms’ insurance choice decisions, and a proprietary commercial insurance claim dataset covering more than 40 million enrollees in employer-based plans, offering granular data on individual enrollment and health expenditures. Our goal is to establish causal evidence that links changes in labor mobility, induced by state policies, to firms’ insurance supply decisions and individual medical utilization. To achieve this, we will employ statistical methods such as econometrics and causal inference. Additionally, the DSI Scholar will be tasked with applying Natural Language Processing (NLP) algorithms to process and analyze the raw Form 5500 data, ensuring that we extract meaningful insights to inform our study.
The DSI Scholar will play a crucial role in the successful execution of this project, focusing on data processing, analysis, and methodological application. Their responsibilities will include:
1. Pre-processing the IRS data
a) Utilize Natural Language Processing (NLP) techniques to extract and structure relevant information from raw IRS Form 5500 data. This will involve parsing text data to identify and categorize insurance plan details and related variables.
b) Link the cleaned Form 5500 data to external databases such as S&P Compustat
2. Analysis of Commercial Insurance Plan Data:
a) Identify the insurance plans utilized by individual enrollees in the commercial insurance claim dataset. Explore the features of these insurance plans, such as deductibles, co-pays, and coverage options, and investigate any possible plan switches among enrollees over time.
b) Analyze medical expenditures for inpatient and outpatient visits, identifying trends and patterns that may inform the broader study on labor mobility and insurance choices.
3. Data Analysis
a) Conduct exploratory data analysis to uncover initial patterns, trends, and insights within the datasets. The DSI Scholar will generate descriptive statistics and visualizations to provide a clear understanding of the data landscape.
b) Support the application of econometric models and causal inference techniques to assess the impact of labor mobility on employer health insurance decisions. This may include running regression analyses, propensity score matching, or instrumental variable approaches under supervision (see the sketch after this list).
4. Documentation and Reporting
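As a rough sketch of the causal-inference step, the example below runs a two-way fixed-effects regression relating a hypothetical state-level non-compete enforceability index to a firm-level insurance outcome; every file and variable name is a placeholder, not the project's actual data.

```python
# Sketch of a two-way fixed-effects (difference-in-differences style) regression
# relating a state-level non-compete index to a firm-level insurance outcome.
# All column names are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

panel = pd.read_csv("firm_year_panel.csv")
# columns: firm_id, state, year, high_deductible_share, noncompete_index

model = smf.ols(
    "high_deductible_share ~ noncompete_index + C(state) + C(year)",
    data=panel,
).fit(cov_type="cluster", cov_kwds={"groups": panel["state"]})  # cluster by state
print(model.summary())
```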
Required Skills:
– Proficiency in Python for data processing and analysis, with skills in Natural Language Processing (NLP) for text data extraction and analysis.
– Strong foundation in econometrics and causal inference techniques, with experience in longitudinal data analysis. Proficiency in at least one of the following statistical software packages: Stata, R, or SAS.
– Effective communication skills for documenting processes and presenting results.
Student Eligibility: Master’s, Senior, Junior, Sophomore
Department: Systems Biology
Project Overview: Background: We have recently developed ‘SCRuB’, a machine learning model that removes contamination from microbiome samples by analyzing datasets of collected DNA to infer their true microbial components (Austin et al., Nature Biotechnology, 2023). We showed that this method, through a unique expectation maximization framework, improves the power of microbiome research, allowing for stronger clinical applications ranging from cancer to preterm birth. Despite SCRuB’s success, we see clear opportunities for further improvement.
Project: The aim of this project is to extend our existing SCRuB method by incorporating additional biological structure into its expectation maximization model. While the original method effectively incorporates microbiome compositions, it could be improved by developing statistical frameworks that allow it to utilize other biological data commonly available in microbiome research.
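For orientation only, the toy example below walks through the expectation-maximization loop on a simple two-component Gaussian mixture; SCRuB's actual model operates on compositional microbiome count data and is considerably more involved.

```python
# Toy EM for a two-component 1-D Gaussian mixture, to illustrate the E/M loop.
# SCRuB itself models compositional count data; this is only a conceptual sketch.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(4, 1, 100)])

# Initialize parameters: component means, variances, and mixing weight
mu, var, pi = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), 0.5

for _ in range(100):
    # E-step: responsibility of component 1 for each point
    p0 = (1 - pi) * np.exp(-(x - mu[0]) ** 2 / (2 * var[0])) / np.sqrt(var[0])
    p1 = pi * np.exp(-(x - mu[1]) ** 2 / (2 * var[1])) / np.sqrt(var[1])
    r = p1 / (p0 + p1)
    # M-step: update parameters from responsibilities
    pi = r.mean()
    mu = np.array([np.average(x, weights=1 - r), np.average(x, weights=r)])
    var = np.array([np.average((x - mu[0]) ** 2, weights=1 - r),
                    np.average((x - mu[1]) ** 2, weights=r)])

print(mu, var, pi)
```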
This project will be conducted in three steps:
1. designing the machine learning methodology that allows SCRuB to effectively use additional biological data points;
2. implementing the software using a programming language of your choice;
3. evaluating how your implementation can strengthen the power of a microbiome analysis.
All steps will involve close collaboration with members of the lab. Upon the successful culmination of the project, the student will be encouraged to publish their findings as a peer-reviewed manuscript as well as present at a scientific conference.
Required Skills: Preferred backgrounds would include: Data Science; Machine Learning; Expectation Maximization; Python, R or other languages used for data science. Experience with biology is not a strict requirement.
School: CUIMC
Project Overview: Cognitive flexibility is an executive function that is necessary to flexibly adapt previously learned behaviors to changing environmental demands. This cognitive function enables an individual to “look at things differently” and to adapt to one’s environment, instead of engaging in perseverative thinking that can lead to rumination and mental rigidity. However, the neurobiological mechanisms underlying cognitive flexibility in healthy and disease-relevant conditions are largely unknown.
The goal of this project is to employ state-of-the-art machine learning analysis of mouse neuroimaging and behavior data to understand cellular and neural circuit mechanisms regulating cognitive flexibility. In our experiments, real-time single cell neural activity data were recorded with head-mounted miniature microscopes from a large population of neurons in freely moving mice while they were trained to perform a complex decision-making task. In this task, mice had to learn that a set of features (odor, texture, and location) was associated with a hidden food reward. Upon learning the initial feature-reward association over 30 trials, the reward-predicting features were changed and the mice had to learn that a different set of features was now associated with reward. Using machine learning techniques, some of which were developed in our lab, we want to understand how neural representations of feature-reward associations emerge in the brain and how the dynamic evolution of these representations during trial-and-error experience impacts decision-making behavior.
This exciting data science project utilizes highly innovative in vivo Ca2+ imaging datasets of neural activity from freely behaving mice, with and without in vivo neural circuit manipulations, giving students the opportunity to apply computational analysis techniques that provide unprecedented insight into how the brain controls behavior.
The student will work closely with other Ph.D. students and postdocs who will provide hands-on training, and will be mentored by the PI through regular meetings. The main analysis techniques include Representational Similarity Analysis (RSA), which determines how the brain represents information by comparing the similarity of neural response patterns across different stimuli or conditions. Representational Evolution Analysis (REA) utilizes support vector machines (SVMs) and linear classifiers to determine trial-based neural and behavioral response patterns, followed by cosine similarity analyses to determine changes in the neural coding axis over the course of learning and reversal learning. All analysis pipelines and scripts are available in the lab for the student to use.
The complexity of the data set will provide ample opportunities for the student to learn, develop, and apply different types of computational analyses. We have clear hypotheses for the student to test with our established analysis pipeline. In addition, the complexity of the data also provides opportunities for the student to develop their own new questions to ask from the data set. The data set is clearly defined so the student can get started on the analyses without delay. The student will acquire mentorship from a lab with leading experience in analyzing scientific data and a successful track record in supervising students.
Some experience with Python programming, especially with the scikit-learn (sklearn) library, would be beneficial. Linear classifiers and representational similarity analysis (RSA) are the main tools we use, and we have a pipeline for a new analysis that was developed in our lab, which we named “Representational Evolution Analysis (REA)”. This analysis leverages support vector machines (SVMs) and linear classifiers to determine how neural representations dynamically evolve as a function of learning, and how they flexibly adapt during reversal learning/cognitive flexibility. Training on this new analysis pipeline will be provided through hands-on training by the student mentor.
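To give a flavor of this kind of analysis, the toy sketch below trains linear decoders on synthetic trial-by-neuron activity from two learning phases and compares their coding axes with cosine similarity; it is illustrative only and is not the lab's pipeline.

```python
# Toy sketch of an REA-style analysis: fit linear decoders of trial outcome on
# neural activity from two learning phases, then compare their coding axes
# with cosine similarity. Data here are synthetic placeholders.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_trials, n_neurons = 60, 120
X_early = rng.normal(size=(n_trials, n_neurons))  # trials x neurons, early learning
X_late = rng.normal(size=(n_trials, n_neurons))   # trials x neurons, after reversal
y = rng.integers(0, 2, n_trials)                  # e.g., rewarded vs. unrewarded feature set

def coding_axis(X, y):
    clf = LinearSVC(max_iter=10000).fit(X, y)
    acc = cross_val_score(LinearSVC(max_iter=10000), X, y, cv=5).mean()
    return clf.coef_.ravel(), acc

w_early, acc_early = coding_axis(X_early, y)
w_late, acc_late = coding_axis(X_late, y)

cosine = np.dot(w_early, w_late) / (np.linalg.norm(w_early) * np.linalg.norm(w_late))
print(f"decoding acc early={acc_early:.2f}, late={acc_late:.2f}, axis cosine={cosine:.2f}")
```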
Student Eligibility: Master’s, Senior
School: Arts & Sciences
Department: Department of Earth & Environmental Sciences
Project Overview: Climate and oceanographic observations provide us with a valuable view of a changing world, yet they are limited to little more than a single human lifespan. In order to consider these observations in the broader context of the passage of time, proxy data from a range of archives can be utilized. These data have provided powerful evidence of abrupt climate changes in the past, and implicate the important role of the ocean in these changes. Although proxy data are immensely useful and remain the only way to assess natural variability in the climate system, they are often scattered in space and discontinuous in time, presenting a barrier to their full utilization.
This project involves the compilation and visualization of climatic and oceanographic datasets from the last time the Earth was as warm as today. These data represent important characteristics and processes, including sea-surface temperature, continental and sea-ice, ocean currents, and deep ocean carbon storage, initially recorded in deep-sea sediments and subsequently analyzed in paleoclimate laboratories around the world. The project will involve compiling these data from online repositories and other sources, and then using interpolation schemes to generate a series of visualizations in the form of maps and cross sections during different intervals (“time-slices”) through the past warm interval that will render the existing information more accessible to climate scientists, oceanographers, policy-makers and the general public.
The individual visualizations will be useful as stand-alone time-slices through the progression of a warm climatic interval that was analogous to the modern, but without the intervention of human interactions. The sequence of visualizations may also be combined into video animations that portray the previous natural evolution of the ocean and climate during a warm interval that can be compared directly to ongoing changes.
With guidance from the PI, the DSI scholar will initially compile the data from online sources. They will then assemble them spatially and temporally in order to generate maps and oceanographic cross sections. These visualizations will require the development and application of interpolation schemes to turn the scattered data into continuous views that provide a state-of-the-art estimate of oceanographic and climatic conditions from each of ten intervals of time from the previous warm interval. This is likely the main and central accomplishment of the project, although additional steps may include generating animated visualizations with interpolations through time as well as space, and the comparison of maps and ocean sections to the modern equivalents in order to evaluate the anomalies associated with human influence on the climate system. Through the course of the project, the DSI scholar will have the opportunity to interact with other members of our research group, including undergraduate and graduate students, and will have the option to spend time at the Lamont-Doherty Earth Observatory campus.
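A minimal sketch of the interpolation step, assuming the proxy data for one time-slice have been compiled into a table of site coordinates and values; the file and column names are placeholders.

```python
# Minimal sketch: interpolate scattered proxy values (e.g., sea-surface
# temperature estimates at core sites) onto a regular lat/lon grid for one
# time-slice. File and column names are hypothetical.
import numpy as np
import pandas as pd
from scipy.interpolate import griddata
import matplotlib.pyplot as plt

df = pd.read_csv("timeslice_sst_proxies.csv")  # columns: lat, lon, sst

lon_grid, lat_grid = np.meshgrid(np.arange(-180, 180, 2.0), np.arange(-90, 90, 2.0))
sst_grid = griddata(
    points=df[["lon", "lat"]].values,
    values=df["sst"].values,
    xi=(lon_grid, lat_grid),
    method="linear",  # compare with "nearest" or "cubic"
)

plt.pcolormesh(lon_grid, lat_grid, sst_grid, shading="auto")
plt.colorbar(label="SST (proxy estimate)")
plt.savefig("timeslice_map.png", dpi=150)
```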
Required Skills: Fluency in Python or data analysis packages such as MATLAB will be helpful, although not required. Similarly, experience with data mining techniques may be advantageous, but will not be necessary.
Department: Radiation Oncology
Project Overview: Radiotherapy is a cornerstone of cancer treatment that utilizes ionizing radiation to destroy malignant cells. By accurately delineating, or segmenting, tumor target and surrounding organ-at-risks (OARs), the treatment planning process will guide the treatment machine to deliver precise radiation dose to tumor while sparing healthy surrounding tissues to minimize side effects. Despite the advances of end-to-end deep learning models in automated medical image segmentation, due to the inherent challenges in cone beam computed tomography (CBCT) such as low soft tissue contrast and limited image quality, the current fully automated segmentation methods usually fail to consistently achieve satisfactory results for clinical use. As a result, their outputs may require significant manual adjustments, which has become a bottleneck in time-sensitive practices such as online adaptive radiotherapy (oART).
The main purpose of this DSI scholar project is to develop an artificial intelligence (AI)-driven interactive tool for image segmentation in adaptive radiotherapy using visual prompt-based foundation models and reinforcement learning. This tool will be developed through two specific aims: 1) Development of a web-based interface: create a user-friendly web interface that accepts user inputs – such as clicks, scribbles, and bounding boxes – to guide the interactive segmentation process. The segmentation will be powered by visual prompt-based foundation models that are adapted for CT images. 2) Optimization of interactive contour refinement: optimize the dynamic process of contour refinement through reinforcement learning, aiming to achieve the desired segmentation with the fewest possible iterations.
We have identified the roadmap to expand our web-based automated image segmentation system to an interactive tool. Our desired end goal is that this interactive tool can significantly shorten oART treatment time. This will reduce the risk of patient movement during treatment, offering potentially more effective treatment options for cancer patients.
1. Implement a web-based interface that takes user’s inputs (prompt) to guide interactive segmentation.
2. Finetune Segment Anything Model (SAM) 2 using LoRA for 3D CT images, and make it work on the system developed in 1.
3. Assist in investigating approaches to optimize the dynamic process of contour refinement with the initial results obtained from automated segmentation algorithms for efficient and effective contour refinement.
4. Present results to the group and prepare for potential publication or further development.
1. Familiarity with web programming using JavaScript/HTML/CSS and WebGL
2. Fluency in Python and PyTorch
3. Experience with medical image analysis using packages such as ITK and MONAI
4. Experience with reinforcement learning is desired
School: Columbia Climate School
Department: Lamont Doherty Earth Observatory
Project Overview: Phytoplankton are tiny photosynthetic organisms that live in the sunlit areas of oceans and freshwater bodies. They play a crucial role in converting CO2 dissolved in water into organic compounds that sustain nearly all marine life, while producing over half of the oxygen in our atmosphere. Due to their ability to fix CO2, phytoplankton are vital for understanding carbon sequestration, climate regulation, and supporting fisheries. With around 5,000 species, studying them is essential for monitoring the health of aquatic ecosystems and life on Earth.
Traditionally, microscopy has been used to study phytoplankton, but it is slow, costly, and labor-intensive. While newer imaging technologies have sped up this process, they still require manual handling and expert classification. At Lamont-Doherty Earth Observatory, we modified a commercially available imaging system to automate the imaging of particles and plankton in water samples. This system can continuously capture phytoplankton images while a ship is moving, allowing data collection across large areas and over time. In the last two years, we have field-tested this system, amassing millions of images from oceans, coastal areas, and rivers.
However, the slow manual classification process is still a challenge. Our goal is to overcome this by developing a Computer-Assisted Automated Phytoplankton Classification System (CAPCS) using advanced computer vision and deep learning techniques. This will enable rapid, accurate identification of phytoplankton species based on unique features, transforming data collection.
This innovation is critical for NASA’s hyperspectral ocean color sensors, such as PACE, EMIT, and GLIMR, which aim to detect major phytoplankton groups from space. Overcoming these challenges will transform the study of water quality, marine pollution, climate change, and fisheries, meeting the growing demand for high-resolution data from both field and satellite observations.
DSI Scholar Responsibilities
1. Develop AI Models:
– Design and implement deep learning models for phytoplankton image classification (see the sketch after this list).
– Apply computer vision techniques to improve accuracy and efficiency.
2. Data Management:
– Clean and preprocess large phytoplankton datasets.
– Use data augmentation to enhance model robustness.
3. Optimize Algorithms:
– Test and refine AI algorithms to address limitations and improve performance.
– Stay updated with advancements in AI and machine learning.
4. Interdisciplinary Collaboration:
– Work with Goes and other researchers to integrate AI with ecological and environmental sciences.
– Bridge computer science, statistics, and environmental science in research efforts.
5. Evaluate Models:
– Assess model performance through rigorous validation and cross-validation.
– Ensure accuracy and robustness of AI solutions.
6. Documentation and Reporting:
– Document methodologies and results thoroughly.
– Prepare reports and presentations indicating progress of the work
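As a rough sketch of the model-development task, the example below fine-tunes a pretrained ResNet on a folder of labeled plankton images; the directory layout, model choice, and hyperparameters are illustrative assumptions, not project specifications.

```python
# Sketch: fine-tune a pretrained CNN for phytoplankton image classification.
# Directory layout, model choice, and hyperparameters are placeholders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
train_ds = datasets.ImageFolder("plankton_images/train", transform=tfm)  # one folder per class
train_dl = DataLoader(train_ds, batch_size=32, shuffle=True)

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, len(train_ds.classes))  # new classification head

opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    for images, labels in train_dl:
        opt.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        opt.step()
```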
1. Machine Learning & Deep Learning:
– Proficiency in implementing machine learning algorithms, especially Convolutional Neural Networks (CNNs), and advanced deep learning methods like Recurrent Neural Networks (RNNs) and Transformers.
2. Computer Vision:
– Strong understanding of image processing, object detection, and segmentation for analyzing phytoplankton and microplastic images.
3. Feature Selection & Dimensionality Reduction:
– Knowledge of methods to manage and optimize high-dimensional data.
4. Statistical Analysis:
– Foundation in statistical methods, including spatial statistics, for robust data interpretation.
5. Programming Skills:
– Proficiency in Python or R
6. Model Evaluation & Optimization:
– Skills in evaluating and optimizing machine learning models for enhanced performance.
Funding Note: This is a grant funded project. Exact amount of funding will depend on hours completed.
Project Overview: Ocean color remote sensing has long been used to map phytoplankton functional types (PFTs) in the upper ocean, traditionally relying on the ratios of photosynthetic pigments chlorophyll-a and accessory (non-photosynthetic) pigments like chlorophyll-b and carotenoids. However, these methods often fall short in distinguishing complex PFT compositions due to overlapping pigment absorption peaks and the limited spectral resolution of traditional multi-spectral ocean color sensors.
The advent of hyperspectral remote sensing, notably through NASA’s PACE mission and the upcoming GLIMR mission, offers continuous spectral coverage from the ultraviolet to near-infrared wavelengths, significantly enhancing the ability to differentiate between various phytoplankton pigments. Hyperspectral data capture detailed spectral features, which are critical for accurate pigment identification and PFT classification, which in turn is important for fisheries, carbon sequestration, and climate change studies.
Recent advancements incorporate Artificial Intelligence (AI) techniques such as linear spectral unmixing, independent component analysis, Gaussian mixture models, and finite mixtures of skewed components (FMSC) to overcome the limitations of traditional algorithms. Traditional methods decompose pigment absorption spectra into Gaussian components, but these often face challenges with overlapping absorption peaks and limited spectral resolution. The FMSC algorithm, however, encodes spectral shapes in a finite metric space, providing a more nuanced representation of spectral data and improving the accuracy of pigment retrieval.
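For reference, the sketch below implements the traditional Gaussian-decomposition baseline with SciPy; the band centers and widths are illustrative, not a validated pigment library.

```python
# Sketch of the traditional baseline: decompose a measured phytoplankton
# absorption spectrum into a sum of Gaussian pigment bands.
# Band centers/widths below are illustrative only.
import numpy as np
from scipy.optimize import curve_fit

def gaussian_sum(wl, *params):
    # params = (a1, c1, w1, a2, c2, w2, ...): amplitude, center (nm), width (nm)
    total = np.zeros_like(wl, dtype=float)
    for a, c, w in zip(params[0::3], params[1::3], params[2::3]):
        total += a * np.exp(-((wl - c) ** 2) / (2 * w ** 2))
    return total

wavelengths = np.arange(400, 701, 1.0)            # nm
spectrum = np.loadtxt("absorption_spectrum.txt")  # hypothetical measured a_ph(wavelength)

p0 = [0.05, 440, 15,   # chlorophyll-a-like blue peak (illustrative)
      0.03, 490, 20,   # carotenoid-like band (illustrative)
      0.04, 675, 10]   # chlorophyll-a red peak (illustrative)
popt, _ = curve_fit(gaussian_sum, wavelengths, spectrum, p0=p0, maxfev=20000)
print(popt.reshape(-1, 3))  # fitted amplitude, center, width per band
```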
This study will utilize HPLC pigment data obtained from the field and hyperspectral ocean color data to:
1. Develop AI and other complex statistical methods to improve the accuracy of distinguishing between complex mixtures of pigments.
2. Use field pigment datasets to evaluate the performance of various algorithms against conventional spectral decomposition techniques.
3. Apply the algorithms developed to satellite data for improved global monitoring and analysis of PFTs from space.
The Scholar will:
1. Data Acquisition: Assist in obtaining and managing field hyperspectral optical data and HPLC pigment data from the NASA SeaBASS database.
2. Data Cleaning and Preprocessing: Prepare hyperspectral datasets for analysis by removing noise and normalizing data.
3. Algorithm Development and Implementation: AI Algorithm Development: Implement and train AI models for pigment retrieval using hyperspectral data. This involves coding, testing, and optimizing machine learning algorithms.
4. Algorithm Integration, Analysis and Validation: Apply and refine various statistical methods to hyperspectral datasets for accurate pigment and PFT identification. Analyze spectral data to extract relevant features and validate the accuracy of the FMSC and AI algorithms by comparing the performance of the various AI approaches with traditional Gaussian curve-fitting methods.
5. Data Interpretation and Reporting: Translate algorithmic pigment outputs into meaningful insights about phytoplankton communities and their spatial distributions.
6. Data Management and Documentation: Refine code and prepare a detailed workflow for testing by other ocean color scientists. Prepare reports, be willing to give a presentation at a NASA meeting, and contribute to publications.
7. Application of the algorithms to Satellite fields of hyperspectral ocean color data from PACE to generate regional and global maps of PFTs.
– Fluency in R and/or Python, and experience working with large data files, in particular netCDF-format files.
– Capable of querying databases and extracting and pairing datasets for algorithm development and algorithm performance evaluation.
– Knowledge of the use of AI based statistical approaches for extracting pigment information from hyperspectral datasets.
– Experience using such algorithms to map phytoplankton functional types from satellite data.
Department: Computer Science
Project Overview: Existing quantum platforms, such as IBM’s Qiskit, allow off-site users to access the platforms. Long-term, we are interested in using minor architecture discrepancies to identify the specific machine a computation has been performed on (much like fingerprinting or PUFs for classical systems). More specifically, current quantum hardware requires high levels of error correction to maintain the states of a computation. Our approach is to pick simple computations, run them on the various machines, and observe statistical differences in the syndromes used to indicate how to perform error correction. Currently, we are using support vector machines to perform the inference, but we would like to consider alternative classification strategies.
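As a starting point for exploring alternative classifiers, a comparison like the one sketched below could be run on syndrome-statistics feature vectors; the data files and feature construction are hypothetical.

```python
# Sketch: compare several classifiers on syndrome-statistics feature vectors,
# with the backend machine as the label. Data loading is a placeholder.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X = np.load("syndrome_features.npy")   # shape: (runs, features), hypothetical
y = np.load("machine_labels.npy")      # backend identity per run, hypothetical

for name, clf in [
    ("SVM (RBF)", SVC()),
    ("Decision tree", DecisionTreeClassifier()),
    ("Random forest", RandomForestClassifier(n_estimators=200)),
    ("MLP", MLPClassifier(max_iter=2000)),
]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```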
Required Skills: Familiarity with various ML methods (SVMs, Decision trees, neural nets, transformers) and/or familiarity with interfaces to systems/packages that can apply these methods to collect results.
Department: Earth and Environmental Sciences
Project Overview: The ocean carbon sink accounts for roughly 25% of annual anthropogenic CO2 emissions. To understand past changes, diagnose ongoing changes, and predict the future behavior of the ocean carbon sink, we must understand its spatial and temporal variability. In particular, estimates of air-sea CO2 fluxes across the globe are needed to monitor year-to-year changes in this key climate service. However, the ocean is poorly sampled, and the sparsity of measurements in space and time makes estimating such fluxes challenging. In the McKinley group, we have developed several Machine Learning (ML) techniques to reconstruct the ocean carbon field based on its association with satellite-based full-field driver data. These machine learning algorithms interpolate sparse surface ocean pCO2 observations to global coverage.
Understanding the value of different data sources to these ML algorithms is an active area of ML research. The spatio-temporal nature of the observed data makes it difficult to understand the impact of specific observations on the performance of the ML estimation. This DSI Scholar will develop approaches to quantify the contribution of individual pCO2 observations to ML interpolation algorithms using Explainable ML methods.
More specifically, with the Data Shapley framework (Ghorbani and Zou, 2019), we plan to assign a specific value, or score, to each data point in the available database. We will also quantitatively evaluate how alternative sampling patterns would change algorithmic skill. To do this, we will use a multi-model, multi-ensemble ‘testbed’, as we have in a range of previous studies.
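The toy example below illustrates the Monte Carlo form of Data Shapley on synthetic data with a simple regression model; the real application would swap in the pCO2 reconstruction algorithm and the testbed data.

```python
# Toy Monte Carlo Data Shapley sketch (after Ghorbani & Zou, 2019): the value of
# each training point is its average marginal gain in validation skill over
# random orderings. Model and data here are synthetic placeholders.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)
X_tr, y_tr, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

def score(idx):
    if len(idx) < 2:
        return 0.0
    model = Ridge().fit(X_tr[idx], y_tr[idx])
    return r2_score(y_val, model.predict(X_val))

n, n_perm = len(X_tr), 50
shapley = np.zeros(n)
for _ in range(n_perm):
    order = rng.permutation(n)
    prev = 0.0
    for k in range(1, n + 1):
        cur = score(order[:k])
        shapley[order[k - 1]] += cur - prev  # marginal contribution of the k-th point
        prev = cur
shapley /= n_perm
print("most valuable points:", np.argsort(shapley)[-5:])
```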
In Fall 2025, the DSI Scholar will begin by learning about the methods and data needed for this project, such as the pCO2-Residual product (Bennington et al. 2022, JAMES, doi:10.1029/2021MS002960) and output from Earth System Models (ESMs), which are used for the testbed. They will review existing code and help develop improved workflows with a strong focus on data sharing and reproducibility. They will then work with us to begin to implement Data Shapley for data valuation. The student will also contribute to analysis of the reconstructed ocean carbon field and be included in publications resulting from this work.
Required Skills: Strong Python and ML skills are required – please discuss both in your application.
Student Eligibility: Master’s
School: School of International and Public Affairs/Arts & Sciences
Department: Economics
Project Overview: We have a number of projects looking at the diffusion of legal ideas in the United States and Canada. One project involves learning the structure of citations and the diffusion of ideas in the US federal judiciary. We have the text of all Federal cases back to 1800, along with the network of citations cases make to each other. We want to look at “breakthrough” federal cases that replace all future citations to the things they cite, or which have embedding distance far from the things they cite, but close to the cases that cite them. Then we will use this to rank influential legal cases in US history, and we will ask our GPT-4o-based summarizer to translate them into accessible language.
The other project uses an existing corpus of collective bargaining contracts in Canada, and the DSI scholar will scrape the universe of judicial labor arbitration cases. The idea is that the language and concepts articulated in judicial decisions will diffuse into the text of collective bargaining agreements, as lawyers coordinate the judicial language. We will look at embedding distances between contracts and arbitration opinions.
The DSI scholar will a) process the judicial opinions dataset and implement the two breakthrough measures, and b) scrape text data from the CANLII database. We will use Sentence-BERT (S-BERT) to measure embeddings.
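A minimal sketch of the embedding-distance measurement, assuming the sentence-transformers package; the checkpoint name and input texts are placeholders.

```python
# Sketch: embed case/contract texts with Sentence-BERT and compute pairwise
# cosine similarities. The model name and inputs are illustrative choices.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder S-BERT checkpoint

opinions = ["text of arbitration opinion A...", "text of opinion B..."]
contracts = ["text of collective bargaining agreement 1...", "agreement 2..."]

emb_opinions = model.encode(opinions)
emb_contracts = model.encode(contracts)

# similarity[i, j] = cosine similarity between contract i and opinion j
similarity = cosine_similarity(emb_contracts, emb_opinions)
print(similarity)
```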
Required Skills: Python, and specifically expertise with networks and embeddings would be helpful.
School: Climate School
Department: Seismology, Geology and Tectonophysics
Project Overview: Volcanology has transformed into a highly data-driven and computationally focused field. Numerous computational models have been developed to simulate various physical phenomena during volcanic eruptions. A critical component of forecasting the behavior of volcanic eruptions is assessing the probability of different outcomes and scenarios. For this, scientists implement probabilistic modeling approaches, which provide such assessments. The DSI scholar who will join this project will be tasked with creating workflows that perform simulations and generate hazard maps with quantitative probability assessment. The workflows will be created using Jupyter Notebooks and be based on a range of eruption simulation tools written in different languages. The workflows will utilize probabilistic tools such as Markov Chain Monte Carlo (MCMC) or the Ensemble Kalman Filter (EnKF). The objective of this project is to provide a comprehensive and flexible tool for members of the volcanology community to utilize. In the long term, this tool could be expanded to use in any geophysical process that requires uncertainty quantification.
Student Eligibility: Master’s, Senior, Juniors only.
School: Vagelos College of Physicians and Surgeons
Department: Ophthalmology
Project Overview:
Motivated by the global prevalence of untreated vision impairment, we seek to address the need for more accurate and timely diagnosis for irreversible eye diseases. Traditionally, ophthalmologists rely on 2D optical coherence tomography (OCT) reports derived from raw 3D OCT data, but this approach can lead to errors, especially in cases of atypical ocular anatomy or imaging artifacts. Although raw 3D OCT data provides a more comprehensive view of the retina, its complexity and time-intensive analysis present significant hurdles to its practical use. This DSI Scholars project has two aims: (1) develop a deep learning (DL) model that transforms seamlessly between 2D OCT reports and 3D underlying OCT data, and (2) visualize this 3D data for ophthalmologists through augmented reality/virtual reality (AR/VR).
Aim 1: Working closely with ophthalmologists at CUIMC, we will acquire annotations within temporal regions of 3D OCT data that relate to anatomical features of importance in 2D OCT reports, providing 3D-to-2D mapping information to train our DL model. Architectures to be explored include 3D UNets, autoencoders, or 2D Vision Transformers applied on subsets of slices from 3D OCT volumes (employing self-attention and cross-attention across patches from multiple slices).
Aim 2: With our generative 2D-to-3D transformation model from Aim 1, we will design a method for clinicians to visualize synthesized 2D and 3D data simultaneously through AR and/or VR. This will involve integrating the model developed in Aim 1 into the Unity Engine for real-time processing of inputs and visual rendering for ophthalmologist users.
By implementing these aims, our 2D-to-3D AI transformation and AR/VR visualization system will empower ophthalmologists with comprehensive insights derived from 3D OCT data. By extracting features not accessible through traditional 2D analysis alone, our approach has the potential to assist in expediting care for those suffering from vision impairment.
– Fluency in Python and PyTorch
– Awareness of/willingness to probe literature related to 3D UNets, Vision Transformers, Autoencoders
– Familiarity with Unity Engine, Augmented Reality/Virtual Reality development and workflow
– Interest in ophthalmology, ability/interest to work collaboratively in a team of engineers and physicians
School: Zuckerman Mind Brain Behavior Institute
Department: Biology
Naturalistic animal behavior is built from simpler behavioral modules that reflect the organization and function of the underlying neural circuits.
To understand the parental behavioral differences among Peromyscus mice, we track freely moving adult mice while they are retrieving pups that were removed from their nest. We use marker-less 3D pose-estimation software based on machine learning (SLEAP and DeepLabCut) and extract meaningful parameters including velocity, acceleration, turning, orientation of the adult to the pup, and distance between the pup and the adult mouse. We already have a test dataset of ~200 individual pup retrieval sequences for which we have extracted the kinematic parameters.
Next, we want to classify individual pup retrieval sequences based on these parameters using random forest or autoregressive hidden Markov model algorithms to identify behavioral modules that the adult animal is routinely performing during this task.
In the future, we want to use these trained models to predict behavioral modules on a new dataset for which we also recorded the mice’s brain activity.
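As an illustration of the supervised variant of this classification, the sketch below fits a random forest to per-frame kinematic features; the file, feature, and label names are assumptions about how the data might be organized.

```python
# Sketch: classify frames of pup-retrieval sequences into behavioral modules
# from kinematic features with a random forest. The feature/label files and
# the use of hand-labeled modules for training are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

frames = pd.read_csv("retrieval_kinematics.csv")
# columns: velocity, acceleration, turning, orientation_to_pup, dist_to_pup, module

features = ["velocity", "acceleration", "turning", "orientation_to_pup", "dist_to_pup"]
X, y = frames[features].values, frames["module"].values

clf = RandomForestClassifier(n_estimators=300, random_state=0)
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())

clf.fit(X, y)
print(dict(zip(features, np.round(clf.feature_importances_, 3))))  # which kinematics matter
```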
Required Skills: The ideal candidate would be comfortable with basic probability as well as multivariate calculus and linear algebra. They will have to implement models and algorithms in Python, so coding proficiency is important. No biological background is strictly needed as we will teach the candidate everything that is needed to successfully finish the project.
School: Climate School
Department: International Research Institute for Climate and Society
Sub-seasonal-to-seasonal (S2S; weeks to months) time-scale predictions have great potential societal benefits, such as early warnings of heavy rains, droughts, and heat waves. Reliable forecasts a few weeks ahead can provide invaluable tools for routine planning in the agriculture, water resources, public health, and humanitarian aid sectors. However, the skill of current S2S forecasts made using large physics-based climate model ensembles is limited, partly because model simulations depart from reality. Calibration of climate-model forecast probabilities is necessary to account for model deficiencies and produce reliable forecasts. State-of-the-art calibration methods in IRI’s climate predictability tools are linear, which is a limitation because the atmospheric flow is inherently non-linear, and model errors often grow exponentially.
The goal of this project is to develop machine learning/artificial intelligence (ML/AI) forecast tools that enable non-linear bias correction to meet the growing service demands on improved forecast products at S2S time scales. The intern will code and run test cases to compare the performance of different ML methods (e.g., Regression Trees, CNNs, deep learning) to improve Indian summer monsoon probabilistic forecast skill by bias-correcting/calibrating sets of S2S forecast ensembles from large physics-based climate models run at global climate forecasting centers (e.g., NCEP, ECMWF) and archived in IRI Data Library.
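A rough sketch of comparing a non-linear calibration against a linear baseline, assuming ensemble-derived predictors and observed tercile categories have already been assembled; the input files are placeholders.

```python
# Sketch: non-linear calibration of tercile-category forecasts. Ensemble-derived
# predictors (e.g., ensemble mean and spread) map to observed tercile categories;
# a gradient boosting model is compared with a linear baseline. Inputs are placeholders.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X = np.load("ensemble_features.npy")   # e.g., ensemble mean and spread per grid point
y = np.load("observed_tercile.npy")    # 0 = below, 1 = near, 2 = above normal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for name, clf in [("linear baseline", LogisticRegression(max_iter=1000)),
                  ("gradient boosting", GradientBoostingClassifier())]:
    clf.fit(X_tr, y_tr)
    probs = clf.predict_proba(X_te)
    print(name, "log loss:", round(log_loss(y_te, probs), 3))
```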
– Fluency in Python coding and libraries, Jupyter Notebooks, using GitHub repos.
– Experience with using various ML methods (e.g., Regression Trees, CNNs, deep learning) is required.
– Experience with climate data and model output would be an advantage, but not required.
Department: Radiology
Mild traumatic brain injury (mTBI), also known as concussion, remains largely invisible on standard MRI images, even though survivors of car accidents, falls, or domestic violence report neurological and cognitive symptoms. Some patients recover within months, but as many as 60% remain symptomatic at 6 months, and 30% or more suffer for years, many for the rest of their lives. The lack of accepted and easily adopted clinical diagnostic tools severely limits identification of the subgroup with poor prognosis, development of interventions and treatment options, and advances in understanding the underlying mechanisms of the injury.
We use diffusion tensor imaging (DTI), a widely available quantitative MRI technique, to detect and visualize subtle abnormalities in the microstructure of the white matter of the brain in these patients. Patients’ images are compared to those of healthy controls in a voxel-by-voxel manner to localize areas of abnormality. Computationally, this is a very CPU-intensive process, requiring tasks including image processing, image registration, and robust statistics.
The main goal of this project is to develop an AI-based algorithm to speed up identification of localized regions of abnormality. A large data repository, including processed images from healthy controls and mTBI patients, is available within the Translational Neuroimaging Laboratory at CUIMC to develop, train, and test the algorithm. The project aims to achieve specific targets for key performance indicators of the new algorithm.
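For context, the voxel-by-voxel comparison being accelerated can be sketched as a simple z-scoring of a patient map against the control distribution; the file names, array shapes, and threshold below are illustrative, and real pipelines add registration, smoothing, and robust statistics.

```python
# Sketch of the voxel-by-voxel comparison: z-score a patient's DTI metric map
# (e.g., fractional anisotropy) against a control distribution.
# File names, shapes, and the threshold are placeholders.
import numpy as np

controls = np.load("controls_fa.npy")   # shape: (n_controls, x, y, z)
patient = np.load("patient_fa.npy")     # shape: (x, y, z), already co-registered

mu = controls.mean(axis=0)
sigma = controls.std(axis=0, ddof=1)
z_map = (patient - mu) / np.where(sigma > 0, sigma, np.nan)

abnormal = np.abs(z_map) > 2.5          # illustrative threshold
print("voxels flagged:", int(np.nansum(abnormal)))
```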
Required Skills: Experience working with various ML/AI models (e.g. RESNET, UNET, Inception, VGG), documentation/organization skills, familiarity with Linux
Department: Seismology, Geology and Tectonophysics
Volcanology has transformed into a highly data-driven and computationally focused field. Numerous computational models have been developed to simulate various physical phenomena during volcanic eruptions. A critical component of forecasting the behavior of volcanic eruptions is assessing the probability of different outcomes and scenarios. For this, scientists implement probabilistic modeling approaches, which provide such assessments. The DSI scholar who will join this project will be tasked with creating workflows that perform simulations and generate hazard maps with quantitative probability assessment. The workflows will be created using Jupyter Notebooks and be based on a range of eruption simulation tools written in different languages. The workflows will utilize probabilistic tools such as Markov Chain Monte Carlo (MCMC) or the Ensemble Kalman Filter (EnKF). The objective of this project is to provide a comprehensive and flexible tool for members of the volcanology community to utilize. In the long term, this tool could be expanded to use in any geophysical process that requires uncertainty quantification.
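As a minimal illustration of one probabilistic tool the workflows would wrap, the sketch below runs a Metropolis-Hastings sampler for a single model parameter on synthetic data; the likelihood, prior, and data are purely illustrative.

```python
# Minimal Metropolis-Hastings sketch: sample a single eruption-model parameter
# given observations. The Gaussian likelihood and synthetic data are illustrative.
import numpy as np

rng = np.random.default_rng(0)
observed = rng.normal(3.0, 0.5, size=20)   # stand-in for observed eruption data

def log_posterior(theta):
    if theta <= 0:                          # flat prior on theta > 0
        return -np.inf
    return -0.5 * np.sum((observed - theta) ** 2 / 0.5 ** 2)

samples, theta = [], 1.0
for _ in range(5000):
    proposal = theta + rng.normal(0, 0.2)   # random-walk proposal
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(theta):
        theta = proposal                    # accept
    samples.append(theta)

print("posterior mean:", np.mean(samples[1000:]))  # discard burn-in
```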
– Fluency in Python is required.
– Familiarity with statistical methods and notation is highly preferred.
– We expect the scholar to be comfortable with reading academic papers or other high-level readings to familiarize themselves with the concepts.
– Other scripting skills and programming language knowledge will be beneficial to the work as well.
School: School of Nursing
Department: Nursing
The DSI Scholar will work on cleaning and analyzing data for two projects focused on the influence of daily discrimination on cardiovascular disease risk. The first project is our DSI Seed Grant, which was a 30-day daily diary study that investigated the impact of daily discrimination on sleep health in a sample of Black and Latinx LGBTQ+ Adults. Although we have 3 papers under review and 2 papers in progress from the DSI Seed Grant, we have additional data that needs to be analyzed. Specifically, we plan to use unsupervised machine learning to identify sleep phenotypes and their associations with daily discrimination. The next project, which was recently funded by the National Heart, Lung, and Blood Institute, is a 1-week daily diary study that investigates the influence of anticipated and vicarious discrimination on home blood pressure. The DSI Scholar will assist our team with maintaining our study database, completing ongoing data cleaning, tracking data collection, developing code for data analysis, and addressing any data concerns (as needed).
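One possible starting point for the sleep-phenotype analysis is a k-means clustering of person-level actigraphy summaries, sketched below; the feature names and the choice of cluster number are illustrative assumptions.

```python
# Sketch: unsupervised identification of sleep phenotypes from person-level
# actigraphy summaries with k-means. Feature names and k are placeholders.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("sleep_summaries.csv")
# e.g., per-participant means: total_sleep_time, efficiency, waso, onset_latency
features = ["total_sleep_time", "efficiency", "waso", "onset_latency"]
X = StandardScaler().fit_transform(df[features])

for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, "silhouette:", round(silhouette_score(X, labels), 3))

df["phenotype"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```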
– The DSI Scholar should have fluency in R/Python and an interest in health disparities research.
– Familiarity with machine learning and multilevel modeling is preferred but not necessary.
– We have longitudinal actigraphy and daily diary data that the DSI Scholar will help analyze, but prior experience is not needed.
School: Mailman School of Public Health
Department: Epidemiology
The goal of this project is to predict climate-related physiologic stress on participants in a study on overland migration and nutrition security. The project has two leads and mentors (M. Orjuela-Grimm and Robbie Parks (Environmental Health Sciences)). Tasks to fulfill the study aims include working with time-referenced and georeferenced data from reported migration trajectories of 104 Latin American overland migrants during the summer and fall of 2023, and possibly summer 2024. The data need to be matched to date-specific climate variables (ambient air temperature (considering daily ranges), humidity) and geographically specific elevation, and then modeled to consider changes that may act as physiologic stressors, taking into consideration trajectory (departure point) and geographic challenges. Each migrant will have daily data points spanning 14 to 100 days. Data sources will include ERA5 and ERA5-Land as well as other sources. Modeling strategies may include estimating heat stress with the Wet Bulb Globe Temperature and the Heat Index.
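A minimal sketch of the matching step, assuming the trajectories and a daily ERA5 temperature file are available locally; the file, variable, and column names are assumptions about how the data will be organized.

```python
# Sketch: match daily migrant trajectory points to ERA5 2-m temperature by
# nearest grid cell and day. File, variable, and column names are placeholders,
# and longitudes are assumed to use the -180..180 convention.
import pandas as pd
import xarray as xr

traj = pd.read_csv("trajectories.csv", parse_dates=["date"])
# columns: participant_id, date, lat, lon

era5 = xr.open_dataset("era5_daily_t2m.nc")  # variable "t2m" on latitude/longitude/time

matched = era5["t2m"].sel(
    latitude=xr.DataArray(traj["lat"], dims="points"),
    longitude=xr.DataArray(traj["lon"], dims="points"),
    time=xr.DataArray(traj["date"], dims="points"),
    method="nearest",
)
traj["t2m_c"] = matched.values - 273.15  # Kelvin to Celsius
traj.to_csv("trajectories_with_climate.csv", index=False)
```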
The data can potentially be combined / compared with indicators from the water insecurity experience scale collected from the same population.
The end goal is to create a method to approximate such stressors and model their potential impact on health-related indicators among migrants on overland migration routes. Ultimately the data will be used to help inform health-related service provision at migrant shelters in Mexico. The work is expected to result in data that would serve for an abstract submission at the end of the fall semester, with subsequent poster presentations and a potential manuscript submission. The data are from a multinational pilot study funded by the Institute of Latin American Studies.
– Skill sets include fluency in R, Python, and GitHub; project management; an interest in geographic information and climate; and a working knowledge of Spanish (fluency is an advantage) and of the geography of Central America and Mexico.
Department: Biomedical Engineering
Significance: Reward learning is a core cognitive function that allows humans and other animals to consistently make decisions that optimize behavior and enhance survival. Deficits in reward learning are common in neuropsychiatric disorders, impairing patients’ ability to efficiently interact with their environment. Despite extensive studies in rodents, the generalizability of reward learning mechanisms to humans remains poorly understood. This research aims to close this gap by identifying the computational principles that govern reward learning in the brain. Understanding these mechanisms is critical for identifying disruptions in reward-based processes associated with disorders, thereby improving biomarker identification and enhancing health analytics for better clinical outcomes. Furthermore, by decoding the biological principles of reward learning, this research could lead to the development of a new class of energy-efficient reinforcement learning (RL) models that employ cortical coding schemes.
Approach: This research project focuses on probing the neural codes and computational algorithms the brain uses to tackle dynamic reward-learning tasks. We are particularly interested in tasks where the optimal solution not only changes over time but is also influenced by the policies of other agents within the environment. To investigate this, we will create a simulated environment initially featuring a single agent, with a second agent introduced later. Each agent will be constructed as a biologically plausible reinforcement learning model, each employing distinct, time-varying learning rules. By simulating a wide range of learning rule functions for each agent, we aim to elucidate biological reward-learning mechanisms at play both in individual scenarios and in more complex settings where an agent’s policy is affected by dynamic environmental changes, including the strategies adopted by other agents. Working alongside me and our interdisciplinary team of computational and systems neuroscientists, a DSI scholar will play a critical role in developing this biological multi-agent RL platform and systematically reverse-engineering the agents’ dynamics.
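To make the setup concrete, the toy sketch below simulates a single bandit agent with an assumed time-varying learning rate in a non-stationary environment; the full platform would couple multiple biologically plausible agents and then reverse-engineer their learning rules.

```python
# Toy sketch of one building block for the platform: a two-armed bandit agent
# with a time-varying learning rate in a non-stationary environment.
# The learning-rate schedule and task parameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_trials = 500
p_reward = np.array([0.8, 0.2])          # reward probabilities; swapped mid-session

Q = np.zeros(2)
choices, rewards = [], []
for t in range(n_trials):
    if t == n_trials // 2:
        p_reward = p_reward[::-1]        # environment change (reversal)
    alpha = 0.5 * np.exp(-t / 200)       # time-varying learning rate (assumed form)
    beta = 3.0                           # softmax inverse temperature
    p_choice = np.exp(beta * Q) / np.exp(beta * Q).sum()
    a = rng.choice(2, p=p_choice)
    r = float(rng.uniform() < p_reward[a])
    Q[a] += alpha * (r - Q[a])           # delta-rule value update
    choices.append(a)
    rewards.append(r)

print("reward rate:", np.mean(rewards))
```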
– ML/AI, deep learning models (prior experience in reinforcement learning (RL) models is strongly preferred), advanced programming, foundational knowledge in linear algebra, calculus, and statistics.
In our lab we record fluctuations in neurotransmitter levels in the brains of mice in real time while they undergo tests of learning and decision-making. This is accomplished by measuring fluorescent signals from genetically encoded optical biosensors using fiber photometry (Simpson et al., Neuron. 2024, PMID: 38103545). Our overall goal is to understand the neurobiological basis of behaviors disrupted in psychiatric and neurological disorders. The specific aim of this project is to determine how expecting effort influences choice. Effort-based decision making is variable across healthy individuals (from “work-shy” to “workaholic”). For many psychiatric patients an exaggerated weighting of anticipated effort results in debilitating apathy and amotivation.
We collected dopamine recordings from multiple brain regions simultaneously in mice performing effort-based decision tasks in our custom automated test chambers. The DSI Scholar will work on this dataset together with the PI and the lab members that designed, collected, and pre-processed the data. The scholar will use non-linear multiple regression to determine which task events and behavioral measures (including dichotomous and continuous variables) predict the dopamine signals in each brain region. Because some segments of the behavior are self-paced, we will use dynamic time-warping to align some events. Because different physiological processes modulate dopamine release on different timescales, we will also perform dynamic regression modeling by adding lags as explanatory variables.
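A rough sketch of the lagged (distributed-lag) regression step, using ordinary least squares for simplicity; the event names, lag range, and file layout are placeholders, and the actual models will also include non-linear terms.

```python
# Sketch of a dynamic (lagged) regression: predict a dopamine trace from
# task-event regressors at several lags. Column names and lag range are placeholders.
import numpy as np
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("session_timeseries.csv")
# columns: dopamine, cue, lever_press, reward  (one row per time bin)

max_lag = 10
X = pd.DataFrame(index=data.index)
for event in ["cue", "lever_press", "reward"]:
    for lag in range(max_lag + 1):
        X[f"{event}_lag{lag}"] = data[event].shift(lag)  # lagged explanatory variables

mask = X.notna().all(axis=1)                             # drop rows lost to shifting
model = sm.OLS(data.loc[mask, "dopamine"], sm.add_constant(X[mask])).fit()
print(model.params.filter(like="reward_lag"))            # estimated reward kernel
```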
Expected Outcomes:
– Python, multiple regression models, time series analysis, documentation (e.g., Jupyter notebooks), and data/code sharing platforms (e.g., OSF, GitHub).
– Neuroscience background knowledge is a plus, but basic biology will suffice (all aspects of the data collection and biological relevance will be explained).
– An interest in psychology/psychiatry related research and a desire to work collaboratively with the research team (including undergrads, grad students, postdocs, and associate research scientists).
Department: Advanced Consortium on Cooperation, Conflict and Complexity (AC4)
“Hate Speech” in on-line media can incite conflict and violence in real life. We study “Peace Speech” that leads to positive prosocial behaviors that support sustainable peaceful conditions in nations throughout the world. Our interdisciplinary team at the Advanced Consortium on Cooperation, Conflict and Complexity (AC4) at the Columbia Climate School includes researchers in psychology, social psychology, environmental sustainability, natural resource governance, applied anthropology, and data science. We have already successfully used machine learning to identify the words in on-line news media that best classify countries as lower or higher peace, published in 2023 in PLOS ONE, https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0292604.
In those and subsequent studies, we used logistic regression, random forest, XGBoost, SVM, BERT, and XLNet to analyze on-line data from both news media and social media. In this new work we will substantially extend those studies by using new and powerful methods from Artificial Intelligence (AI): 1) to further identify the linguistic differences between lower and higher peace societies, 2) to reveal the social processes that underlie those linguistic differences, and 3) to create a real-time dashboard of the levels of peace and the processes that support them. To accomplish these tasks we will use pre-trained AI systems, such as ChatGPT, Claude, and Bard, as well as fine-tuning those systems with additional data from studies of the social psychology of peace. Because of implicit and/or explicit bias in the data used to train those proprietary models and the “guardrails” that limit their responses, we may need to explore our own training of open-source models such as Llama 2 from Meta and Mixtral from Mistral. This work will advance our scientific understanding of the social factors that enhance peace as well as provide valuable, practical insights for policy makers to support peace.
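For orientation, the earlier classification work this project builds on can be sketched as a TF-IDF plus logistic regression pipeline like the one below; the input file and labels are placeholders, not the published dataset.

```python
# Sketch of the earlier classification step: TF-IDF features from country-level
# media text with a logistic regression classifier. Inputs are placeholders.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

df = pd.read_csv("country_media_samples.csv")   # columns: text, peace_label
pipe = make_pipeline(
    TfidfVectorizer(max_features=20000, ngram_range=(1, 2)),
    LogisticRegression(max_iter=2000),
)
print(cross_val_score(pipe, df["text"], df["peace_label"], cv=5).mean())
```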
Required Skills: Fluency in Python, natural language processing (NLTK, spaCy, BERT, XLNet), longitudinal analysis (time series), machine learning (logistic regression, random forest, XGBoost, SVM, TensorFlow, PyTorch). The project will be centered on using AI, so familiarity with models like OpenAI’s ChatGPT, Anthropic’s Claude, Google’s Bard, Meta’s Llama, and Mistral’s Mixtral, and with tools like prompt engineering in AI chat models and fine-tuning and training methods using LangChain, vector databases such as Pinecone, and models like davinci-002 and Ada, will be very helpful. The short-term goals are to use AI to characterize the properties of “Peace Speech” and to identify the social processes that they represent. The longer-term goal is to create a user-friendly dashboard to monitor the current levels of peace in societies for academic research and policy makers.
School: School of Social Work
Department: Social Work
We are conducting a pilot study designed to assess the feasibility and potential promise of a large language model (LLM)-based artificial intelligence (AI) chatbot approach to assist current/future service providers working with LGBTQ+ populations in learning and utilizing the latest science-based knowledge about LGBTQ+ issues and intervention. During the June – August 2024 period, the project goals are to finalize and implement evaluation and benchmarking tools to assess the quality (i.e., validity/accuracy compared to the empirical scientific knowledge base) and utility of outputs from popular and promising existing LLM AI chatbots. The evaluation and benchmarking tools involve human- and machine-driven approaches. We seek a skilled DSI-supported student to assist and participate, particularly in the machine-driven evaluation/benchmarking, as well as in identifying the prevalence of chatbot hallucinations and the conditions that produce them. As interested and appropriate, the student could also assist with the wider study activities (e.g., human-driven evaluation/benchmarking, developing and/or training with an appropriate corpus, publication and presentation of findings, and grant writing). We also anticipate areas where the student may contribute/develop their areas of interest/specialization outside of the current pilot study (e.g., methodological issues with experimental research with LLM AI chatbots). We note that there should be DSI Scholar-appropriate work in the September – December 2024 timeframe as well.
The types of tasks that might be required to fulfill the study aims include:
– Implement machine-driven evaluation of the study’s selected chatbots (a minimal scoring sketch follows the lists below)
– Use appropriate statistical or data analysis tools
– Learn and contribute regarding data provenance and detailed records of research procedures
– Ensure all research activities comply with ethical, equity, and safety standards.
– Attend project team meetings and perform administrative activities as needed
– Contribute and/or lead presentations and publication of findings, implications, etc.
Required Skills:
– Understanding of concepts and techniques used in LLMs and LLM implementation.
– Skills in text processing, language modeling, and understanding the nuances of human language (natural language processing).
– Programming and Software Engineering: Proficiency in programming languages like Python and knowledge of software development practices and tools.
– Ability to work with large datasets, including data cleaning, analysis, and visualization.
– Designing robust and scalable systems to support machine learning applications.
– Knowledge of cloud services and distributed computing for training and deploying large language models.
– Skills in designing user-friendly interfaces and understanding user needs for applications like ChatGPT.
– Keeping up with the latest AI research and being able to implement or adapt new findings.
– Understanding of the ethical, equity, and safety implications of AI and developing systems responsibly.
– Working effectively in multidisciplinary teams and communicating complex concepts clearly.
– Writing and analytical skills.
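A minimal sketch of the machine-driven scoring mentioned in the task list above: it compares chatbot answers with expert-written reference answers using TF-IDF cosine similarity as a crude proxy for validity, flagging low-similarity answers for human review. The study's actual benchmarking tools are still being finalized, so this is an assumption-laden stand-in, not the project's method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similarity_scores(chatbot_answers, reference_answers):
    """Score each chatbot answer against its expert-written reference answer
    with TF-IDF cosine similarity; low scores flag candidates for human review
    (e.g., possible hallucinations)."""
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(chatbot_answers + reference_answers)
    n = len(chatbot_answers)
    return [float(cosine_similarity(X[i], X[n + i])[0, 0]) for i in range(n)]
```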
School: Arts and Sciences / Climate School
Department: Earth and Environmental Science / LDEO
The ocean significantly mitigates climate change by absorbing fossil fuel carbon from the atmosphere. Cumulatively, since preindustrial times, the ocean has absorbed 40% of emissions. Marine Carbon Dioxide Removal (mCDR) refers to proposed engineered efforts to supplement the ocean’s natural uptake of anthropogenic CO2 from the atmosphere. A major challenge for mCDR is to quantify the additional carbon removal from the atmosphere given the large natural background carbon sink. Better understanding of the natural air-sea CO2 fluxes at regional scales is therefore required before mCDR additionality can be quantified.
To understand past changes, diagnose ongoing changes, and to predict the future behavior of the ocean carbon sink, we must understand its spatial and temporal variability. However, the ocean is poorly sampled and so we cannot do this directly from in situ measurements. In the McKinley group, we have developed several data science techniques to reconstruct ocean carbon data based on association to satellite-based full-field driver data. With this project, we wish to determine how well current and future ocean carbon observations can constrain background air-sea CO2 fluxes in potential mCDR deployment regions.
In summer 2024, the DSI Scholar will begin by learning about methods and data needed for this project, such as the pCO2-Residual product (Bennington et al. 2022, JAMES, doi:10.1029/2021MS002960) and output from Earth System Models (ESMs). They will review existing code and help develop improved workflows with a strong focus on data sharing and reproducibility. We look forward to their expertise helping us improve machine learning methods in order to produce pCO2 products at smaller scales, specifically in areas of potential mCDR deployment. The student will also contribute to analysis of the reconstructed ocean carbon data and be included in publications resulting from this work.
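A simplified stand-in for the kind of reconstruction described above (the actual pCO2-Residual product uses a more elaborate workflow): a random-forest regressor mapping satellite-based driver variables to pCO2 observations, with variable names and the train/test split chosen purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# X: satellite-based driver variables (e.g., SST, chlorophyll, mixed-layer depth)
# per grid cell and month; y: collocated pCO2 observations (illustrative layout)
def fit_pco2_model(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = RandomForestRegressor(n_estimators=300, random_state=0, n_jobs=-1)
    model.fit(X_train, y_train)
    rmse = float(np.sqrt(np.mean((model.predict(X_test) - y_test) ** 2)))
    return model, rmse
```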
Required Skills: Fluency in Python, experience with foundational ML
This highly innovative and significant Data Science Institute Seed Project will use a machine learning informed natural language processing (NLP) approach to qualitatively identify patterns of and reasons for engaging in opioid-related polysubstance use, along with narratives around overdose and HIV risk behaviors, from publicly available discussion forums on Reddit. This popular social media platform provides a ready-made source of abundant, naturalistic, first-person narratives for understanding substance use behaviors and patterns. The work takes an interdisciplinary approach by integrating data science, substance use epidemiology, and public health to improve our understanding of polysubstance use patterns. We propose to use a human-in-the-loop machine learning approach, specifically NLP methods, to analyze unstructured Reddit comments: automatically clustering large volumes of similar unstructured text, unearthing latent patterns of polysubstance use, and qualitatively exploring the resulting trends, patterns, and themes. Data collection for this project will rely on a “human-in-the-loop” or “supervised” natural language approach with the following steps:
1. data retrieval from opioid-related subreddits of interest,
2. feed the algorithm key drug terms to develop polysubstance use topics,
3. use the algorithm-developed topics to extract the polysubstance-relevant subset of data,
4. select a random sample of the data, and
5. conduct a rapid review of the sample.
We will follow steps two through five until the random sample consists of posts about polysubstance use, overdose, and HIV-related behaviors. Data will be analyzed using directed content analysis, with Latent Dirichlet Allocation (LDA) used to infer latent substance use topics from the comments posted by redditors. Four focus groups of four to eight participants each will be recruited to ecologically validate the NLP findings and capture the lived experiences of people who use drugs and engage in opioid-related polysubstance use.
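A minimal sketch of the LDA step in the workflow above, using scikit-learn's implementation on a bag-of-words representation of Reddit comments; the vocabulary filters and the number of topics are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def fit_lda(comments, n_topics=10, n_top_words=12):
    """Fit LDA to Reddit comments and return the top words per latent topic."""
    vectorizer = CountVectorizer(max_df=0.95, min_df=5, stop_words="english")
    doc_term = vectorizer.fit_transform(comments)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(doc_term)
    vocab = vectorizer.get_feature_names_out()
    topics = [[vocab[i] for i in comp.argsort()[-n_top_words:][::-1]]
              for comp in lda.components_]
    return lda, topics
```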
Required Skills: Fluency in R/Python, methods for Natural Language Processing, Latent Dirichlet Allocation (LDA), sentiment analysis, supervised and unsupervised machine learning, predictive modeling, etc.
School: School of Engineering and Applied Science
Department: Civil Engineering and Engineering Mechanics
The reliability of public charging infrastructure is paramount for the successful transition to road transportation electrification. Consumers need to perceive it as dependable to consider shifting to electric vehicles (EV) or avoid reverting back to internal combustion engines. To ensure a reliable charging infrastructure network, faulty or unusable chargers need to be swiftly identified and repaired.
While standard monitoring can detect several failures, such as those in software and the electrical system, other failures like broken connectors or physical impediments hindering drivers from successfully charging are not currently captured [1]. Addressing these issues often requires expensive physical monitoring or relies on customer reports. However, a shift from typical charging point utilization may indicate the potential presence of undetected faults.
This project aims to explore a variety of alternative unsupervised learning techniques for anomaly detection. The goal is to identify and predict anomalous EV charging point use in public charging points using publicly available charging transaction data. The project will also analyze the relationship between the occurrence of anomalies or their duration and the characteristics of the charging point, such as venue type, location, and pricing category. Additionally, normal utilization metrics will be examined to identify any patterns related to maintenance issues.
The project utilizes public nationwide data from the US Department of Energy (specifically the EV-WATTS datasets) as well as other datasets.
[1] Karanam, V., Tal, G. (2024) Enhancing Electric Vehicle Charger Reliability: Developing a Tool to Swiftly Detect Hidden Charger Faults, Poster Presentation, 2024 TRB Annual Meeting.
Required Skills: The student will be proficient in Python programming and time series data modeling. LSTM autoencoders are among the time series anomaly detection techniques that will be tested.
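A hedged sketch of one of the techniques named above, an LSTM autoencoder for time-series anomaly detection in PyTorch; the window length, feature layout, and error-threshold rule are illustrative assumptions rather than project specifications.

```python
import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    def __init__(self, n_features: int, hidden_size: int = 32):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.decoder = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.output = nn.Linear(hidden_size, n_features)

    def forward(self, x):                       # x: (batch, seq_len, n_features)
        _, (h, _) = self.encoder(x)             # h: (num_layers, batch, hidden)
        seq_len = x.size(1)
        # repeat the bottleneck state across the sequence and decode
        z = h[-1].unsqueeze(1).repeat(1, seq_len, 1)
        dec, _ = self.decoder(z)
        return self.output(dec)

def reconstruction_errors(model, windows):
    """Mean squared reconstruction error per window; high values flag anomalies."""
    model.eval()
    with torch.no_grad():
        recon = model(windows)
        return ((windows - recon) ** 2).mean(dim=(1, 2))

# usage sketch: windows = tensor of shape (n_windows, 24, 1) holding hourly
# utilization; train with MSELoss, then flag windows whose error exceeds,
# e.g., the 99th percentile of training errors.
```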
Department: Emergency Medicine
In febrile infants younger than 30 days, lumbar puncture (LP) is a procedure routinely performed to evaluate for meningitis. LPs are mainly performed in the emergency setting by clinicians and trainees. However, novice success rates are historically poor, with over 60% failure rates that can lead to diagnostic uncertainty, prolonged pain, and unnecessary resource utilization. Reduction of unsuccessful and traumatic LPs in infants can improve diagnostic ability and reduce patient harm. Ultrasound performed at the point of care has the potential to increase LP success rates through improved visualization of the anatomy; however, it is dependent on the skill of the operator to interpret findings accurately, thereby limiting its efficacy in the population of providers that most needs it.
The main purpose of this project is to use a pre-existing ultrasound database of ultrasound spinal anatomy videos to develop an artificially intelligent algorithm that can identify the important anatomic structures for planning an infant lumbar puncture procedure.
We have already successfully designed a binary classification system using a limited dataset. Our next step is to work on object localization to help identify specific anatomic features of interest.
The specific aim is to design an object localizer for specific spinal anatomy using a corpus of ultrasound data and test accuracy of algorithmic feature recognition against expert labels in a hold-out set. Our secondary aim is to deploy the algorithm on a website or tablet to test real-time processing of ultrasound data.
To fulfill this aim, the team will need to achieve the following tasks:
1. Assist with object-level annotation of features
2. Use machine learning to develop intelligent algorithm for automated feature recognition
3. Test algorithm accuracy against expert gold standard
4. Deploy algorithm on website or local tablet to test real-time processing of data
We have a labelled dataset of 1515 frames with binary classification of anatomic features and an augmented dataset of 11224 frames.
Our desired end goal is a functional algorithm that can identify key features on spinal anatomy on ultrasound at a threshold of >95% accuracy.
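One plausible starting point for the object-localization aim, sketched below: fine-tuning a pretrained Faster R-CNN from torchvision with a new prediction head sized for the spinal anatomy classes. The model choice and class count are assumptions; the project may well use a different detector.

```python
from torchvision.models.detection import fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def spinal_localizer(n_classes: int):
    """Start from a COCO-pretrained Faster R-CNN and swap in a new box-prediction
    head sized for the spinal anatomy classes plus background."""
    model = fasterrcnn_resnet50_fpn(weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT)
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, n_classes)
    return model

# training expects, per image, a dict with "boxes" (x1, y1, x2, y2) and "labels"
# built from the object-level annotations in task 1; accuracy against the expert
# hold-out set could then be reported as, e.g., mean average precision.
```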
Required Skills: Experience working with various ML/AI models (e.g., ResNet, AlexNet, VGG), documentation/organization skills (e.g., Jupyter notebooks, GitHub), HTML (optional, for the real-time algorithm).
School: Columbia College
Department: Latin American and Iberian Cultures
Project Overview: This project forms part of my book manuscript, Sorcery and the City in Post-Slavery Brazil. My project analyzes 135 witch trials that occurred during the first half of the Twentieth-century in Brazil to better understand why colonial anti-witchcraft made a comeback during the first decades of abolition and the first Brazilian Republic. My thesis is that witchcraft accusations were a means to uphold spatial and social divides and segregate cities like Rio de Janeiro, without the need to create racial segregation in written law. Witch hunts allowed the police to uphold state ideologies of racial and class divisions.
The main type of data I have collected are street addresses of where accused witches lived in Rio de Janeiro during the period from 1881-1942. I would like to work with a data science assistant to help me map these addresses onto old and contemporary maps of Rio de Janeiro to do a spatial-historical analysis to determine if these witch hunts did indeed reinforce spatial divides and segregation.
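A minimal sketch of how the address mapping could start, assuming a list of street-address strings: geocode each address against a modern gazetteer with geopy and plot the points on a contemporary base map with folium. Historical maps would still need to be georeferenced separately (for example in QGIS), and the geocoder, coordinates, and file name here are illustrative.

```python
import time
import folium
from geopy.geocoders import Nominatim

def map_addresses(addresses, outfile="rio_addresses.html"):
    """Geocode historical street addresses against the modern gazetteer and
    plot them on a contemporary base map of Rio de Janeiro."""
    geocoder = Nominatim(user_agent="sorcery-and-the-city")   # hypothetical app name
    m = folium.Map(location=[-22.91, -43.20], zoom_start=12)  # central Rio de Janeiro
    for addr in addresses:
        loc = geocoder.geocode(f"{addr}, Rio de Janeiro, Brazil")
        if loc:
            folium.Marker([loc.latitude, loc.longitude], popup=addr).add_to(m)
        time.sleep(1)   # be polite to the free geocoding service
    m.save(outfile)
```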
Required Skills: The research assistant should have cartography/mapping skills to visualize geographical space (Rio de Janeiro city, state, and neighborhoods). The RA should be able to work with historical maps, create maps, and use contemporary maps (e.g., Google Maps).
School: Graduate School of Arts and Science
Department: Ecology, Evolution and Environmental Biology
Project Overview: The Urban Wildlife Information Network (UWIN) was created by the Urban Wildlife Institute at the Lincoln Park Zoo as an alliance of urban wildlife scientists committed to conducting research to enhance our knowledge of urban wildlife and their relationships with people. While the UWIN project spans multiple universities and other stakeholders across the world, within NYC alone we at the Eco-Epidemiology Lab at Columbia University have a transect of nearly 50 wildlife cameras placed in parks and greenspaces along an urbanization gradient from Brooklyn to the furthest reaches of Nassau County. Our intent is to measure the effects of human occupancy and degrees of urbanization on wildlife and disease vectors, specifically their species richness and abundance.
A study of this scale comes with an ever-increasing amount of data, and in our case this data comes in the form of hundreds of thousands of images of NYC’s local wildlife! Processing this information has traditionally been done by staff, students, and volunteers poring through these images and identifying the number and species of wildlife in each one. We are modernizing our approach with machine-learning AI technologies (such as MegaDetector) to automatically detect and identify the species and quantity of wildlife present in these images, then attach this information to the image’s metadata and upload it to the larger inter-city UWIN database. While we plan for this project to continue for many years, we are looking for students now to help create and implement a machine-learning model to identify and catalog our current and future sets of raw images by training said model on our 200,000+ already manually processed images, as well as to develop a pipeline to automate the processing and re-training of the model on future sets of images.
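A hedged sketch of the species-identification piece: transfer learning with a pretrained ResNet-18 from torchvision, retraining only the final layer on the lab's already-labeled images. The backbone, freezing strategy, and the detect-then-classify ordering noted in the comments are assumptions about one workable pipeline, not the lab's chosen design.

```python
import torch.nn as nn
from torchvision import models

def species_classifier(n_species: int):
    """Transfer-learning baseline: reuse an ImageNet-pretrained ResNet-18 and
    retrain only the final layer on already-labeled camera-trap images."""
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    for p in model.parameters():
        p.requires_grad = False            # freeze the pretrained backbone
    model.fc = nn.Linear(model.fc.in_features, n_species)
    return model

# In practice a detector such as MegaDetector could first crop animals out of
# each frame; those crops would then be fed to this classifier during training
# and inference.
```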
Required Skills: Some Python coding experience is required, and anything beyond is a plus. Previous experience with machine-learning and/or image analysis is preferred, but not necessary. No previous knowledge of wildlife identification or ecological principles is needed, but an interest in the natural sciences and local wildlife is highly encouraged.
Department: Pediatrics
We are looking for a student who will join our studies on the impact of the prenatal environment on brain development. We have developed and are studying a unique mouse model for placental dysfunction that has autism-like behaviors, particularly in male offspring (Vacher et al., Nat Neurosci, 2021). We have RNA sequencing data from multiple brain regions from both mice that had placental insufficiency and matched controls across development. We have examined some of these data sets already but we now aim to analyze the RNA sequencing data specifically from the hippocampus, a critical brain region involved in memory and mood regulation.
The student will utilize bioinformatics tools to analyze RNA sequencing data from the mouse hippocampus. They will identify genes and pathways that are differentially expressed and associated with placental dysfunction and autism. This analysis will be conducted at different developmental stages to identify any deviations in the developmental trajectory of the hippocampus in our autism model compared to neurotypical brains. The project will also investigate the influence of biological sex, a significant factor in autism. Furthermore, the student will perform statistical analyses to determine the significance of the findings, taking into account variables such as genotype, sex, and age.
Expected Outcomes:
– Identification of differentially expressed genes and pathways associated with placental dysfunction and autism in the hippocampus
– Insights into the molecular mechanisms underlying the link between placental dysfunction and autism
– Contribution to scientific knowledge through research publications and presentations
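The project itself calls for R packages such as DESeq2, but as a rough, simplified illustration of differential expression testing, the Python sketch below runs gene-wise Welch t-tests with Benjamini-Hochberg correction; it deliberately omits the genotype/sex/age modeling that the real DESeq2 analysis would include.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def simple_de_test(counts_case, counts_ctrl, alpha=0.05):
    """Very simplified stand-in for the DESeq2 workflow: gene-wise Welch t-test
    on log counts with BH correction. counts_case / counts_ctrl are
    (n_genes, n_samples) arrays of normalized counts (illustrative layout)."""
    log_case = np.log2(counts_case + 1)
    log_ctrl = np.log2(counts_ctrl + 1)
    _, pvals = stats.ttest_ind(log_case, log_ctrl, axis=1, equal_var=False)
    reject, padj, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
    log2fc = log_case.mean(axis=1) - log_ctrl.mean(axis=1)
    return log2fc, padj, reject
```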
Required Skills: The student should be proficient in R. Familiarity with R packages for RNAseq analysis such as DESeq2, ggplot, and GSEA, as well as visual presentation of sequencing data is a plus. Interest in developmental biology, neuroscience or medicine would be advantageous.
Department: Zuckerman Mind Brain Behavior Institute | SNF Center for Precision Psychiatry & Mental Health
The nervous system gives rise to behavior, and behavior reflects pathological brain function. Understanding the pathophysiology underlying mental disorders and providing innovative therapeutic avenues requires the detailed study of symptomatology in experimental animal models. During the last decade, pose estimation approaches have been revolutionizing animal tracking. Researchers from Columbia University have developed a state-of-the-art machine learning package, Lightning Pose (LP), that tracks the pose of freely-moving animals, enabling the study of behavior with unprecedented accuracy. This package provides the 3D coordinates of the body parts of behaving mice, which can then be subjected to various sophisticated analyses of behavior, including its dissection into regressive modules and the analysis of their transition probabilities through sequences of behavior.
The present project aims to analyze the LP-generated mouse pose data using available (Keypoint-Moseq, VAME), currently in development (Lightning), and custom-made machine learning or mathematical and statistical modeling analysis pipelines. This will allow us to gain novel insights into the behavioral effects of rare mutations that are considered among the strongest etiological factors for schizophrenia currently identified, and to associate those effects with disease symptomatology. Additionally, this will allow us to configure a high-throughput working pipeline to assess the effects of conventional and innovative experimental therapeutic approaches in the framework of precision psychiatry.
The student will be working with csv files containing multivariate time series of x,y coordinates for a set of mouse body parts extracted from video data via previously existing algorithms (LP). In this context, the student will use Python to apply mathematical and statistical tools under the guidance of the supervisor in order to:
– Detect repetitive patterns in the mouse pose that lead to the identification of behavioral modules/motifs (a minimal sketch follows this list).
– Assess transition probability across these patterns.
– Highlight the differences among different experimental groups (i.e. mutant or drug-treated mice).
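A minimal sketch of the first two tasks listed above: cluster short sliding windows of the LP pose coordinates into candidate motifs with k-means, then estimate the motif-to-motif transition matrix. Window length, motif count, and the non-overlapping windowing are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# pose: (n_frames, n_bodyparts * 2) array of x,y coordinates loaded from an LP csv
def segment_motifs(pose, window=15, n_motifs=8, random_state=0):
    """Cluster short non-overlapping windows of pose into candidate behavioral motifs."""
    windows = np.stack([pose[i:i + window].ravel()
                        for i in range(0, len(pose) - window, window)])
    return KMeans(n_clusters=n_motifs, n_init=10, random_state=random_state).fit_predict(windows)

def transition_matrix(labels, n_motifs):
    """Empirical probability of moving from motif i to motif j."""
    T = np.zeros((n_motifs, n_motifs))
    for a, b in zip(labels[:-1], labels[1:]):
        T[a, b] += 1
    row_sums = T.sum(axis=1, keepdims=True)
    return np.divide(T, row_sums, out=np.zeros_like(T), where=row_sums > 0)
```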
Required Skills:
– The student is expected to have fluency in Python (numpy, scipy, pandas, matplotlib, seaborn), with experience in code writing, pipeline building, and debugging.
– Basic statistics, including an understanding of significance testing.
– Basic machine learning (linear regression and classification, clustering).
– Experience with deep learning is a plus (training and evaluating models on GPUs).
– Experience modeling time series is a plus (RNNs, NLP/text analyses, HMMs, Kalman Filters).
– Importantly, the student should be interested in applying their skills to psychiatric neuroscience, and to actively participate in a collaborative working environment.
School: Columbia University Irving Medical Center
Department: Department of Biomedical Informatics
Electronic health records (EHR) provide a population-scale resource to improve the diagnoses of rare diseases, which go unrecognized by most providers due to lack of familiarity. This project aims to leverage cutting-edge biomedical informatics and data science methodology to develop, validate, and demonstrate the clinical utility of an EHR-driven approach for rare diseases clinical decision support systems. Support for diagnosis of rare diseases will enable patients and providers to move efficiently beyond diagnoses to treatments and support for their condition. The types of tasks include training early diagnostic models using EHR data, optimizing the model to overcome any potential bias across different genetic ancestries, and developing visualization tools to provide an explanatory dashboard for clinical decision support. In this project, our aim is to develop a methodology to efficiently identify potential rare disease candidates from large EHR pools. The identified dataset will subsequently undergo manual review and labeling to serve as a training dataset for other supervised learning tasks. The end goal includes manuscript submission and a reproducible pipeline that can be generalized to other external institutions.
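A hedged baseline for the early diagnostic modeling described above: a scikit-learn pipeline that one-hot encodes categorical EHR features, scales numeric ones, and fits a class-weighted logistic regression as a rare-disease candidate screen. Column groupings and the model choice are illustrative assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

def candidate_screen_model(features: pd.DataFrame, labels, categorical, numeric):
    """Baseline screening classifier for rare-disease candidates built from coded
    EHR features; column names and the model choice are illustrative."""
    pre = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("num", StandardScaler(), numeric),
    ])
    model = Pipeline([
        ("pre", pre),
        ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
    ])
    auc = cross_val_score(model, features, labels, cv=5, scoring="roc_auc").mean()
    return model.fit(features, labels), float(auc)
```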
– Proficiency in programming languages such as R and Python is essential, including familiarity with packages such as pandas or dplyr. The student should be able to write clean, efficient code to extract insights from data. Experience working with diverse datasets, including longitudinal data and structured/unstructured sources, is highly valuable. The student should possess the ability to clean, preprocess, and integrate data effectively. Skills in data visualization tools and libraries (e.g., Matplotlib, Seaborn, ggplot2) are a big plus.
– A strong foundation in machine learning and statistical analysis is necessary for building predictive models, conducting hypothesis testing, and extracting meaningful patterns from data.
– Skills with front-end app development (React) or experience with Javascript will be a big plus
– Experience with natural language processing and large language models will be a big plus
Department: School of Nursing
Average Hours per Week: Approximately 10
Stipend Amount: $3,000
Project Overview: The DSI Scholar will work on cleaning and analyzing data for two projects focused on the influence of daily discrimination on cardiovascular disease risk. The first project is our DSI Seed Grant, which was a 30-day daily diary study that investigated the impact of daily discrimination on sleep health in a sample of Black and Latinx LGBTQ+ Adults. Specifically, we plan to use unsupervised machine learning to identify sleep phenotypes and their associations with daily discrimination. The next project, which was recently funded by the National Heart, Lung, and Blood Institute, is a 1-week daily diary study that investigates the influence of anticipated and vicarious discrimination on home blood pressure. The DSI Scholar will assist our team with developing and maintaining our study database, completing ongoing data cleaning, tracking data collection, developing code for data analysis, and addressing any data concerns (as needed).
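A minimal sketch of the unsupervised sleep-phenotyping idea, assuming a matrix of per-participant daily sleep metrics is already assembled: fit Gaussian mixture models of increasing size and keep the one with the lowest BIC. The features, model family, and selection criterion are illustrative, not the study's analysis plan.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture

def sleep_phenotypes(sleep_features: np.ndarray, max_k: int = 6):
    """Standardize daily sleep metrics (e.g., duration, efficiency, onset latency)
    and choose a Gaussian-mixture solution by BIC; returns the model and labels."""
    X = StandardScaler().fit_transform(sleep_features)
    fits = [GaussianMixture(n_components=k, random_state=0).fit(X)
            for k in range(2, max_k + 1)]
    best = min(fits, key=lambda m: m.bic(X))
    return best, best.predict(X)
```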
Required Skills: The DSI Scholar should have fluency in R/Python and an interest in health disparities research. Familiarity with machine learning and multilevel modeling is preferred but not necessary. We have longitudinal sensor and daily diary data that the DSI Scholar will help analyze but prior experience is not needed.
Department: Lamont-Doherty Earth Observatory
Stipend Amount: $3000
Project Overview: Our group has recently developed a new method for computing the flow of light through the earth’s atmosphere (https://doi.org/10.1029/2023MS003819) – a task that’s key to climate projections and weather forecasting. The method relies on data-driven optimization: one defines a set of states over which to optimize, makes detailed, computationally expensive reference calculations based on those states, then identifies a very small optimal subset of the reference calculations that can be used as a proxy for the fully detailed calculations. The method is appealing in part because it’s flexible – it can be applied to arbitrary conditions with arbitrary cost functions for optimization.
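A toy illustration (not the published method) of the subset-selection idea described above: greedily grow a small subset of expensive reference calculations whose aggregate best approximates the full set under a user-supplied cost function. The aggregation by simple averaging and the example cost are assumptions made only to keep the sketch short.

```python
import numpy as np

def greedy_subset(reference, k, cost):
    """Greedily pick k reference calculations whose mean best approximates the
    mean of the full reference set under `cost`.
    reference: (n_states, n_outputs) array of expensive reference results."""
    target = reference.mean(axis=0)
    chosen, remaining = [], list(range(len(reference)))
    for _ in range(k):
        best_i, best_c = None, np.inf
        for i in remaining:
            proxy = reference[chosen + [i]].mean(axis=0)
            c = cost(proxy, target)
            if c < best_c:
                best_i, best_c = i, c
        chosen.append(best_i)
        remaining.remove(best_i)
    return chosen

# example cost: mean squared error between the proxy and the full-detail result
mse = lambda a, b: float(np.mean((a - b) ** 2))
```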
We’d like to make it easier for people to use this idea for their own purposes, starting with using the tools ourselves to do a more complete and complicated version of the idealized problem we first took on. One task will be taking the original set of (clean, modular!) Python scripts and Jupyter notebooks and developing these into a fully general Python package that can be distributed via PyPI and Conda for wider use. During the course of this development we’ll apply the tools to the complete range of greenhouse gases in the atmosphere, which may require identifying or developing smarter ways of allowing many small contributors to vary at once.
If successful the project stands to have an immediate impact – the group has collaborators at both weather forecasting and climate modeling centers who are interested in using a mature version of this technique.
Required Skills: The project requires fluency in scientific Python, the ability to refactor code from scripts into Python modules, and the willingness to develop automated testing, packaging, and distribution. Ability and willingness to discuss the underlying physical science would be an advantage.
School: Vagelos College of Physicians & Surgeons
Department: Medicine
Project Overview: Diagnostic errors affect up to 12 million adults per year and can result in serious harm or death. Incorrectly ordered imaging tests are a major cause of missed diagnoses; however, little is known about why these errors occur. Current methods for measuring imaging order errors are limited by reporting bias and the need for chart review. To address these gaps, I propose applying an innovative, systematic approach, the Retract-and-Reorder (RAR) method, to develop automated measures that identify imaging order errors. Electronic health record (EHR) data will be queried to identify imaging RAR events, defined as imaging orders placed, retracted, and subsequently reordered for the same patient with an element of the order changed. We aim to use the RAR method to detect imaging order errors with high accuracy. I aim to develop the first automated wrong-imaging order error measures to 1) examine the epidemiology of imaging order errors in a large healthcare system and 2) provide reliable outcome data for studies to trial system-level interventions to reduce these types of errors and improve diagnostic safety and accuracy. Specific tasks will include working with a preexisting relational database on a server from the Department of Biomedical Informatics. This database will have robust EHR clinical and log data. From this database, we will use data-driven methods to design queries for measures that identify diagnostic imaging order errors. We will use quantitative and qualitative analyses in a mixed-methods research approach to inform query specifications to identify these types of errors with high accuracy.
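A hedged sketch of how an imaging RAR query might look in pandas rather than SQL (column names, the time window, and the "element changed" test are all illustrative assumptions): pair each retracted imaging order with any later order for the same patient placed within a fixed window and differing in some attribute.

```python
import pandas as pd

def find_imaging_rar_events(orders: pd.DataFrame, window="2h"):
    """Flag imaging orders that were retracted and followed by a new imaging
    order for the same patient within `window`, with some element changed.
    Expected (illustrative) columns: patient_id, order_id, order_time,
    retract_time, modality."""
    retracted = orders.dropna(subset=["retract_time"])
    reordered = orders.rename(columns=lambda c: "re_" + c)
    pairs = retracted.merge(reordered, left_on="patient_id", right_on="re_patient_id")
    within = (pairs["re_order_time"] > pairs["retract_time"]) & \
             (pairs["re_order_time"] <= pairs["retract_time"] + pd.Timedelta(window))
    changed = pairs["re_modality"] != pairs["modality"]      # an element of the order changed
    distinct = pairs["re_order_id"] != pairs["order_id"]     # exclude self-matches
    return pairs.loc[within & changed & distinct,
                     ["patient_id", "order_id", "re_order_id"]]
```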
Required Skills: Fluency in SQL Server Management Studio is preferred, but not necessary. Fluency in SQL, Python, or R is also preferred, but not necessary.
International Students on F1 or J1 Student Visa: Not eligible
Department: Center for International Earth Science Information Network (CIESIN)
Project Overview: CIESIN is interested in identifying open plastic dumps that are potentially vulnerable to climate change. The health and environmental risks as well as social justice issues posed by open plastic dumps can be compounded by climate change events.
A DSI Scholar would provide coding and other technical support within the context of this global plastics project through two parallel work streams. The first is to extract values for land use disturbance, flooding, changes in rainfall and temperature extremes, and demographic information from large datasets and assign these to the polygons delineating plastic dump boundaries over time. The resulting dataset will be explored to identify the climate risks associated with individual open dumps and the populations that could be impacted. The expected platform to be used is Google Earth Engine, with coding in Python or Java. The second workstream is to locate and link plastic-trade-related import and export data to the relevant countries and potentially to the actual open plastic dumps.
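A minimal local sketch of the first workstream using rasterstats and geopandas instead of Google Earth Engine (the project's expected platform); the file paths, the hazard raster, and the summary statistics are illustrative assumptions.

```python
import geopandas as gpd
from rasterstats import zonal_stats

def climate_exposure(dump_polygons_path, flood_raster_path):
    """Attach a flood-exposure statistic to each open-dump polygon; the same
    pattern extends to rainfall/temperature extremes and population rasters."""
    dumps = gpd.read_file(dump_polygons_path)
    stats = zonal_stats(dump_polygons_path, flood_raster_path, stats=["mean", "max"])
    dumps["flood_mean"] = [s["mean"] for s in stats]
    dumps["flood_max"] = [s["max"] for s in stats]
    return dumps
```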
Required Skills: Fluency in scripting languages for data analysis; experience with import/export data preferred.
School: Teachers College
Department: Human Development
Project Overview: Generative AI has shown great promise for education, but who it might actually benefit in practice is a serious equity concern. This project aims to shed light on this dilemma by examining systematic disparities in public responses to generative AI in education, including 1) institutional academic policies; 2) students’ online discussions; and 3) relationships between these responses and institutional characteristics. Project tasks may include: 1) acquiring and cleaning large-scale text and administrative data via web scraping or APIs; 2) performing NLP tasks such as sentiment analysis and topic modeling, potentially using LLMs; and 3) statistical analyses, reporting, and data visualization, including geospatial mapping. The findings will provide solid empirical evidence on digital inequalities in the emergence of generative AI and inform best practices to improve educational equity through these technologies.
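As a hedged illustration of the sentiment-analysis task mentioned above, the sketch below uses the Hugging Face transformers pipeline with its default checkpoint; in practice the project would select a model suited to informal, education-related text, and the truncation length here is arbitrary.

```python
from transformers import pipeline

def score_posts(posts):
    """Attach a sentiment label and score to each student post; the default
    checkpoint is a placeholder and would be swapped for a task-appropriate model."""
    classifier = pipeline("sentiment-analysis")
    return [{"text": p, **classifier(p[:512])[0]} for p in posts]
```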
Required Skills: Qualified students should be skilled in NLP (with PyTorch, Hugging Face, etc.) and statistical methods (with R), and have a strong interest in computational social science and a passion for social good. The scholar will work with the research team to contribute to all aspects of the project and lead additional analyses. Intention to pursue a doctoral degree in the future is a plus.
Department: Genetics & Development (in Systems Biology)
Project Overview: We are seeking an enthusiastic and motivated undergraduate student to join our research team as an intern, focusing on the analysis of microscopy data to study chromosome rearrangement and loss of heterozygosity (LOH) after DNA damage. LOH is a principal driver of cancer progression, and understanding how it is generated after DNA damage has implications for cancer biology. This internship provides a unique opportunity to contribute to cutting-edge research in genetics. The selected candidate will work closely with experienced researchers and gain valuable skills in data analysis and scientific research techniques.
Key Responsibilities:
Microscopy Data Analysis: Analyze microscopy images to study chromosome structure and organization after DNA damage. This includes writing scripts for specialized software to quantify chromosomal aberrations, measure distances between specific chromosomal regions, and assess the overall impact of DNA damage on chromosome rearrangement (a minimal measurement sketch follows this list).
Data Interpretation: Interpret and document the results of microscopy analyses, identifying patterns and trends related to chromosome rearrangement. Identify data features for development of machine learning protocols to classify recombination outcomes. Collaborate with colleagues in Systems Biology to implement the algorithm to draw meaningful conclusions from the data and contribute to scientific discussions.
Literature Review: Stay up-to-date with relevant scientific literature on mitotic recombination, LOH, chromosome rearrangement and DNA damage. Summarize and present key findings to the research team.
Documentation: Maintain detailed records of analysis methods, results, and conclusions. Prepare comprehensive documentation and reports for inclusion in scientific publications.
Visualization: Generate clear and informative visual representations of the analyzed data, including graphs, charts, and figures, to facilitate data interpretation and presentation.
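A minimal sketch of the measurement scripting described under Microscopy Data Analysis above, assuming single-channel fluorescence images of labeled chromosomal loci: segment spots with an Otsu threshold and return pairwise centroid distances with scikit-image. Thresholding, the minimum-area filter, and pixel units are illustrative assumptions.

```python
import numpy as np
from skimage import io, filters, measure

def spot_distances(image_path, min_area=5):
    """Segment fluorescent loci in one image and return pairwise centroid
    distances (in pixels)."""
    img = io.imread(image_path, as_gray=True)
    mask = img > filters.threshold_otsu(img)
    labels = measure.label(mask)
    props = [p for p in measure.regionprops(labels) if p.area >= min_area]
    centroids = np.array([p.centroid for p in props])
    diffs = centroids[:, None, :] - centroids[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))
```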
Required Skills: Strong interest in genomics, DNA damage response, and chromosome biology and the desire to help develop large scale data analysis for a microscopy problem. Basic understanding of microscopy techniques, image analysis and familiarity with data analysis software and programming languages (such as Python, R, or ImageJ) would be a plus. Excellent attention to detail, analytical skills, and ability to work independently. Strong communication skills and ability to work effectively in a team-oriented environment.
School: Arts and Sciences
Department: Columbia Justice Lab
Project Overview: The Probation and Parole Reform Project (PPRP), housed in the Columbia Justice Lab, conducts actionable research that challenges the way probation and parole operate in the U.S. We envision a world where probation and parole are smaller, less punitive, equitable, and helpful, and where resources are invested directly to communities in ways that advance collective efficacy, opportunity, and racial equity. As a key part of this work, we seek to understand and publicize the full carceral impact of probation and parole policies, also known as community supervision – a key area of concern is jail detention for technical supervision violations.
While probation and parole were designed to divert people away from incarceration, community supervision is often attached to fees, curfews, and employment or programming mandates. When someone is unable to fulfill these conditions they become at risk of arrest or incarceration due to a technical violation of supervision requirements. Community supervision casts a wide net, surveilling three times as many people as there are in prisons. However, the number of people being incarcerated due to community supervision violations is not captured in current data or policy analysis.
The DSI Scholar will leverage recently available jail data to better capture the larger footprint of community supervision, and to identify inequalities in incarceration due to probation and parole across time and space. The dataset contains individual-level arrest data for probation and parole violations scraped daily from over 1000 publicly available jail rosters in the U.S. since 2019. The end goal would be to use this data to highlight and better understand the full scope of incarcerations due to technical violations, and to design empirically grounded policy recommendations on how to minimize incarceration and reduce racial inequalities within community supervision.
Required Skills: The scholar must be proficient in R and experience with Python is a plus. They should also have experience with web-scraping and database management for large, longitudinal datasets. Experience with data visualization is also essential, including graphical presentations of longitudinal data as well as experience working with and presenting spatial data.
We are also interested in linking administrative datasets. For example, linking jail rosters to voter registration data. For this, the ability to automate data cleaning processes is also highly encouraged. For example, designing algorithms to match individuals across multiple arrest records even when their name is misspelled in a subset of observations.
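A hedged sketch of the fuzzy record-linkage idea mentioned above, using only the Python standard library: treat two arrest records as the same person when their names are nearly identical (tolerating misspellings) and their dates of birth match. Field names and the similarity threshold are illustrative assumptions, and a production matcher would handle far more edge cases.

```python
from difflib import SequenceMatcher

def same_person(rec_a: dict, rec_b: dict, threshold: float = 0.9) -> bool:
    """Heuristic record-linkage check across arrest records: near-identical
    names plus an exact date-of-birth match."""
    name_sim = SequenceMatcher(None,
                               rec_a["name"].lower().strip(),
                               rec_b["name"].lower().strip()).ratio()
    return name_sim >= threshold and rec_a.get("dob") == rec_b.get("dob")
```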
Department: Civil Engineering & Engineering Mechanics
Project Overview: In recent years, Large Language Models (LLMs) such as GPT-3, GPT-4, and Llama 2, which are algorithms trained on extensive datasets, have exhibited exceptional zero-shot learning capabilities across numerous unlabelled tasks. Building on this notion, in-context learning involves conditioning LLMs on specific linguistic instructions or task demonstrations, subsequently enabling them to tackle analogous tasks through sequence predictions. In the field of Travel Mode Analysis, a significant volume of unlabeled data exists. Of particular interest are the unlabelled tweets generated by commuters, which offer insights into evolving travel patterns, especially in the context of events like a pandemic. By harnessing the strengths of LLMs and in-context learning, there is potential to extract valuable insights from this unlabelled data.
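A minimal sketch of the in-context learning idea for travel-mode analysis: assemble a few-shot prompt around an unlabeled tweet and send the resulting string to whichever LLM the project selects. The label set and the demonstration tweets are invented placeholders.

```python
def travel_mode_prompt(tweet: str) -> str:
    """Build a few-shot prompt for labeling an unlabeled tweet with a travel mode;
    the demonstrations below are placeholders, not project data."""
    return (
        "Label the travel mode in each tweet as one of: car, transit, bike, walk, other.\n\n"
        "Tweet: Stuck in traffic on the bridge again, 40 minutes and counting.\n"
        "Mode: car\n\n"
        "Tweet: The train skipped my stop twice this week.\n"
        "Mode: transit\n\n"
        f"Tweet: {tweet}\n"
        "Mode:"
    )
```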
Required Skills: Experience in coding in Python. Experience in NLP and PyTorch is preferred.
Project Overview: “Hate Speech” is a term used by peacebuilders, content moderators, policy-makers, and others, to label and categorize language, especially as it shows up in digital media. It is associated with inciting conflict and violence, and it may reflect the conditions of social relations among people across nations. Yet, while hate speech continues, so do other forms of speech that may reflect prosocial behaviors among people around the world as well. What are the properties of this “Peace Speech” that may lead to better outcomes and support continued and sustainable peaceful conditions in nations throughout the world?
Our interdisciplinary team in the Advanced Consortium on Cooperation, Conflict and Complexity (AC4) at the Columbia Climate School includes researchers in psychology, social psychology, environmental sustainability, natural resource governance, and applied anthropology. Together, our team is working to identify linguistic differences between peaceful and less peaceful societies, and the features of “Peace Speech” that may reflect and support social processes underlying sustainably peaceful conditions. Using three databases, we have already identified individual words that machine learning models use to best classify nations as lower or higher peace. See, for example, https://arxiv.org/abs/2305.12537. We now want to cluster those words into topics to identify which topics are most important in differentiating lower and higher peace countries, so that we can gain insight into the social processes that those topics represent.
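One hedged way to start the word-to-topic clustering described above: train word2vec embeddings on the project corpora with gensim and group the discriminative words with k-means. The embedding dimensions, the clustering method, and the topic count are illustrative assumptions.

```python
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

def cluster_peace_words(sentences, discriminative_words, n_topics=10):
    """Embed the words that best separate lower- and higher-peace countries and
    group them into candidate topics; hyperparameters are illustrative."""
    w2v = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=5, workers=4)
    words = [w for w in discriminative_words if w in w2v.wv]
    vectors = [w2v.wv[w] for w in words]
    labels = KMeans(n_clusters=n_topics, n_init=10, random_state=0).fit_predict(vectors)
    return {t: [w for w, l in zip(words, labels) if l == t] for t in range(n_topics)}
```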
Required Skills: Fluency in Python, natural language processing (cleaning text, NLTK, spaCy, Google’s BERT, Hugging Face XLNet), longitudinal analysis (time series), clustering analysis (k-means, word2vec, cosine similarity, ChatGPT), machine learning (logistic regression, random forest, XGBoost, support vector machines, neural networks, deep learning). The short-term goal is to identify the topics in news and social media that best classify lower and higher peace countries, topics such as governance, politics, international relations, work, everyday life activities, economics, arts, personal preferences, hobbies, etc. The longer-term goal is to use machine learning and AI to identify the social processes that underlie “Peace Speech”.
Department: International Research Institute for Climate and Society (IRI) and Department of Earth and Environmental Sciences (DEES)
Funding: This is a grant funded project. Exact amount of funding will depend on hours completed.
Project Overview: Sub-seasonal-to-seasonal (S2S; weeks to months) time-scale predictions have great potential societal benefits, such as early warnings of heavy rains, droughts, and heat waves. Reliable forecasts a few weeks ahead can provide invaluable tools for routine planning in the agriculture, water resources, public health, and humanitarian aid sectors. However, the skill of current S2S forecasts made using large physics-based climate model ensembles is limited, partly because model simulations depart from reality. Calibration of climate-model forecast probabilities is necessary to account for model deficiencies and produce reliable forecasts. State-of-the-art calibration methods in IRI’s climate predictability tools are linear, which is a limitation because the atmospheric flow is inherently non-linear, and model errors often grow exponentially.
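A hedged sketch of a non-linear calibration baseline, assuming ensemble-derived predictors (for example, ensemble mean and spread) and observed tercile categories are already assembled: fit a gradient-boosted classifier and read off calibrated category probabilities. The predictors, model, and tercile framing are illustrative assumptions, not IRI's tools.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def calibrate_tercile_forecasts(ensemble_features: np.ndarray, observed_category: np.ndarray):
    """Learn a non-linear mapping from ensemble-derived predictors to observed
    tercile categories (0 = below, 1 = near, 2 = above normal)."""
    model = GradientBoostingClassifier(random_state=0)
    model.fit(ensemble_features, observed_category)
    return model

# calibrated probabilities for new forecasts:
# probs = model.predict_proba(new_ensemble_features)
```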
Required Skills: Fluency in Python coding and libraries, Jupyter Notebooks, and use of GitHub repos. Experience with using various ML methods (e.g., Regression Trees, CNNs, deep learning) is required. Experience with large climate data and model output datasets would be an advantage, but is not required.
School: Medicine
Department: Pathology & Cell Biology
Project Overview: The brain is the most complex organ in the body, composed of billions of neurons and trillions of connections between those neurons. Those connections are known as synapses and have for many years been the subject of intense study. What is less clear, however, is how synapses are organized at a population level throughout the brain. To start to address this, we developed a method that analyzes individual synapses using spatial and intensity metrics and scaled this approach to analyze hundreds of thousands of synapses concurrently. By doing so, we found that synapses fall into previously unknown, but functionally relevant, subpopulations. The student project, which is a collaboration between two groups (the Au lab in Pathology and Cell Biology and the Menon lab in Neurology), will be to help identify synaptic subpopulations under various experimental conditions and to analyze their spatial arrangement in the brain. This will help to reveal functional submotifs in the cortex and glean novel insights into cortical circuit organization.
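A minimal sketch of one way to identify synaptic subpopulations from per-synapse spatial and intensity metrics, borrowing the neighbors-plus-Leiden workflow from scanpy (which the skills list below mentions); the feature matrix layout and clustering resolution are illustrative assumptions, and the Leiden step additionally requires the leidenalg package.

```python
import scanpy as sc
from anndata import AnnData

def synapse_subpopulations(features, resolution=1.0):
    """Cluster per-synapse spatial/intensity features with a neighbors + Leiden
    workflow; rows = synapses, columns = metrics (illustrative layout)."""
    adata = AnnData(features)
    sc.pp.scale(adata)                         # z-score each metric
    sc.pp.neighbors(adata, use_rep="X")        # k-NN graph on the scaled features
    sc.tl.leiden(adata, resolution=resolution) # community detection -> subpopulations
    return adata.obs["leiden"]
```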
Required Skills: Fluency in Python is a must. Experience with machine learning, PyTorch, and scanpy preferred. Experience with multidimensional image analysis is ideal.
International Students on F1 or J1 Student Visa: Not Eligible