DSI Scholars Projects

Fall 2025 Projects

The DSI Scholars Fall 2025 application is closed.

Eligibility

Please note that in order to be considered for Fall 2025 projects, students must be enrolled for the Fall 2025 semester. Students that are graduating in May or October are not eligible for the Fall application.

Applying to More than One Project

Students are welcome to apply for up to 5 projects per term. You must submit a separate application for each project. If you submit more than 5 applications, we will randomly select 5 of your applications for submission.

For more information about the program, including the program benefits, application process and timeline, please visit the DSI Scholars Student Information Page.

Faculty interested in participating in Fall 2025 are encouraged to review the DSI Scholars Faculty Information page for details.

Important Dates:

May 28, 2025: Student application opens
June 8, 2025 (11:59 PM ET): Student applications due
June 23 – July 3, 2025 (expected): Student interviews
July 18, 2025 (expected): Decision notification will be sent to all applicants by email
September – December 2025 : Get started on your Scholars research project (Approximate duration. Exact dates will depend on the faculty supervisor)

Current Projects

Fall 2025

School: Vagelos College of Physicians & Surgeons

Department: Psychiatry

Project Overview:

This project aims to develop an AI-driven framework that generates voxel-level amyloid-β (Aβ) and tau pathology maps directly from non-invasive T1-weighted (T1w) MRI, providing a cost-effective and accessible alternative to PET imaging. Using paired T1w MRI and PET scans from publicly available datasets (e.g., ADNI, OASIS-3), we will train an image-to-image translation model—built on a 3D encoder-decoder neural network—to synthesize PET-like maps of Aβ and tau pathology.

To enhance model performance and generalizability, we will integrate anatomical segmentations, AI-derived cerebral blood volume (AICBV) maps, and demographic and clinical variables through a multimodal, multi-task learning strategy. Transfer learning from large-scale resources like the UK Biobank and self-supervised tasks (e.g., age prediction) will support robust feature extraction.

Finally, we will refine and validate our models using Columbia’s unique postmortem-confirmed dataset, which includes premorbid T1w MRI paired with histopathological assessments of Aβ and tau. This provides precise ground-truth labels for voxel-level model evaluation. External validation will be conducted using data from the Alzheimer’s Disease Research Center (ADRC).

The end goal is to deliver a clinically deployable tool that transforms structural MRI into spatially detailed, continuous probability maps of AD pathology, offering earlier and more precise diagnostics without the need for invasive or expensive PET scans.

The DSI Scholar will play a key role in supporting data preparation, model development, and validation for our AI-driven framework to map amyloid-β (Aβ) and tau pathology from T1-weighted MRI.

Responsibilities:

1. Data Curation and Preprocessing:

• Organize and preprocess paired T1w MRI and PET datasets from ADNI and OASIS-3, including intensity normalization, co-registration, and skull stripping.

• Assist with extraction and quality control of anatomical segmentation (e.g., via FreeSurfer) and AICBV maps.

2. Model Implementation and Training:

• Help implement the baseline image-to-image translation model using a 3D encoder-decoder architecture in PyTorch or TensorFlow.

• Conduct experiments to compare different model architectures and loss functions.

• Monitor training metrics and manage model checkpointing and version control.

3. Multimodal Integration and Feature Engineering:

• Assist in incorporating demographic, clinical, and functional (AICBV) data into the model via multi-input or multi-task architectures.

• Contribute to the design and testing of self-supervised tasks (e.g., age prediction) to enrich feature representations.

4. Model Evaluation and Validation:

• Compute voxel-level and region-level performance metrics (e.g., Dice coefficient, correlation with PET uptake).

• Support cross-validation and generalizability tests using external datasets such as ADRC.

5. Documentation and Reporting:

• Maintain thorough documentation of all code, preprocessing pipelines, and results.

• Assist in preparing visualizations and summary figures for presentations, internal reports, and potential publications.

This role provides hands-on experience in deep learning for medical imaging, multimodal data fusion, and clinically relevant biomarker development, with opportunities to contribute to high-impact translational research.

CANDIDATE REQUIREMENTS

Required Skills:

The ideal DSI Scholar for this project should possess the following skills and experience:

1. Programming and Data Science:

– Proficiency in Python is required, especially with libraries such as NumPy, Pandas, and Matplotlib for data manipulation and visualization.

– Familiarity with PyTorch or TensorFlow/Keras for deep learning model development and training.

2. Medical Imaging Experience (Preferred):

– Experience working with neuroimaging data formats (e.g., NIfTI) using tools like NiBabel and Nilearn.

– Understanding of MRI and PET modalities, image registration, and basic preprocessing techniques (e.g., intensity normalization, skull stripping, co-registration).

3. Machine Learning and Deep Learning:

– Knowledge of convolutional neural networks (CNNs) and encoder-decoder architectures.

– Familiarity with multi-task learning and self-supervised learning approaches is a plus.

4. Data Management and Workflow:

– Experience with large datasets and version control using Git/GitHub.

– Ability to write clean, reproducible, and well-documented code.

5. Bonus Skills (Not Required but Helpful):

– Experience using FreeSurfer, FSL, or other neuroimaging pipelines.

– Knowledge of biomedical applications of AI, particularly Alzheimer’s disease or neurodegeneration research.

The scholar will receive mentorship and support from the PI and lab members, with opportunities to grow in any areas where they seek further experience.

Student Eligibility: Master’s, Senior, Junior

International Students on F1 or J1 Student Visa: Yes, eligible
School: Columbia Climate School

Department: Center for Climate Systems Research

Project Overview:

The sudden shutdown of the Famine Early Warning Systems Network (FEWS NET) in January 2025 has created a critical gap in global food security monitoring. While the Department of State may maintain some reduced functions, this project introduces an innovative fuzzy logic approach to fill the methodological gap through our FamineWatch initiative. Traditional systems use rigid thresholds to classify conditions into discrete categories (minimal to famine). However, food insecurity exists on a continuum with numerous intermediate states. Fuzzy logic—specifically designed to handle gradient conditions and partial truth—offers a compelling alternative that can detect subtle transitions before conventional systems trigger alarms.

Using Ethiopia as our case study, this project will develop a fuzzy logic framework integrating multiple data streams: climate data (rainfall, vegetation indices), economic indicators (food prices), and social/political factors. We’ll incorporate GDelt for national-level monitoring and ACLED for subnational conflict data, with Factiva news analysis to provide detailed subnational media coverage critical for capturing early signals at local levels. Our system will assign partial membership to different security states, enabling detection of gradual shifts toward crisis conditions. Ethiopia represents an ideal test case given the current famine risks and our established in-country contacts who can provide ground-level verification of model outputs.

This project builds on our existing initiative at faminewatch.columbia.edu, the first component of our envisioned Famine Watch Early Warning Research Coalition. I am in touch with colleagues running the FEWS NET Salvage Operation, which can complement and support our work by providing access to historical data and potential future integration pathways. All code, data, and documentation will be made open-source to facilitate adoption by both the Department of State’s remaining FEWS NET functions and the broader humanitarian community.

The DSI Scholar will implement this method using datasets from reliable sources, creating a well-documented, open-source pipeline with appropriate APIs for potential integration with existing systems. They will evaluate the model against historical food security outcomes in Ethiopia (2018-2022), demonstrating how fuzzy classifications could provide earlier warnings than conventional approaches while creating educational materials to facilitate adoption by the broader humanitarian community

The DSI Scholar will engage in these intellectually stimulating and clearly defined tasks:

• Fuzzy Logic Framework Design

• Data Collection and Preprocessing

• Fuzzy Logic Model Implementation

• Comparative Analysis

• Interactive Visualization

• Documentation and Community Enablement

CANDIDATE REQUIREMENTS

Required Skills:

– Strong Python programming (pandas, numpy, scikit-learn)

– Interest in fuzzy logic and mathematical modeling

– Experience with geospatial data analysis (rasterio, geopandas)

– Data visualization skills (matplotlib, seaborn, plotly)

– API data retrieval and processing (for GDelt/ACLED/Factiva integration)

– Web dashboard development (Streamlit)

– Git/GitHub for version control

Student Eligibility: Master’s, Senior, Junior

International Students on F1 or J1 Student Visa: Yes, eligible
School: Vagelos College of Physicians & Surgeons

Department: Radiology

Project Overview:

In this project, we will train (1) a vision transformer model and (2) an object detection model to identify fracture using the GRAZPEDWRI-DX pediatric wrist fracture dataset which provides pediatric wrist xrays and bounding boxes of fractures (https://pubmed.ncbi.nlm.nih.gov/35595759/). We will (1) evaluate the ability of each to detect fracture in images using binary classification metrics like sensitivity and specificity and (2) evaluate the ability of each model to localize fractures. The hypothesis of this project is that a well-executed vision transformer model will be able to localize fractures with performance comparable to an object-detection model. If this is the case, it would substantially reduce the burden of creating radiology AI models, as creating the bounding boxes needed to train typically-used object detection models is highly labor intensive; creating binary labels for images is much easier and is all that is required for vision transformer models.

The DSI scholar would train the model and run the analysis with guidance from the mentor. This would include: downloading and familiarizing themselves with the freely available dataset, mapping the provided bounding boxes into a simpler categorization scheme of fracture/no fracture and filtering out irrelevant categories, dividing data into train/validation/test partitions to separate patients across dataset splits, training and optimizing a standard object detection like YOLO (the dataset is already formatted for ultralytics YOLO implementation), training and optimizing a competitive vision transformer model, inferring regions considered suspicious for fracture by each model, evaluating binary image-level accuracy/sensitivity/specificity and region-detection metric such as IoU for fracture detection for each model.

CANDIDATE REQUIREMENTS

Required Skills:

– Strong experience in PyTorch

Preferred (but not required):

– Prior experience working with convolutional neural networks and image data. If lacking this experience, candidates should have substantial deep learning experience in other domains and a willingness to learn how to work with image data in PyTorch.

– Prior experience with object detection models or vision transformers. If lacking this experience, candidates should be comfortable familiarizing themselves with these models during the project.

– Prior experience with medical imaging is not required.

Student Eligibility: Master’s, Senior, Junior

International Students on F1 or J1 Student Visa: Yes, eligible
School: School of Nursing

Department: Nursing Scholarship & Research

Project Overview:

This project investigates real-world clinician engagement with artificial intelligence (AI) tools in hospital settings using rich, time-stamped data from the NIH-funded CONCERN clinical trial. The CONCERN model is a machine learning-based early warning system that predicts patient deterioration and displays risk levels to healthcare professionals in near real-time. While many AI models have shown strong performance in development, few have been evaluated during real-world clinical deployment.

Our study will analyze one year of data collected at Columbia University Irving Medical Center, including electronic health record (EHR) usage logs, clinical outcomes, and AI alert logs for over 16,000 patients. We aim to understand when and how clinicians—particularly nurses and prescribing providers—respond to AI alerts and what actions they take. Specifically, we will: (1) use time series analysis to assess temporal patterns of AI use and clinical action; (2) apply clustering methods to categorize response patterns and identify which patients benefit most from AI-supported care; and (3) use natural language processing and large language models to extract information about clinicians’ decision-making from clinical notes.

A key innovation lies in expanding the focus to include nurses and other healthcare practitioners, whose roles in care delivery and response to AI alerts are equally crucial but often overlooked. This broader, multidisciplinary lens recognizes the collaborative nature of healthcare and allows us to capture the full scope of how AI-driven recommendations influence decision-making and patient outcomes.

This interdisciplinary effort integrates techniques in machine learning, NLP, and human-centered design. The end goal is to identify actionable insights for optimizing AI integration into clinical workflows and to develop generalizable data science methods for studying AI use in healthcare. Findings will inform future AI deployments and bridge the “implementation valley of death” between AI development and real-world clinical impact.

Responsibilities:

The DSI Scholar will play an integral role in advancing the analytic components of this project under the close supervision of the PIs and study team. Their primary responsibilities will include:

• Data Aggregation and Cleaning: The Scholar will assist with merging multiple data sources, including AI usage logs from the CONCERN clinical trial and structured electronic health record (EHR) data. This includes data wrangling, creating time-indexed variables, and ensuring data quality and consistency across datasets.

• Time Series Analysis (Aim 1): The Scholar will support the construction of multivariate time series datasets for each patient encounter and contribute to calculating key time lags (e.g., between AI score viewing and clinical actions). They will use time series methods to explore temporal patterns and correlations between AI alerts and clinician behaviors.

• Clustering Analysis (Aim 2): The Scholar will assist with applying unsupervised machine learning techniques, such as hierarchical agglomerative clustering, to identify distinct patterns in healthcare professionals’ responses to AI alerts. They will help characterize clusters based on patient characteristics and response timing.

• Documentation and Visualization: The Scholar will document all code, data processing steps, and analysis workflows to ensure reproducibility. They will also contribute to generating visualizations (e.g., heatmaps, time series plots) to illustrate findings.

• Throughout the project, the Scholar will receive structured mentorship in data science methods, healthcare AI, and ethical data practices, and will participate in regular team meetings to review progress and receive feedback.

CANDIDATE REQUIREMENTS

Required Skills:

– Fluency in R is a hard requirement.

Candidates should also have experience with the following:

– Time series analysis

– Working with longitudinal or time-stamped data

– Machine learning-based clustering techniques, such as hierarchical agglomerative clustering (e.g., tidyverse, lubridate, tsibble, cluster)

– Familiarity with EHR data structures or healthcare data more broadly is a strong plus.

– NLP tools

Most importantly, the scholar should be detail-oriented, collaborative, and eager to engage in interdisciplinary research.

Student Eligibility: Master’s

International Students on F1 or J1 Student Visa: Yes, eligible
School: Vagelos College of Physicians & Surgeons

Department: Pediatrics

Project Overview:
- GOAL: To define the low end of normal blood pressures in hospitalized children in order to set thresholds across a variety of disease states.
- BACKGROUND: Children who are hospitalized have their blood pressure measured at a minimum frequency of every 4 hours, to a maximum of 20Hz via an invasive arterial catheter. Low blood pressure for age/sex (“”hypotension””) is a known predictor for adverse outcomes in a variety of illnesses ranging from septic shock, to traumatic brain injury, to cardiac arrest. However, standard values are based on norms established by measurements in healthy children in ambulatory pediatric offices. These “”standard”” blood pressure measurements are not representative of hospitalized children, particularly those with illness states that predispose to hypotension, and as such are likely inappropriate to use as a gold standard in this setting. Hypotension cutoffs are used to make split-second decisions in life threatening situations, such as whether to start medications to support cardiac function, whether to give blood transfusions, and whether to start temporary extracorporeal life support.
- DATA: Continuously sampled and sparse patient-level data, such as vital signs, laboratory values, ICD codes, and nursing flowsheet data.
- MODELS: Simple linear and logistic regression, penalized regression, and machine learning classifiers such as gradient boosted regression and XGboost. Models will be used to predict patient deterioration across a variety of disease states. Blood pressure will be the primary predictor of interest.
Responsibilities:

• Cleaning dataset.

• Normalizing or transforming relevant variables.

• Multiple imputation of missing data.

• Data exploration, visualization, and presentation to a study team.

• Tuning and training models.

• Presentation of interim and final results.

CANDIDATE REQUIREMENTS

Required Skills:

– Fluency in R or Python (Primary investigator is fluent in R and familiar with Python.)

– Experience building ML models using commonly available tools and packages.

– Data visualization, such as ggplot2 or pandas.

– One of:

– (1) Prior experience working with inpatient health record data

– (2) Prior experience with time series data

Student Eligibility: Master’s, Senior, Junior

International Students on F1 or J1 Student Visa: Yes, eligible
School: School of Engineering and Applied Sciences

Department: Chemical Engineering

Project Overview:

In materials science and chemistry, machine learning models often represent chemical compounds by translating the positions and chemical species of a collection of atoms into constant-length feature vectors that obey physical invariances or equivariances. Various hand-crafted methods for this purpose have been proposed; ours (Artrith et al., Phys. Rev. B 96, 014112, 2017; and Gharakhanyan et al., arXiv, 2023, https://doi.org/10.48550/arXiv.2310.05386) is particularly efficient in terms of computational and memory demands. In recent years, learned representations, e.g., via graph neural networks, have gained popularity and have been shown to achieve high accuracy without relying on chemical intuition. However, these methods are significantly more computationally demanding, and it is not well understood whether learned representations transfer as effectively between different learning tasks as hand-crafted representations, which are, by design, independent of the specific learning task.

In this project, the DSI Scholar will assess the accuracy of our hand-crafted representation and state-of-the-art learned representations on our atomic-structure dataset across different learning tasks (energy prediction, force prediction, spectral data analysis), while also evaluating the associated computational costs. This work will be closely integrated with other ongoing projects in which we use both types of representations for various materials science applications. As such, the outcome of this project will not only inform our future model selection but will also be published to benefit the broader scientific community.

The DSI Scholar will:

• Translate our featurization method from pure Python to PyTorch to facilitate a direct comparison with other methods;

• Apply different featurization methods to our atomic-structure data sets; and

• Evaluate the accuracy, transferability, and computational cost for energy learning and NMR prediction.

CANDIDATE REQUIREMENTS

Required Skills: A good working knowledge of Python and PyTorch is a requirement.

Student Eligibility: Master’s

International Students on F1 or J1 Student Visa: Yes, eligible
School: Vagelos College of Physicians & Surgeons

Department: Radiology

Project Overview:

Pediatric growth prediction remains largely guided by manual staging and anthropometry, which can miss subtle changes in physeal cartilage microstructure. Our project applies radiomics to knee MRI exams to extract quantitative features—shape, first‑order intensity, and texture—from the distal femoral and proximal tibial physes. We have assembled a retrospective cohort of ~50 children in late childhood to early adolescence with baseline knee MRI and follow‑up height measurements at 4, & 12 months. After standardizing images (bias‑field correction, isotropic resampling) and performing semi‑automated segmentation, we’ll use PyRadiomics to harvest hundreds of features, then reduce dimensionality via correlation filtering and univariable screening. For continuous growth velocity prediction, we’ll train a linear regression; for categorical “low/average/high” growth, a Random Forest classifier. Model performance (R², RMSE, AUC) will be assessed via cross‑validation and held‑out testing. Ultimately, we aim to validate a radiomic signature that augments clinical decision‑making—identifying children likely to have accelerated or delayed growth, informing timing of interventions and monitoring in pediatric endocrinology and orthopedics.

The DSI Scholar will:

• Programming: Proficiency in Python (data handling with pandas, NumPy; imaging with SimpleITK or NiBabel).

• Statistical Modeling: Familiarity with regression, classification, cross‑validation, and hyperparameter tuning.

• Data Management: Ability to query and merge imaging and clinical datasets.

CANDIDATE REQUIREMENTS

Required Skills:

– Programming: Proficiency in Python (data handling with pandas, NumPy; imaging with SimpleITK or NiBabel).

– Statistical Modeling: Familiarity with regression, classification, cross‑validation, and hyperparameter tuning.

– Data Management: Ability to query and merge imaging and clinical datasets.

Student Eligibility: Master’s, Senior, Junior, Sophomore, Freshman

International Students on F1 or J1 Student Visa: Yes, eligible
School: Vagelos College of Physicians & Surgeons

Department: Zuckerman Mind Brain Behavior Institute | SNF Center for Precision Psychiatry & Mental Health

Project Overview:

The nervous system gives rise to behavior, and behavior reflects pathological brain function. The recent emergence of machine learning tools has revolutionized behavioral analysis by enabling the accurate tracking of animal body parts over time (pose estimation). More recently developed approaches perform unsupervised clustering of behavior by segregating poses into brief postural sequences and revealing how these sequences are integrated and concatenated over time (behavioral analysis). In our laboratory, we apply such pipelines to study the structure and temporal organization of behavior in different mouse models of schizophrenia susceptibility. This approach allows us to link distinct behavioral pathologies to specific genetic etiologies of the disease and to use these patterns as translational biomarkers for assessing the therapeutic efficacy of novel treatment strategies within the framework of precision psychiatry.

As we investigate how genetic mutations lead to aberrant behavior, we record brain activity dynamics—either at the population level (using fiber photometry) or at cellular resolution (using miniaturized fluorescent mini-scopes)—in freely behaving mice. This enables the identification of brain activity and neural circuit signatures associated with high-dimensional behavioral alterations, providing critical insights into the neurobiological correlates of schizophrenia-relevant phenotypes.

Toward this goal, we work with structured, preprocessed datasets consisting of multivariate time series: high-dimensional behavioral data-frames derived from unsupervised clustering, temporally aligned with neural activity traces recorded from multiple brain regions during the same behavioral sessions. These curated datasets allow for powerful computational approaches to quantify the relationships between specific behavioral clusters and neural dynamics, thereby revealing pathological signatures associated with different schizophrenia mutations. By developing integrated analysis pipelines across behavioral and neural data, we aim to uncover how specific brain circuits drive pathological behavior in models of schizophrenia liability, and how these circuits can be modulated by pharmacological interventions.

The DSI Scholar will work with pre-processed datasets consisting of time-aligned, high-dimensional behavioral dataframes and curated neural activity files. Neural data will include either time series of bulk activity traces from multiple brain regions or individual cellular activity traces within a single region. Both behavioral and neural datasets will be fully curated and temporally aligned for each experimental subject (i.e., mouse recording). The DSI Scholar will be responsible for developing computational pipelines in Python to integrate and analyze these datasets.

Primary tasks will include:

• Quantifying and visualizing neural activity aligned to specific normal or pathological behavioral events (using timestamps of identified behavioral clusters to extract corresponding neural data).

• Performing cross-correlation analyses between signals from different brain regions to uncover behavior-linked functional coupling or decoupling.

• Identifying statistical differences across experimental groups (e.g., mutant vs. drug-treated mice).

The scholar is expected to deliver initial versions of analysis scripts or to optimize existing ones. Running the full analysis pipeline across all experimental datasets will not be required; however, students interested in deeper involvement will be fully supported in extending the project.

CANDIDATE REQUIREMENTS

Required Skills:

– The student is expected to have proficiency in Python and familiarity with fundamental data analysis libraries, including NumPy, SciPy, pandas, Matplotlib, and Seaborn, along with experience in code development, pipeline construction, and debugging.

– A solid understanding of basic statistics, including significance testing, is required. The student should also be comfortable with mathematical concepts commonly used in signal analysis, such as Fourier transforms, correlation analyses, and filtering techniques.

– Familiarity with basic machine learning methods—including linear regression, classification, and clustering—is expected. Experience with deep learning, particularly training and evaluating models on GPUs, is considered a plus.

– Importantly, the student should have a genuine interest in applying their computational skills to psychiatric neuroscience and be enthusiastic about contributing to a collaborative, interdisciplinary research environment.

Student Eligibility: Master’s, Senior, Junior

International Students on F1 or J1 Student Visa: Yes, eligible
School: Vagelos College of Physicians & Surgeons

Department: Biomedical Informatics

Project Overview:

Mental illness affects over 50 million U.S. adults each year and presents a growing public health challenge. Traditional diagnostic methods often rely on self-reporting during therapy sessions, overlooking critical behavioral signals in everyday life. This project aims to harness the power of passive sensing data (e.g., from smartphones and wearables) and large language models (LLMs) to improve early detection of mental health conditions such as depression and suicidal ideation.

Building on existing datasets—GLOBEM (focused on depression) and MAPS (focused on suicidal thoughts)—we will develop and refine LLM-enhanced machine learning models that predict mental health outcomes directly from longitudinal sensor data. We’ll explore innovative ways to encode and summarize behavioral data using generalizable embeddings and LLM-based reasoning, with a focus on improving both prediction accuracy and interpretability.

In parallel, we will begin prototyping a clinician-facing dashboard that provides interpretable summaries and allows for natural language interaction with the data and predictions. While the dashboard build-out will be limited for this semester, if we make fast progress, we may explore integrating early visualizations or basic LLM-generated summaries.

Students on this project will gain hands-on experience with time-series behavioral data, deep learning and LLMs (e.g., OpenAI, HuggingFace), and health-focused HCI design. This is an interdisciplinary opportunity to contribute to real-world mental health care innovation at the intersection of AI, behavioral sensing, and clinical decision support.

The DSI Scholar will:

• Preprocess and Analyze Passive Sensing Data

• Clean, transform, and extract meaningful features from time-series behavioral data (e.g., GPS, phone usage, screen time) collected in the GLOBEM and MAPS datasets.

• Develop and Fine-Tune LLM-Based Prediction Models

• Assist in training and evaluating models that predict mental health outcomes using LLMs and deep learning frameworks (e.g., PyTorch, HuggingFace Transformers).

• Design and Evaluate Embedding Strategies

• Explore encoding techniques to convert raw sensing data into formats suitable for LLMs, aiming to improve generalizability and model performance.

• Conduct Baseline Comparisons

• Benchmark LLM-based models against traditional machine learning or deep learning models to evaluate performance and interpretability.

• Prototype Dashboard Elements

• Contribute to early-stage design or development of the clinician-facing dashboard by creating static or interactive visualizations (e.g., using Streamlit, Dash, or Plotly).

• Literature Review and Weekly Meetings

• Stay up to date on relevant AI, mental health, and HCI research. Participate in weekly lab meetings to present progress and get feedback.

• Optional Stretch Goal

• If time allows, help integrate basic LLM-generated explanations or summaries of data patterns into the prototype dashboard.

CANDIDATE REQUIREMENTS

Required Skills:

– Fluency in Python and common AI/ML packages (e.g., Scikit-Learn, PyTorch).

– Experience in working with time-series data from mobile / wearable devices or something similar.

– Experience in playing with gen AI tools (e.g., OpenAI, Gemini).

Student Eligibility: Master’s, Senior, Junior

International Students on F1 or J1 Student Visa: Yes, eligible
School: Columbia Business School

Department: Economics Division

Project Overview:

This project will model global energy systems using PyPSA (Python for Power System Analysis)-Earth, an open-source general equilibrium model for simulating and optimizing global cross-sectoral energy networks. The purpose is to analyze pathways for transitioning to renewable energy by 2050, focusing on cost-effective and sustainable configurations. The scholar will contribute to developing and validating scenarios for renewable energy integration, grid expansion, and storage deployment at regional and global scales. We will use public and non-public data, including historical energy consumption, renewable resource potential (e.g., solar, wind), cost data, and grid infrastructure data from sources like ENTSO-E and IRENA. The project will investigate the effect of different learning curves on the deployment of different renewable energy technologies, ultimately providing insights into the trade-offs between carbon budget constraints and cost efficiency. Key outputs will include scenarios comparing fossil fuel phase-out strategies across regions, visualizations of energy mixes and cost estimates, and actionable insights for business leaders, policymakers, and stakeholders. The project aligns with global sustainability goals and requires handling large-scale, geospatial time-series data.

The DSI Scholar will:

• Data Preparation: Preprocess and clean global energy datasets, including demand profiles, renewable resource potentials, and grid infrastructure data, ensuring compatibility with PyPSA.

• Model Configuration: Customize PyPSA-Earth workflows to simulate specific regions or scenarios, assist in building and running PyPSA-Earth models under supervision.

• Optimization Runs: Execute simulations on research clusters using solvers like Gurobi, and analyze results for cost, emissions, energy balance, and technology deployment. Analyze model outputs to identify trends in energy scenarios.

• Visualization: Create (interactive) maps and time-series plots to communicate findings using Python libraries like Matplotlib or Seaborn.

• Documentation: Maintain clear records of methodology, document workflows, and contribute to open-source model improvements as well as academic papers and key insight pieces summarizing findings. The scholar will work closely with the faculty supervisor to iterate on model parameters and validate results.

CANDIDATE REQUIREMENTS

Required Skills: Fluency in Python, including standard packages.

Preferred (but not required):

– Familiarity with geospatial data handling (e.g., Geopandas) and large and complex datasets and types (e.g., Xarray).

– Familiarity with workflow management libraries like Snakemake and data handling on HPCs.

– Basic understanding of energy systems and linear optimization and familiarity with PyPSA or similar energy modeling tools is helpful but not required.

Student Eligibility: Master’s, Senior, Junior

International Students on F1 or J1 Student Visa: Yes, eligible

Past Projects

Spring 2025

School: Vagelos College of Physicians & Surgeons

Department: Psychiatry

Project Overview: Decoding behavioral signifiers for the brain states and decisions can have far reaching implications for understanding the neural basis for actions and identifying disease. We are using high resolution video recordings of mice as they navigate mazes but have access to very few pre-determined behavioral signifiers. Computer vision can be used to extract a variety of previously unreachable aspects of behavioral analysis, including animal pose estimation and distinguishable internal states. These descriptions allow for the identification and characterization of behavioral dynamics, which determine decision making. Applying such computational approaches to mice during exploration and in the context of behaviors that have been validated to measure choice and memory can reveal dimensions of behavior that predict or even determine psychological constructs like vigilance, arousal, and memory. We are also obtaining neural signal data, which can be aligned with the behavioral signifiers.

DSI scholars would use pose estimation analysis to evaluate behavioral signifiers for choice and memory and relate it to our real time concurrent measures of neural activity and transmitter release. The students would also have the opportunity to examine the effect of disease models known to impair performance on our tasks on any identified signifier.

CANDIDATE REQUIREMENTS

Required Skills: MATLAB, Python, familiarity with statistics

Student Eligibility: Master’s, Senior, Junior, Sophomore, Freshman

International Students on F1 or J1 Student Visa: Yes, eligible
School: Fu Foundation School of Engineering and Applied Science

Department: Earth and Environmental Engineering

Project Overview: The Western United States is facing intensifying regional droughts and escalating wildfire risks, both of which are projected to worsen under climate change. In response, cloud seeding has gained renewed interest as a potential tool for augmenting water supplies and mitigating wildfire risks. Currently, ten states and provinces in Western North America operate cloud seeding programs. However, the overall efficacy of cloud seeding remains contentious, largely due to the challenges of distinguishing its effects from natural meteorological variability and due to concerns about the effects of pollutants on human populations. Moreover, operational strategies for cloud seeding are hampered by limitations in our fundamental understanding of cloud microphysics and the difficulty of simulating these processes under realistic atmospheric conditions.

Since 1972, the Weather Modification Reporting Act has mandated the documentation of weather modification activities in the United States. This project aims to compile and analyze these historical records to provide a comprehensive overview of weather modification efforts over the past five decades. The study will utilize large language models (LLMs) to extract and synthesize key information from reports, creating a unique dataset that tracks the prevalence and context of weather modification technologies. This dataset will be cross-referenced with historical climate data to examine the meteorological conditions under which cloud seeding has been deployed, offering insights into its potential efficacy.

To further refine the analysis, an Invariant Causal Prediction Framework will be employed to identify consistent patterns in the use of weather modification technology in relation to climatic drivers. By integrating historical records, climate data, and causal inference methods, this project will provide a nuanced understanding of the role weather modification has played in managing water resources and mitigating climate risks in the Western United States.

DSI Scholar Responsibilities include:

1. Compile historical records on the usage of Weather Modification in the US since 1972.

2. Use LLM’s to synthesize information from 1026 past historical records of Weather Modification Usage to develop a data set of Weather Modification usage including the locations, dates, materials, and purpose of Weather Modification Activities.

3. Cross-reference locations and dates with historical climate data sets to understand the context under which Weather Modification has previously been used.

4. Investigate an Invariant Causal Prediction framework to identify consistent patterns in the use of Weather Modification Technology.

CANDIDATE REQUIREMENTS

Required Skills:

– Fluency in Python or R (preference for python)

– Experience with LLM’s

Student Eligibility: Master’s, Senior, Junior

International Students on F1 or J1 Student Visa: Yes, eligible
School: Fu Foundation School of Engineering and Applied Science

Department: Department of Industrial Engineering and Operations Research

Project Overview: The objective of this project is to develop a comprehensive framework for preemptively assessing the safety of self-driving cars in a new urban environment prior to their deployment. The project will start from database construction, by processing heterogeneous raw data from police reports in natural language to street views in satellite images. Based upon that, we will develop innovative transfer learning methods for the evaluation of existing driving algorithms, and construct a traffic simulator to analyze future algorithms. Leveraging counterfactual analysis, we aim to inform the regulatory decisions surrounding the introduction of self-driving cars. Additionally, we will explore post-entry safety assessment mechanisms for ongoing monitoring and improvement.

The DSI Scholar will:

– Integrate multiple datasets from different sources into a database for convenient query.

– Preprocess data using large language models and image recognition techniques.

– Build risk models to estimate the accident rate of autonomous driving vehicles in several environments.

CANDIDATE REQUIREMENTS

Required Skills: Proficiency in Python, especially its common machine learning libraries. Experience with natural language processing.

Student Eligibility: Master’s, Senior, Junior

International Students on F1 or J1 Student Visa: Yes, eligible
School: Columbia Business School

Department: Finance

Project Overview: This project aims to explore the implications of U.S. labor mobility on employers’ decisions regarding health insurance plans for their employees. According to Census data, employers cover over 86% of insurance plans for private employees in the U.S. These employer-based insurance plans are vital for the well-being of employees and their families, yet they also represent a significant operational cost for companies. As labor mobility and job turnover increase, employers may face reduced incentives to offer robust risk-sharing in health plans, as the benefits of investing in healthier workers can be easily lost to competitor firms through poaching. This dynamic is contributing to the rise in high-deductible health plans, which shift more cost and risk onto employees.

The project leverages a comprehensive regulatory index, which I have already collected and constructed, detailing how each state regulates and enforces non-compete agreements in labor contracts. This index will help us understand how changes in state regulations, which influence labor mobility, ultimately affect the health insurance coverage provided by employers.

The project utilizes raw data from IRS Form 5500s, which provides insights into firms’ insurance choice decisions, and a proprietary commercial insurance claim dataset covering more than 40 million enrollees in employer-based plans, offering granular data on individual enrollment and health expenditures. Our goal is to establish causal evidence that links changes in labor mobility, induced by state policies, to firms’ insurance supply decisions and individual medical utilization. To achieve this, we will employ statistical methods such as econometrics and causal inference. Additionally, the DSI Scholar will be tasked with applying Natural Language Processing (NLP) algorithms to process and analyze the raw Form 5500 data, ensuring that we extract meaningful insights to inform our study.

The DSI Scholar will play a crucial role in the successful execution of this project, focusing on data processing, analysis, and methodological application. Their responsibilities will include:

1. Pre-processing the IRS data

a) Utilize Natural Language Processing (NLP) techniques to extract and structure relevant information from raw IRS Form 5500 data. This will involve parsing text data to identify and categorize insurance plan details and related variables.

b) Link the cleaned Form 5500 data to external databases such as S&P Compustat

2. Analysis of Commercial Insurance Plan Data:

a) Identify the insurance plans utilized by individual enrollees in the commercial insurance claim dataset. Explore the features of these insurance plans, such as deductibles, co-pays, and coverage options, and investigate any possible plan switches among enrollees over time.

b) Analyze medical expenditures for inpatient and outpatient visits, identifying trends and patterns that may inform the broader study on labor mobility and insurance choices.

3. Data Analysis

a) Conduct exploratory data analysis to uncover initial patterns, trends, and insights within the datasets. The DSI Scholar will generate descriptive statistics and visualizations to provide a clear understanding of the data landscape.

b) Support the application of econometric models and causal inference techniques to assess the impact of labor mobility on employer health insurance decisions. This may include running regression analyses, propensity score matching, or instrumental variable approaches under supervision.

4. Documentation and Reporting

CANDIDATE REQUIREMENTS

Required Skills:

– Proficiency in Python for data processing and analysis with skills in Natural Language Processing (NLP) for text data extraction and analysis.

– Strong foundation in econometrics and causal inference techniques with experience with longitudinal data analysis. Proficiency in at least one of the following statistical software: Stata, R, and SAS.

– Effective communication skills for documenting processes and presenting results.

Student Eligibility: Master’s, Senior, Junior, Sophomore

International Students on F1 or J1 Student Visa: Yes, eligible
School: Vagelos College of Physicians & Surgeons

Department: Systems Biology

Project Overview: Background: We have recently developed ‘SCRuB’, a machine learning model that removes contamination from microbiome samples by analyzing datasets of collected DNA to infer their true microbial components (Austin et al., Nat Biotechnology 2023). We showed that this method, through a unique expectation maximization framework, improves the power of microbiome research, allowing for stronger clinical applications ranging from cancer to preterm birth. Despite SCRuB’s success, we know there is opportunity for more improvements.

Project: The aim of this project is to extend our existing SCRuB method by incorporating even more biological structures into its expectation maximization model. While the original method effectively incorporates microbiome compositions, the method would improve by developing statistical frameworks that would allow it to utilize other biological data commonly available in microbiome research.

This project will be conducted in three steps

1. designing the machine learning methodology that allows SCRuB to effectively use additional biological data points;

2. implementing the software using a programming language of your choice;

3. evaluating how your implementation can strengthen the power of a microbiome analysis.

All steps will involve close collaboration with members of the lab. Upon the successful culmination of the project, the student will be encouraged to publish their findings as a peer-reviewed manuscript as well as present at a scientific conference.

CANDIDATE REQUIREMENTS

Required Skills: Preferred backgrounds would include: Data Science; Machine Learning; Expectation Maximization; Python, R or other languages used for data science. Experience with biology is not a strict requirement.

Student Eligibility: Master’s, Senior, Junior

International Students on F1 or J1 Student Visa: Yes, eligible
School: CUIMC

Department: Psychiatry

Project Overview: Cognitive flexibility is an executive function that is necessary to flexibly adapt previously learned behaviors to changing environmental demands. This cognitive function enables an individual to “look at things differently” and to adapt to one’s environment, instead of engaging in perseverative thinking that can lead to rumination and mental rigidity. However, the neurobiological mechanisms underlying cognitive flexibility in healthy and disease-relevant conditions are largely unknown.

The goal of this project is to employ state-of-the-art machine learning analysis of mouse neuroimaging and behavior data to understand cellular and neural circuit mechanisms regulating cognitive flexibility. In our experiments, real-time single cell neural activity data was recorded with head-mounted miniature microscopes from a large population of neurons in freely moving mice while they were trained to perform a complex decision-making task. In this task, mice had to learn that a set of features (odor, texture, and location) was associated with a hidden food reward. Upon learning the initial feature-reward association over 30 trials, the reward predicting features were changed and the mice had to learn that a different set of features was now associated with reward. Using machine learning techniques, some of which were developed in our lab, we want to understand how neural representations of feature-reward associations emerge in the brain and how the dynamic evolution of these representations during trial-and-error experience impact decision-making behavior.

This exciting data science project utilizes highly innovative in vivo Ca2+ imaging data sets of neural activity from freely behaving mice with and without in vivo neural circuit manipulations, providing students the opportunity to apply computational analyses techniques to provide unprecedented insight into how the brain controls behavior.

The student will work closely with other Ph.D. students and postdocs who will provide hands-on training, and will be mentored by the PI through regular meetings. The main analyses techniques include Representation Similarity Analyses (RSA) to determine how the brain represents information by comparing the similarity of neural response patterns across different stimuli or conditions. Representation Evolution Analyses (REA) utilize support vector machines (SVMs) and linear classifiers to determine trial-based neural and behavioral response patterns, followed by cosine similarity analyses to determine changes in the neural coding axis over the course of learning and reversal learning. All analyses pipelines and scripts are available in the lab for the student to use.

The complexity of the data set will provide ample opportunities for the student to learn, develop, and apply different types of computational analyses. We have clear hypotheses for the student to test with our established analysis pipeline. In addition, the complexity of the data also provides opportunities for the student to develop their own new questions to ask from the data set. The data set is clearly defined so the student can get started on the analyses without delay. The student will acquire mentorship from a lab with leading experience in analyzing scientific data and a successful track record in supervising students.

CANDIDATE REQUIREMENTS

Required Skills:

Some experience with Python programming, especially with the sklearn library would be beneficial. Linear classifiers and representational similarity analysis (RSA) are the main tools we use, and we have a pipeline for a new analysis that was developed in our lab, which we named “Representational Evolution Analysis (REA)”. This analysis leverages support vector machines (SVMs) and linear classifiers to determine how neural representations dynamically evolve as a function of learning, and how they flexibly adapt during reversal learning/ cognitive flexibility. Training on this new analysis pipeline will be provided through hands-on training by the student mentor.

Student Eligibility: Master’s, Senior

International Students on F1 or J1 Student Visa: Yes, eligible
School: Arts & Sciences

Department: Department of Earth & Environmental Sciences

Project Overview: Climate and oceanographic observations provide us with a valuable view of a changing world, yet they are limited to little more than a single human lifespan. In order to consider these observations in the broader context of the passage of time, proxy data from a range of archives can be utilized. These data have provided powerful evidence of abrupt climate changes in the past, and implicate the important role of the ocean in these changes. Although proxy data are immensely useful and remain the only way to assess natural variability in the climate system, they are often scattered in space and discontinuous in time, presenting a barrier to their full utilization.

This project involves the compilation and visualization of climatic and oceanographic datasets from the last time the Earth was as warm as today. These data represent important characteristics and processes, including sea-surface temperature, continental and sea-ice, ocean currents, and deep ocean carbon storage, initially recorded in deep-sea sediments and subsequently analyzed in paleoclimate laboratories around the world. The project will involve compiling these data from online repositories and other sources, and then using interpolation schemes to generate a series of visualizations in the form of maps and cross sections during different intervals (“time-slices”) through the past warm interval that will render the existing information more accessible to climate scientists, oceanographers, policy-makers and the general public.

The individual visualizations will be useful as stand-alone time-slices through the progression of a warm climatic interval that was analogous to the modern, but without the intervention of human interactions. The sequence of visualizations may also be combined into video animations that portray the previous natural evolution of the ocean and climate during a warm interval that can be compared directly to ongoing changes.

With guidance from the PI, the DSI scholar will initially compile the data from online sources. They will then assemble them spatially and temporally in order to generate maps and oceanographic cross sections. These visualizations will require the development and application of interpolation schemes to turn the scattered data into continuous views that provide a state-of-the-art estimate of oceanographic and climatic conditions from each of ten intervals of time from the previous warm interval. This is likely the main and central accomplishment of the project, although additional steps may include generating animated visualizations with interpolations through time as well as space, and the comparison of maps and ocean sections to the modern equivalents in order to evaluate the anomalies associated with human influence on the climate system. Through the course of the project, the DSI scholar will have the opportunity to interact with other members of our research group, including undergraduate and graduate students, and will have the option to spend time at the Lamont-Doherty Earth Observatory campus.

CANDIDATE REQUIREMENTS

Required Skills: Fluency in Python or data analysis packages such as MatLab will be helpful, although not required. Similarly, experience with data mining techniques may be advantageous, but will not be necessary.

Student Eligibility: Master’s, Senior, Junior, Sophomore

International Students on F1 or J1 Student Visa: Yes, eligible
School: Vagelos College of Physicians & Surgeons

Department: Radiation Oncology

Project Overview: Radiotherapy is a cornerstone of cancer treatment that utilizes ionizing radiation to destroy malignant cells. By accurately delineating, or segmenting, tumor target and surrounding organ-at-risks (OARs), the treatment planning process will guide the treatment machine to deliver precise radiation dose to tumor while sparing healthy surrounding tissues to minimize side effects. Despite the advances of end-to-end deep learning models in automated medical image segmentation, due to the inherent challenges in cone beam computed tomography (CBCT) such as low soft tissue contrast and limited image quality, the current fully automated segmentation methods usually fail to consistently achieve satisfactory results for clinical use. As a result, their outputs may require significant manual adjustments, which has become a bottleneck in time-sensitive practices such as online adaptive radiotherapy (oART).

The main purpose of this DSI scholar project is to develop an artificial intelligence (AI)-driven interactive tool for image segmentation in adaptive radiotherapy using visual prompted foundation models and reinforcement learning. This tool will be developed through two special aims: 1) Development of a web-based interface: Create a user-friendly web interface that accepts user inputs – such as clicks, scribbles and bounding boxes – to guide the interactive segmentation process. The segmentation will be powered by visual prompt-based foundation models that are adapted for CT images. 2) Optimization of interactive contour refinement. Optimize the dynamic process of contour refinement through reinforcement learning, aiming to achieve desired segmentation with the fewest possible iterations.

We have identified the roadmap to expand our web-based automated image segmentation system to an interactive tool. Our desired end goal is that this interactive tool can significantly shorten oART treatment time. This will reduce the risk of patient movement during treatment, offering potentially more effective treatment options for cancer patients.

The DSI Scholar will:

1. Implement a web-based interface that takes user’s inputs (prompt) to guide interactive segmentation.

2. Finetune Segment Anything Model (SAM) 2 using LoRA for 3D CT images, and make it work on the system developed in 1.

3. Assist in investigating approaches to optimize the dynamic process of contour refinement with the initial results obtained from automated segmentation algorithms for efficient and effective contour refinement.

4. Present results to the group and prepare for potential publication or further development.

CANDIDATE REQUIREMENTS

Required Skills:

1. Familiarity with web programming using JavaScript/HTML/CSS and WebGL

2. Fluency in Python and PyTorch

3.Experience with medical image analysis using packages such as ITK and MONAI

4. Experience with reinforcement learning is desired

Student Eligibility: Master’s, Senior

International Students on F1 or J1 Student Visa: Yes, eligible
School: Columbia Climate School

Department: Lamont Doherty Earth Observatory

Project Overview: Phytoplankton are tiny photosynthetic organisms that live in the sunlit areas of oceans and freshwater bodies. They play a crucial role in converting CO2 dissolved in water into organic compounds that sustain nearly all marine life, while producing over half of the oxygen in our atmosphere. Due to their ability to fix CO2, phytoplankton are vital for understanding carbon sequestration, climate regulation, and supporting fisheries. With around 5,000 species, studying them is essential for monitoring the health of aquatic ecosystems and life on Earth.

Traditionally, microscopy has been used to study phytoplankton, but it is slow, costly, and labor-intensive. While newer imaging technologies have sped up this process, they still require manual handling and expert classification. At Lamont-Doherty Earth Observatory, we modified a commercially available imaging system to automate the imaging of particles and plankton in water samples. This system can continuously capture phytoplankton images while a ship is moving, allowing data collection across large areas and over time. In the last two years, we have field-tested this system, amassing millions of images from oceans, coastal areas, and rivers.

However, the slow manual classification process is still a challenge. Our goal is to overcome this by developing a Computer-Assisted Automated Phytoplankton Classification System (CAPCS) using advanced computer vision and deep learning techniques. This will enable rapid, accurate identification of phytoplankton species based on unique features, transforming data collection.

This innovation is critical for NASA’s hyperspectral ocean color sensors, like PACE, EMIT, GLIMR, which aim to detect major phytoplankton groups from space. Overcoming these challenges will revolutionize water quality, marine pollution, climate change, and fisheries sciences, meeting the growing demand for high-resolution data from both field and satellite observations.

DSI Scholar Responsibilities

1. Develop AI Models:

– Design and implement deep learning models for phytoplankton image classification.

– Apply computer vision techniques to improve accuracy and efficiency.

2. Data Management:

– Clean and preprocess large phytoplankton datasets.

– Use data augmentation to enhance model robustness.

3. Optimize Algorithms:

– Test and refine AI algorithms to address limitations and improve performance.

– Stay updated with advancements in AI and machine learning.

4. Collaborate Interdisciplinary:

– Work with Goes and other researchers to integrate AI with ecological and environmental sciences.

– Bridge computer science, statistics, and environmental science in research efforts.

5. Evaluate Models:

– Assess model performance through rigorous validation and cross-validation.

– Ensure accuracy and robustness of AI solutions.

6. Documentation and Reporting:

– Document methodologies and results thoroughly.

– Prepare reports and presentations indicating progress of the work

CANDIDATE REQUIREMENTS

Required Skills:

1. Machine Learning & Deep Learning:

– Proficiency in implementing machine learning algorithms, especially Convolutional Neural Networks (CNNs), and advanced deep learning methods like Recurrent Neural Networks (RNNs) and Transformers.

2. Computer Vision:

– Strong understanding of image processing, object detection, and segmentation for analyzing phytoplankton and microplastic images.

3. Feature Selection & Dimensionality Reduction:

– Knowledge of methods to manage and optimize high-dimensional data.

4. Statistical Analysis:

– Foundation in statistical methods, including spatial statistics, for robust data interpretation.

5. Programming Skills:

– Proficiency in Python or R

6. Model Evaluation & Optimization:

– Skills in evaluating and optimizing machine learning models for enhanced performance.

Student Eligibility: Master’s, Senior

International Students on F1 or J1 Student Visa: Yes, eligible
School: Columbia Climate School

Department: Lamont Doherty Earth Observatory

Funding Note: This is a grant funded project. Exact amount of funding will depend on hours completed.

Project Overview: Ocean color remote sensing has long been used to map phytoplankton functional types (PFTs) in the upper ocean, traditionally relying on the ratios of photosynthetic pigments chlorophyll-a and accessory (non-photosynthetic) pigments like chlorophyll-b and carotenoids. However, these methods often fall short in distinguishing complex PFT compositions due to overlapping pigment absorption peaks and the limited spectral resolution of traditional multi-spectral ocean color sensors.

The advent of hyperspectral remote sensing, notably through NASA’s PACE mission and the upcoming GLIMR mission , offers continuous spectral coverage from the ultraviolet to near-infrared wavelengths, significantly enhancing the ability to differentiate between various phytoplankton pigments. Hyperspectral data capture detailed spectral features, which are critical for accurate pigment identification and PFT classification that is important for fisheries, carbon sequestration and climate change studies.

Recent advancements incorporate Artificial Intelligence (AI) techniques such as Linear spectral unmixing, Independent Component Analysis, Gaussian Mixture Models, Finite Mixture of Skewed Components (FMSC), etc., to overcome limitations of traditional algorithms. Traditional methods decompose pigment absorption spectra into Gaussian components, but these often face challenges with overlapping absorption peaks and limited spectral resolution. The FMSC algorithm, however, encodes spectral shapes in a finite metric space, providing a more nuanced representation of spectral data and improving the accuracy of pigment retrieval.

This study will utilize HPLC pigment data obtained from the field and use hyperspectral ocean color data for:

1. Developing AI and other complex statistical methods to improve the accuracy of distinguishing between complex mixtures of pigments.

2. Use field pigment datasets to evaluate the performance of various algorithms against conventional spectral decomposition techniques.

3. Apply the algorithms developed to satellite data for improved global monitoring and analysis of PFTs from space.

The Scholar will:

1. Data Acquisition: Assist in obtaining and managing field hyperspectral optical, and HPLC pigment data from NASA SEABASS database for field ocean color data.

2. Data Cleaning and Preprocessing: Prepare hyperspectral datasets for analysis by removing noise and normalizing data.

3. Algorithm Development and Implementation: AI Algorithm Development: Implement and train AI models for pigment retrieval using hyperspectral data. This involves coding, testing, and optimizing machine learning algorithms.

4. Algorithm Integration, Analysis and Validation: Apply and refine various statistical methods to hyperspectral data sets for accurate pigment and PFT identification. Analyze spectral data to extract relevant features and validate the accuracy of the FMSC and AI algorithms: Compare the performance of various AI approach with traditional methods Gaussian Curve fitting methods

5. Data Interpretation and Reporting: Translate algorithmic pigment outputs into meaningful insights about phytoplankton communities and their spatial distributions.

6. Data Management and Documentation: Refine code and prepare detailed workflow for testing by other ocean color scientists. Prepare reports, be willing to give a presentation at a NASA meeting and be part of publications.

7. Application of the algorithms to Satellite fields of hyperspectral ocean color data from PACE to generate regional and global maps of PFTs.

CANDIDATE REQUIREMENTS

Required Skills:

– Fluency in R and or Python and experience in working with big data files in particular ncdf format files.

– Capable of querying databases, extracting and pairing of datasets for algorithm development and algorithm performance evaluation.

– Knowledge of the use of AI based statistical approaches for extracting pigment information from hyperspectral datasets.

– Using these algorithms for mapping phytoplankton functional types from satellite data.

Student Eligibility: Master’s, Senior

International Students on F1 or J1 Student Visa: Yes, eligible
School: Fu Foundation School of Engineering and Applied Science

Department: Computer Science

Project Overview: Existing quantum platforms, such as IBM’s Qiskit allow off-site users to access the platforms. Long-term, we are interested in using minor architecture discrepancies to identify the specific machine a computation has been performed on (much like fingerprinting, PuFs for classical systems). More specifically, current quantum hardware requires high levels of error correction techniques to maintain the states of computation. Our approach is pick simple computations, run them on the various machines and observe statistical differences in the syndromes used to indicate how to perform error correction. Currently, we are using support vector machines to perform the inference, but would like to consider alternative classification strategies.

CANDIDATE REQUIREMENTS

Required Skills: Familiarity with various ML methods (SVMs, Decision trees, neural nets, transformers) and/or familiarity with interfaces to systems/packages that can apply these methods to collect results.

Student Eligibility: Master’s, Senior, Junior

International Students on F1 or J1 Student Visa: Yes, eligible
School: Arts & Sciences

Department: Earth and Environmental Sciences

Project Overview: The ocean carbon sink accounts for roughly 25% of annual anthropogenic CO2 emissions. To understand past changes, diagnose ongoing changes, and to predict the future behavior of the ocean carbon sink, we must understand its spatial and temporal variability. To monitor this key climate service, in particular, air-sea CO2 fluxes across the globe are needed for us to monitor year to year changes. However, the ocean is poorly sampled, and the sparsity of measurement in space and time makes the estimates of such fluxes challenging. In the McKinley group, we have developed several Machine Learning (ML) techniques to reconstruct the ocean carbon field based on association to satellite-based full-field driver data. These machine learning algorithms interpolate sparse surface ocean pCO2 observations to global coverage.

Understanding the value of different data sources to these ML algorithms is an active area of ML research. The spatio-temporal nature of the observed data makes it difficult to understand the impact of specific observations on the performance of the ML estimation. This DSI Scholar will develop approaches to quantify the contribution of individual pCO2 observations to ML interpolation algorithms using Explainable ML methods.

More specifically, with the Data Shapley framework (Ghorbani and Zou, 2019), we plan to assign a specific value, or score, to each data point in the available database. We will also quantitatively evaluate how alternative sampling patterns would change algorithmic skill. To do this, we will use a multi-model, multi-ensemble ‘testbed’, as we have in a range of previous studies.

In Spring 2025, the DSI Scholar will begin by learning about methods and data needed for this project, such as the pCO2-Residual product (Bennington et al. 2022, JAMES, doi:10.1029/2021MS002960) and output from Earth System Models (ESMs) which are used for the testbed. They will review existing code and help develop improved workflows with a strong focus on data sharing and reproducibility. They will then work with us to begin to implement the Data Shapley for data valuation. The student will also contribute to analysis of the reconstructed ocean carbon field and be included in publications resulting from this work.

CANDIDATE REQUIREMENTS

Required Skills: Strong Python and ML skills are required – please discuss both in your application.

Student Eligibility: Master’s

International Students on F1 or J1 Student Visa: Yes, eligible
School: School of International and Public Affairs/Arts & Sciences

Department: Economics

Project Overview: We have a number of projects looking at the diffusion of legal ideas in the United States and Canada. One project involves learning the structure of citations and the diffusion of ideas in the US federal judiciary. We have the text of all Federal cases back to 1800, along with the network of citations cases make to each other. We want to look at “”breakthrough”” federal cases that replace all future citations to the things they cite, or which have embedding distance far from the things they cite, but close to the cases that cite them. Then we will use this to rank influential legal cases in US history, and we will ask our GPT-4o based summarizer to translate them into accessible language.

The other project uses an existing corpus of collective bargaining contracts in Canada, and the DSI scholar will scrape the universe of judicial labor arbitration cases. The idea is that the language and concepts articulated in judicial decisions will diffuse into the text of collective bargaining agreements, as lawyers coordinate the judicial language. We will look at embedding distances between contracts and arbitration opinions.

The DSI scholar will a) process the judicial opinions dataset and implement the two breakthrough measures, and b) scrape text data from the CANLII database. We will use s-bert to measure embeddings.

CANDIDATE REQUIREMENTS

Required Skills: Python, and specifically expertise with networks and embeddings would be helpful.

Student Eligibility: Master’s, Senior

International Students on F1 or J1 Student Visa: Yes, eligible
School: Climate School

Department: Seismology, Geology and Tectonophysics

Project Overview: Volcanology has transformed into a highly data-driven and computationally focused field. Numerous computational models have been developed to simulate various physical phenomena during volcanic eruptions. A critical component of forecasting the behavior of volcanic eruptions is assessing the probability of different outcomes and scenarios. For this, scientists implement probabilistic modeling approaches, which provide such assessments. The DSI scholar who will join this project will be tasked with creating workflows that perform simulations and generate hazard maps with quantitative probability assessment. The workflows will be created using Jupyter Notebooks and be based on a range of eruption simulation tools written in different languages. The workflows will utilize probabilistic tools such as Markov Chain Monte Carlo (MCMC) or the Ensemble Kalman Filter (EnKF). The objective of this project is to provide a comprehensive and flexible tool for members of the volcanology community to utilize. In the long term, this tool could be expanded to use in any geophysical process that requires uncertainty quantification.

CANDIDATE REQUIREMENTS
- Required Skills:
- – Fluency in Python is required.
- – Familiarity with statistical methods and notation is highly preferred.
- – We expect the scholar to be comfortable with reading academic papers or other high-level readings to familiarize themselves with the concepts.
- – Other scripting skills and programming language knowledge will be beneficial to work as well.
Student Eligibility: Master’s, Senior, Juniors only.

International Students on F1 or J1 Student Visa: Yes, eligible

Fall 2024

School: Vagelos College of Physicians and Surgeons

Department: Ophthalmology

Project Overview:

Motivated by the global prevalence of untreated vision impairment, we seek to address the need for more accurate and timely diagnosis for irreversible eye diseases. Traditionally, ophthalmologists rely on 2D optical coherence tomography (OCT) reports derived from raw 3D OCT data, but this approach can lead to errors, especially in cases of atypical ocular anatomy or imaging artifacts. Although raw 3D OCT data provides a more comprehensive view of the retina, its complexity and time-intensive analysis present significant hurdles to its practical use. This DSI Scholars project has two aims: (1) develop a deep learning (DL) model that transforms seamlessly between 2D OCT reports and 3D underlying OCT data, and (2) visualize this 3D data for ophthalmologists through augmented reality/virtual reality (AR/VR).

Aim 1: Working closely with ophthalmologists at CUIMC, we will acquire annotations within temporal regions of 3D OCT data that relate to anatomical features of importance in 2D OCT reports, providing 3D-to-2D mapping information to train our DL model. Architectures to be explored include 3D UNets, autoencoders, or 2D Vision Transformers applied on subsets of slices from 3D OCT volumes (employing self-attention and cross-attention across patches from multiple slices).

Aim 2: With our generative 2D-to-3D transformation model from Aim 1, we will design a method for clinicians to visualize synthesized 2D and 3D data simultaneously through AR and/or VR. This will involve integrating the model developed in Aim 1 into the Unity Engine for real-time processing of inputs and visual rendering for ophthalmologist users.

By implementing these aims, our 2D-to-3D AI transformation and AR/VR visualization system will empower ophthalmologists with comprehensive insights derived from 3D OCT data. By extracting features not accessible through traditional 2D analysis alone, our approach has the potential to assist in expediting care for those suffering from vision impairment.

CANDIDATE REQUIREMENTS

Required Skills:

– Fluency in Python and Pytorch

– Awareness of/willingness to probe literature related to 3D UNets, Vision Transformers, Autoencoders

– Familiarity with Unity Engine, Augmented Reality/Virtual Reality development and workflow

– Interest in ophthalmology, ability/interest to work collaboratively in a team of engineers and physicians

Student Eligibility: Master’s

International Students on F1 or J1 Student Visa: Yes, eligible
School: Zuckerman Mind Brain Behavior Institute

Department: Biology

Project Overview:

Naturalistic animal behavior is built from simpler behavioral modules that reflect the organization and function of the underlying neural circuits.

To understand the parental behavioral differences among Peromyscus mice, we track freely moving adult mice while they are retrieving pups that where removed from their nest. We use marker-less 3D pose-estimation software based on machine learning (SLEAP and DeepLabCut) and extract meaningful parameters including velocity, acceleration, turning, orientation of the adult to the pup, and distance between pup and the adult mouse. We already have a test dataset of ~200 individual pup retrieval sequences for which we extracted the kinematic parameters.

Next, we want to classify individual pup retrieval sequences based on these parameters using random forest or autoregressive hidden Markov model algorithms to identify behavioral modules that the adult animal is routinely performing during this task.

In the future, we want to use these trained models to predict behavioral modules on a new dataset for which we also recorded the mice’s brain activity.

CANDIDATE REQUIREMENTS

Required Skills: The ideal candidate would be comfortable with basic probability as well as multivariate calculus and linear algebra. They will have to implement models and algorithms in Python, so coding proficiency is important. No biological background is strictly needed as we will teach the candidate everything that is needed to successfully finish the project.

Student Eligibility: Master’s, Senior, Junior, Sophomore, Freshman

International Students on F1 or J1 Student Visa: Yes, eligible
School: Climate School

Department: International Research Institute for Climate and Society

Project Overview:

Sub-seasonal-to-seasonal (S2S; weeks to months) time-scale predictions have great potential societal benefits, such as early warnings of heavy rains, droughts, and heat waves. Reliable forecasts a few weeks ahead can provide invaluable tools for routine planning in the agriculture, water resources, public health, and humanitarian aid sectors. However, the skill of current S2S forecasts made using large physics-based climate model ensembles is limited, partly because model simulations depart from reality. Calibration of climate-model forecast probabilities is necessary to account for model deficiencies and produce reliable forecasts. State-of-the-art calibration methods in IRI’s climate predictability tools are linear, which is a limitation because the atmospheric flow is inherently non-linear, and model errors often grow exponentially.

The goal of this project is to develop machine learning/artificial intelligence (ML/AI) forecast tools that enable non-linear bias correction to meet the growing service demands on improved forecast products at S2S time scales. The intern will code and run test cases to compare the performance of different ML methods (e.g., Regression Trees, CNNs, deep learning) to improve Indian summer monsoon probabilistic forecast skill by bias-correcting/calibrating sets of S2S forecast ensembles from large physics-based climate models run at global climate forecasting centers (e.g., NCEP, ECMWF) and archived in IRI Data Library.

CANDIDATE REQUIREMENTS

Required Skills:

– Fluency in Python coding and libraries, Jupyter Notebooks, using GitHub repos.

– Experience with using various ML methods (e.g., Regression Trees, CNNs, deep learning) is required.

– Experience with climate data and model output would be an advantage, but not required.

Student Eligibility: Master’s, Senior, Junior, Sophomore

International Students on F1 or J1 Student Visa: Yes, eligible
School: Vagelos College of Physicians and Surgeons

Department: Radiology

Project Overview:

Mild traumatic brain injury (mTBI), also known as concussion, remains largely invisible on standard MRI images despite survivors of car accidents or falls or domestic dispute victims report neurological and cognitive symptoms. Some patients recover within months, but as many as 60% remain symptomatic at 6 months and 30% or more suffer for years and many for the rest of their lives. Lack of accepted and easily adopted clinical diagnostic tools severely limits identification of the subgroup with poor prognosis, development of interventions and treatment options as well as advancement in understanding the underlying mechanisms of the injury.

We use diffusion tensor imaging (DTI), a widely available quantitative MRI technique to detect and visualize subtle abnormalities in microstructure of white matter of the brain in these patients. Patient’s images are compared to those of healthy controls in voxel-by-voxel manner to localize areas of abnormality. Computationally, this is a very CPU-intense process, which requires tasks including image processing, image registration and robust statistics.

The main goal of this project is to develop an AI-based algorithm to speed up identification of localized regions of abnormalities. A large data repository, including processed images from healthy controls and mTBI patients, is available within the Translational Neuroimaging Laboratory at CUIMC to develop and train the algorithm and perform its testing. The project aims to achieve the following targets for key performance indicators of the new algorithm:
1. Speed: identify abnormality in less than 10 min on an AMD Threadripper 3.9GHz CPU with NVIDIA Quadro T600 video card.
2. Overall quality metric: true positive 99% ; true negative 99%.
3. Patient specific quality metric when abnormality is detected: Dice index against existing implementation is better than the existing implementation against itself shifted by 1mm in three cardinal directions and main diagonals for that patient.
CANDIDATE REQUIREMENTS

Required Skills: Experience working with various ML/AI models (e.g. RESNET, UNET, Inception, VGG), documentation/organization skills, familiarity with Linux

Student Eligibility: Master’s, Senior, Junior, Sophomore, Freshman

International Students on F1 or J1 Student Visa: Yes, eligible
School: Climate School

Department: Seismology, Geology and Tectonophysics

Project Overview:

Volcanology has transformed into a highly data-driven and computationally focused field. Numerous computational models have been developed to simulate various physical phenomena during volcanic eruptions. A critical component of forecasting the behavior of volcanic eruptions is assessing the probability of different outcomes and scenarios. For this, scientists implement probabilistic modeling approaches, which provide such assessments. The DSI scholar who will join this project will be tasked with creating workflows that perform simulations and generate hazard maps with quantitative probability assessment. The workflows will be created using Jupyter Notebooks and be based on a range of eruption simulation tools written in different languages. The workflows will utilize probabilistic tools such as Markov Chain Monte Carlo (MCMC) or the Ensemble Kalman Filter (EnKF). The objective of this project is to provide a comprehensive and flexible tool for members of the volcanology community to utilize. In the long term, this tool could be expanded to use in any geophysical process that requires uncertainty quantification.

CANDIDATE REQUIREMENTS

Required Skills:

– Fluency in Python is required.
– Familiarity with statistical methods and notation is highly preferred.
– We expect the scholar to be comfortable with reading academic papers or other high-level readings to familiarize themselves with the concepts.
– Other scripting skills and programming language knowledge will be beneficial to work as well.

Student Eligibility: Master’s, Senior, Junior, Sophomore

International Students on F1 or J1 Student Visa: Yes, eligible
School: School of Nursing

Department: Nursing

Project Overview:

The DSI Scholar will work on cleaning and analyzing data for two projects focused on the influence of daily discrimination on cardiovascular disease risk. The first project is our DSI Seed Grant, which was a 30-day daily diary study that investigated the impact of daily discrimination on sleep health in a sample of Black and Latinx LGBTQ+ Adults. Although we have 3 papers under review and 2 papers in progress from the DSI Seed Grant, we have additional data that needs to be analyzed. Specifically, we plan to use unsupervised machine learning to identify sleep phenotypes and their associations with daily discrimination. The next project, which was recently funded by the National Heart, Lung, and Blood Institute, is a 1-week daily diary study that investigates the influence of anticipated and vicarious discrimination on home blood pressure. The DSI Scholar will assist our team with maintaining our study database, completing ongoing data cleaning, tracking data collection, developing code for data analysis, and addressing any data concerns (as needed).

CANDIDATE REQUIREMENTS

Required Skills:

– The DSI Scholar should have fluency in R/Python and an interest in health disparities research.
– Familiarity with machine learning and multilevel modeling is preferred but not necessary.
– We have longitudinal actigraphy and daily diary data that the DSI Scholar will help analyze but prior experience is not needed.

Student Eligibility: Master’s

International Students on F1 or J1 Student Visa: Yes, eligible
School: Mailman School of Public Health

Department: Epidemiology

Project Overview:

The goal of this project is to predict climate related physiologic stress on participants in a study on overland migration and nutrition security. The project has two leads and mentors (M. Orjuela-Grimm and Robbie Parks (Environmental Health Sciences)). Tasks to fulfill study aims include working with time- referenced and georeferenced data from reported migration trajectories of 104 Latin American overland migrants during the summer and fall of 2023, and possibly summer 2024. The data needs to be matched to date specific climate variables (ambient air temperature (considering daily ranges), humidity) and geographically specific elevation, and then modeled to consider changes that may act as physiologic stressors, taking into consideration trajectory (departure point) and geographic challenges. Each migrant will have daily data points from 14 to 100 days. Data sources will include ERA5, ERA5-Land as well as other sources. Modeling strategies may include estimating heat stress with Wet Bulb Globe Temperature and Heat Index.

The data can potentially be combined / compared with indicators from the water insecurity experience scale collected from the same population.

The end goal is to create a method to approximate such stressors and model their potential impact on health related indicators in migrants in overland migration routes. Ultimately the data will be used to help inform health related service provision at migrant shelters in Mexico. The work would be expected to result in data that would serve for an abstract submission at the end of the fall semester, with subsequent poster presentations, and potential manuscript submission. The data is from a multinational pilot study funded by the Institute of Latin American Studies.”

CANDIDATE REQUIREMENTS

Required Skills:

– Skill sets include fluency in R, Python, GitHub, project management / Python, as well as an interest in geographic information, climate aspects, and a working knowledge of Spanish (fluency is an advantage) and of geography in Central America and Mexico.

Student Eligibility: Master’s, Senior

International Students on F1 or J1 Student Visa: Yes, eligible
School: Fu Foundation School of Engineering and Applied Science

Department: Biomedical Engineering

Project Overview:

Significance: Reward learning is a core cognitive function that allows humans and other animals to consistently make decisions that optimize behavior and enhance survival. Deficits in reward learning are common in neuropsychiatric disorders, impairing patients’ ability to efficiently interact with their environment. Despite extensive studies in rodents, the generalizability of reward learning mechanisms to humans remains poorly understood. This research aims to close this gap by identifying the computational principles that govern reward learning in the brain. Understanding these mechanisms is critical for identifying disruptions in reward-based processes associated with disorders, thereby improving biomarker identification and enhancing health analytics for better clinical outcomes. Furthermore, by decoding the biological principles of reward learning, this research could lead to the development of a new class of energy-efficient reinforcement learning (RL) models that employ cortical coding schemes.

Approach: This research project focuses on probing the neural codes and computational algorithms the brain uses to tackle dynamic reward-learning tasks. We are particularly interested in tasks where the optimal solution not only changes over time but is also influenced by the policies of other agents within the environment. To investigate this, we will create a simulated environment initially featuring a single agent, with a second agent introduced later. Each agent will be constructed as a biologically-plausible reinforcement learning model, each employing distinct, time-varying learning rules. By simulating a wide range of learning rule functions for each agent, we aim to elucidate biological reward-learning mechanisms at play both in individual scenarios and in more complex settings where an agent’s policy is affected by dynamic environmental changes, including the strategies adopted by other agents. Working alongside myself and our interdisciplinary team of computational and systems neuroscientists, a DSI scholar will play a critical role in developing this biological multi-agent RL platform and systematically reverse-engineering the agents dynamics.

CANDIDATE REQUIREMENTS

Required Skills:

– ML/AI, deep learning models (prior experience in reinforcement learning (RL) models is strongly preferred), advanced programming, foundational knowledge in linear algebra, calculus, and statistics.

Student Eligibility: Master’s, Senior

International Students on F1 or J1 Student Visa: Yes, eligible
School: CUIMC

Department: Psychiatry

Project Overview:

In our lab we record fluctuations in neurotransmitter levels in the brains of mice in real time while they undergo tests of learning and decision-making. This is accomplished by measuring fluorescent signals from genetically encoded optical biosensors using fiber photometry (Simpson et al., Neuron. 2024, PMID: 38103545). Our overall goal is to understand the neurobiological basis of behaviors disrupted in psychiatric and neurological disorders. The specific aim of this project is to determine how expecting effort influences choice. Effort-based decision making is variable across healthy individuals (from “work-shy” to “workaholic”). For many psychiatric patients an exaggerated weighting of anticipated effort results in debilitating apathy and amotivation.

We collected dopamine recordings from multiple brain regions simultaneously in mice performing effort-based decision tasks in our custom automated test chambers. The DSI Scholar will work on this dataset together with the PI and the lab members that designed, collected, and pre-processed the data. The scholar will use non-linear multiple regression to determine which task events and behavioral measures (including dichotomous and continuous variables) predict the dopamine signals in each brain region. Because some segments of the behavior are self-paced, we will use dynamic time-warping to align some events. Because different physiological processes modulate dopamine release on different timescales, we will also perform dynamic regression modeling by adding lags as explanatory variables.

Expected Outcomes:
- Identification of task contingencies and behavioral events that predict changes in dopamine signals across timescales.
- Quantification of relationships between dynamic dopamine signals across different brain regions.
- The information gained will inform future experiments (optogenetic manipulations of dopamine neuron activity) to test causality and direction of dopamine-behavior relationships.
- Write-up of the analysis for presentations and an original research article.
- The potential for adapting the developed regression models for other data sets (different behavioral paradigms, neurotransmitters, and brain regions).”
CANDIDATE REQUIREMENTS

Required Skills:

– Python, multiple regression models, time series analysis, documentation (e.g. jupyter notebooks), and data/code sharing platforms e.g. OSF, Github).

– Neuroscience background knowledge is a plus, but basic biology will suffice (all aspects of the data collection and biological relevance will be explained).

– An interest in psychology/psychiatry related research and a desire to work collaboratively with the research team (including undergrads, grad students, postdocs, and associate research scientists).

Student Eligibility: Master’s, Senior, Junior

International Students on F1 or J1 Student Visa: Yes, eligible

Summer 2024

School: Columbia Climate School

Department: Advanced Consortium on Cooperation, Conflict and Complexity (AC4)

Project Overview:

“Hate Speech” in on-line media can incite conflict and violence in real life. We study “”Peace Speech”” that leads to positive prosocial behaviors that support sustainable peaceful conditions in nations throughout the world. Our interdisciplinary team at the Advanced Consortium on Cooperation, Conflict and Complexity (AC4) at the Columbia Climate School, includes researchers in psychology, social psychology, environmental sustainability, natural resource governance, applied anthropology, and data science. We have already successfully used machine learning to identify the words in on-line news media that best classifies countries as lower or higher peace, published in 2023 in PLOS ONE, https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0292604.

In those and subsequent studies, we used logistic regression, random forest, XGBoost, SVM, BERT, and XLNet to analyze on-line data from both news media and social media. In this new work we will substantially extend those studies by using new and powerful methods from Artificial Intelligence (AI): 1) to further identify the linguistic differences between lower and higher peace societies, 2) to reveal the social processes that underly those linguistic differences, and 3) to create a real-time dashboard of the levels of peace and the processes that support them. To accomplish these tasks we will use pre-trained AI systems, such as ChatGTP, Claude, and Bard, as well as fine tuning those systems with additional data from studies of the social psychology of peace. Because of implicit and/or explicit bias in the data used to train those proprietary models and the “guardrails” that limit their responses, we may need to explore our own training of open source models such as Llama 2 from Meta and Mixtral from Mistral. This work will advance our scientific understanding of the social factors that enhance peace as well as provide valuable, practical insights for policy makers to support

CANDIDATE REQUIREMENTS

Required Skills: Fluency in Python, natural language processing (NLTK, spaCy, BERT, XLNet), longitudinal analysis (time series), machine learning (logistic regression, random forest, XGBoost, SVM, TensorFlow, PyTorch). The project will be centered on using AI, familiarity with models like openAI’s ChatGTP, Anthropic’s Claude, Google’s Bard, Meta’s Llama, Mistral’s Mixtral and tools like prompt engineering in AI chat models and fine tuning and training methods using langchain, vector databases such as Pinecone, models like davinci-002 and Ada, will be very helpful. The short term goals are to use AI to characterize the properties of “Peace Speech”, and to identify the social processes that they represent. The longer term goal is to create a user friendly dashboard to monitor the current levels of peace in societies for academic research and policy makers.

Student Eligibility: Master’s

International Students on F1 or J1 Student Visa: Yes, eligible
School: School of Social Work

Department: Social Work

Project Overview:

We are conducting a pilot study designed to assess the feasibility and potential promise of the large language model (LLM)-based artificial intelligence (AI) chatbot approach to assist current/future service providers working with LGBTQ+ populations in learning and utilizing the latest science-based knowledge about LGBTQ+ issues and intervention. During the June – August 2024 period, the project goals are to finalize and implement evaluation and benchmarking tools to assess the quality (i.e., validity/accuracy compared to the empirical scientific knowledge base) and utility of outputs from popular and promising existing LLM AI chatbots. The evaluation and benchmarking tools involve human- and machine-driven approaches. We seek a DSI-supported skilled student to assist and participate, particularly in the machine-driven evaluation/benchmarking, as well as in identifying prevalence and conditions that result in chatbot hallucinations. As interested and appropriate, the student could also assist with the wider study activities (e.g., human-driven evaluation/benchmarking, developing and/or training with an appropriate corpus, publication, and presentation of findings, and grant writing). We also anticipate areas where the student may contribute/develop their areas of interest/specialization outside of the current pilot study (e.g., methodological issues with experimental research with LLM AI chatbots). We note that there should be DSI Scholar appropriate work in the September – December 2024 timeframe as well.

The types of tasks that might be required to fulfill the study aims include:

– Implement machine-driven evaluation of the study’s selected chatbots

– Use appropriate statistical or data analysis tools

– Learn and contribute regarding data provenance and detailed records of research procedures

– Ensure all research activities comply with ethical, equity, and safety standards.

– Attend project team meetings and perform administrative activities as needed

– Contribute and/or lead presentations and publication of findings, implications, etc.

CANDIDATE REQUIREMENTS

Required Skills:

– Understanding of concepts and techniques used in LLM and LLM implementation.

– Skills in text processing, language modeling, and understanding the nuances of human language (natural language processing).

– Programming and Software Engineering: Proficiency in programming languages like Python and knowledge of software development practices and tools.

– Ability to work with large datasets, including data cleaning, analysis, and visualization.

– Designing robust and scalable systems to support machine learning applications.

– Knowledge of cloud services and distributed computing for training and deploying large language models.

– Skills in designing user-friendly interfaces and understanding user needs for applications like ChatGPT.

– Keeping up with the latest AI research and being able to implement or adapt new findings.

– Understanding of the ethical, equity, and safety implications of AI and developing systems responsibly.

– Working effectively in multidisciplinary teams and communicating complex concepts clearly.

– Writing and analytical skills.

Student Eligibility: Master’s, Senior, Junior

International Students on F1 or J1 Student Visa: Yes, eligible
School: Arts and Sciences / Climate School

Department: Earth and Environmental Science / LDEO

Project Overview:

The ocean significantly mitigates climate change by absorbing fossil fuel carbon from the atmosphere. Cumulatively, since preindustrial times, the ocean has absorbed 40% of emissions. Marine Carbon Dioxide Removal (mCDR) are proposed engineered efforts to supplement the ocean’s natural uptake of anthropogenic CO2 from the atmosphere. A major challenge for mCDR is to quantify the additional carbon removal from the atmosphere given the large natural background carbon sink. Better understanding of the natural air-sea CO2 fluxes at regional scales is therefore required before mCDR additionality can be quantified.

To understand past changes, diagnose ongoing changes, and to predict the future behavior of the ocean carbon sink, we must understand its spatial and temporal variability. However, the ocean is poorly sampled and so we cannot do this directly from in situ measurements. In the McKinley group, we have developed several data science techniques to reconstruct ocean carbon data based on association to satellite-based full-field driver data. With this project, we wish to determine how well current and future ocean carbon observations can constrain background air-sea CO2 fluxes in potential mCDR deployment regions.

In summer 2024, the DSI Scholar will begin by learning about methods and data needed for this project, such as the pCO2-Residual product (Bennington et al. 2022, JAMES, doi:10.1029/2021MS002960) and output from Earth System Models (ESMs). They will review existing code and help develop improved workflows with a strong focus on data sharing and reproducibility. We look forward to having their expertise to improve machine learning methods in order to produce pCO2 products at smaller scales, specifically in areas of potential mCDR deployment. The student will also contribute to analysis of the reconstructed ocean carbon data and be included in publications resulting from this work.

CANDIDATE REQUIREMENTS

Required Skills: Fluency in Python, experience with foundational ML

Student Eligibility: Master’s

International Students on F1 or J1 Student Visa: Yes, eligible
School: Mailman School of Public Health

Department: Epidemiology

Project Overview:

This highly innovative and significant Data Science Institute Seed Project application will use a machine learning informed natural language processing (NLP) approach to qualitatively identify patterns and reasons for engaging in opioid-related polysubstance use and narratives around overdose and HIV risk behaviors from publicly available discussion forums on Reddit, a popular social media platform, which provides a ready-made source of abundant, naturalistic, first-person narratives for understanding substance use behaviors and patterns. This work takes an interdisciplinary approach by integrating data science, substance use epidemiology, and public health to improve our understanding of polysubstance use patterns. We propose to use human-in-loop machine learning approach, specifically NLP method, to analyze the patterns from unstructured Reddit comments to automatically cluster large similar unstructured text data and unearth latent patterns of polysubstance use and qualitatively explore the trends, patterns, and themes. Data collection for this project will rely on a “human-in-loop” or “supervised” natural language approach with the following steps:

1. data retrieval from opioid-related subreddit of interests,

2. feed algorithm with key drug terms to develop polysubstance use topics,

3. use the algorithm developed topics to extract polysubstance relevant subset of data,

4. select a random sample of the data, and

5. conduct a rapid review of the sample.

We will follow steps two through five until the random sample consists of polysubstance use posts, overdose, and HIV related behaviors. Data will be analyzed using directed content analysis, using Latent Dirichlet Allocation (LDA) to infer latent substance use topics from the comments posted by redditors. Four focus groups ranging from four to eight participants will be recruited to ecologically validate the NLP findings and capture the lived experiences of people who engage in opioid-related polysubstance use among people who use drugs.

CANDIDATE REQUIREMENTS

Required Skills: Fluency in R/Python, methods for Natural Language Processing, Latent Dirichlet Allocation (LDA), sentiment analysis, supervised and unsupervised machine learning, predictive modeling, etc.

Student Eligibility: Master’s

International Students on F1 or J1 Student Visa: Yes, eligible
School: School of Engineering and Applied Science

Department: Civil Engineering and Engineering Mechanics

Project Overview:

The reliability of public charging infrastructure is paramount for the successful transition to road transportation electrification. Consumers need to perceive it as dependable to consider shifting to electric vehicles (EV) or avoid reverting back to internal combustion engines. To ensure a reliable charging infrastructure network, faulty or unusable chargers need to be swiftly identified and repaired.

While standard monitoring can detect several failures, such as those in software and the electrical system, other failures like broken connectors or physical impediments hindering drivers from successfully charging are not currently captured [1]. Addressing these issues often requires expensive physical monitoring or relies on customer reports. However, a shift from typical charging point utilization may indicate the potential presence of undetected faults.

This project aims to explore a variety of alternative unsupervised learning techniques for anomaly detection. The goal is to identify and predict anomalous EV charging point use in public charging points using publicly available charging transaction data. The project will also analyze the relationship between the occurrence of anomalies or their duration and the characteristics of the charging point, such as venue type, location, and pricing category. Additionally, normal utilization metrics will be examined to identify any patterns related to maintenance issues.

The project utilizes public nationwide data from the US Department of Energy (specifically the EV-WATTS datasets) as well as other datasets.

[1] Karanam, V., Tal, G. (2024) Enhancing Electric Vehicle Charger Reliability: Developing a Tool to Swiftly Detect Hidden Charger Faults, Poster Presentation, 2024 TRB Annual Meeting.

CANDIDATE REQUIREMENTS

Required Skills: The student will be proficient in python programming and time series data modeling. LSTM autoencoders models are amongst the time series anomaly detection techniques in time series that will be tested.

Student Eligibility: Master’s, Senior

International Students on F1 or J1 Student Visa: Yes, eligible
School: Vagelos College of Physicians & Surgeons

Department: Emergency Medicine

Project Overview:

In febrile infants younger than 30 days, lumbar puncture (LP) is a procedure routinely performed to evaluate for meningitis. LPs are mainly performed in the emergency setting by clinicians and trainees. However, novice success rates are historically poor with over 60% failure rates that can lead to diagnostic uncertainty, prolonged pain, and unnecessary resource utilization. Reduction of unsuccessful and traumatic LPs in infants can improve diagnostic ability and reduce patient harm. Ultrasound performed at the point-of-care has the potential to increase LP success rates through improved visualization of the anatomy, however it is dependent on the skill of the operator to interpret findings accurately thereby limiting it’s efficacy in the population of providers that most needs it.

The main purpose of this project is to use a pre-existing ultrasound database of ultrasound spinal anatomy videos to develop an artificially intelligent algorithm that can identify the important anatomic structures for planning an infant lumbar puncture procedure.

We have already successfully designed a binary classification system using a limited dataset. Our next step is to work on object localization to help identify specific anatomic features of interest.

The specific aim is to design an object localizer for specific spinal anatomy using a corpus of ultrasound data and test accuracy of algorithmic feature recognition against expert labels in a hold-out set. Our secondary aim is to deploy the algorithm on a website or tablet to test real-time processing of ultrasound data.

To fulfill this aim, the team will need to achieve the following tasks:

1. Assist with object-level annotation of features

2. Use machine learning to develop intelligent algorithm for automated feature recognition

3. Test algorithm accuracy against expert gold standard

4. Deploy algorithm on website or local tablet to test real-time processing of data

We have a labelled data-set of 1515 frames with binary classification of anatomic features and an augmented dataset of 11224 frames.

Our desired end goal is a functional algorithm that can identify key features on spinal anatomy on ultrasound at a threshold of >95% accuracy.

CANDIDATE REQUIREMENTS

Required Skills: Experience working with various ML/AI models (e.g. RESNET, ALEXNET, VGG), documentation/organization skills (e.g. jupyter notebook, github), html (optional for parsing real-time algorithm).

Student Eligibility: Master’s

International Students on F1 or J1 Student Visa: Yes, eligible
School: Columbia College

Department: Latin American and Iberian Cultures

Project Overview: This project forms part of my book manuscript, Sorcery and the City in Post-Slavery Brazil. My project analyzes 135 witch trials that occurred during the first half of the Twentieth-century in Brazil to better understand why colonial anti-witchcraft made a comeback during the first decades of abolition and the first Brazilian Republic. My thesis is that witchcraft accusations were a means to uphold spatial and social divides and segregate cities like Rio de Janeiro, without the need to create racial segregation in written law. Witch hunts allowed the police to uphold state ideologies of racial and class divisions.

The main type of data I have collected are street addresses of where accused witches lived in Rio de Janeiro during the period from 1881-1942. I would like to work with a data science assistant to help me map these addresses onto old and contemporary maps of Rio de Janeiro to do a spatial-historical analysis to determine if these witch hunts did indeed reinforce spatial divides and segregation.

CANDIDATE REQUIREMENTS

Required Skills: The research assistant should have cartography/mapping skills to visualize geographical space (Rio de Janeiro city, state, and neighborhoods). RA should be able to work with historical maps, create maps, and use contemporary maps (google maps).

Student Eligibility: Master’s

International Students on F1 or J1 Student Visa: Yes, eligible
School: Graduate School of Arts and Science

Department: Ecology, Evolution and Environmental Biology

Project Overview: The Urban Wildlife Information Network (UWIN) was created by the Urban Wildlife Institute at the Lincoln Park Zoo as an alliance of urban wildlife scientists committed to conducting research to enhance our knowledge of urban wildlife and their relationships with people. While the UWIN project spans multiple universities and other stakeholders across the world, within NYC alone we at the Eco-Epidemiology Lab at Columbia University have a transect of nearly 50 wildlife cameras placed in parks and greenspaces along an urbanization gradient from Brooklyn to the furthest reaches of Nassau County. Our intent is to measure the effects of human occupancy and degrees of urbanization on wildlife and disease vectors-species richness and abundance.

A study of this scale comes with an ever-increasing amount of data, and in our case, this data comes in the form of hundreds of thousands of images of NYC’s local wildlife! While processing this information is traditionally done by staff, students, and volunteers pouring through these images and identifying the number and species of wildlife in each image, we are modernizing our approach with machine-learning AI technologies (such as Megadetector) to automatically detect and identify the species and quantity of wildlife present in these images, then attach this information to the image’s metadata and upload it the larger inter-city UWIN database. While we plan for this project to continue for many years, we are looking for students now to help create and implement a machine-learning model to identify and catalog our current and future sets of raw images by training said model on our 200,000+ already manually processed images as well as developing a pipeline to automate the processing of re-training of the model on future sets of images.

CANDIDATE REQUIREMENTS

Required Skills: Some Python coding experience is required, and anything beyond is a plus. Previous experience with machine-learning and/or image analysis is preferred, but not necessary. No previous knowledge of wildlife identification or ecological principles is needed, but an interest in the natural sciences and local wildlife is highly encouraged.

Student Eligibility: Master’s, Senior, Junior, Sophomore, Freshman

International Students on F1 or J1 Student Visa: Yes, eligible
School: Vagelos College of Physicians and Surgeons

Department: Pediatrics

Project Overview:

We are looking for a student who will join our studies on the impact of the prenatal environment on brain development. We have developed and are studying a unique mouse model for placental dysfunction that has autism-like behaviors, particularly in male offspring (Vacher et al., Nat Neurosci, 2021). We have RNA sequencing data from multiple brain regions from both mice that had placental insufficiency and matched controls across development. We have examined some of these data sets already but we now aim to analyze the RNA sequencing data specifically from the hippocampus, a critical brain region involved in memory and mood regulation.

The student will utilize bioinformatics tools to analyze RNA sequencing data from the mouse hippocampus. They will identify genes and pathways that are differentially expressed and associated with placental dysfunction and autism. This analysis will be conducted at different developmental stages to identify any deviations in its developmental trajectory of the hippocampus in our autistic model compared to neurotypical brains. The project will also investigate the influence of biological sex, a significant factor in autism. Furthermore, the student will perform statistical analyses to determine the significance of the findings, taking into account variables such as genotype, sex, and age.

Expected Outcomes:

– Identification of differentially expressed genes and pathways associated with placental dysfunction and autism in the hippocampus

– Insights into the molecular mechanisms underlying the link between placental dysfunction and autism

– Contribution to scientific knowledge through research publications and presentations

CANDIDATE REQUIREMENTS

Required Skills: The student should be proficient in R. Familiarity with R packages for RNAseq analysis such as DESeq2, ggplot, and GSEA, as well as visual presentation of sequencing data is a plus. Interest in developmental biology, neuroscience or medicine would be advantageous.

Student Eligibility: Master’s

International Students on F1 or J1 Student Visa: Yes, eligible
School: Vagelos College of Physicians and Surgeons

Department: Zuckerman Mind Brain Behavior Institute | SNF Center for Precision Psychiatry & Mental Health

Project Overview:

Nervous system gives rise to behavior and behavior reflects pathological brain function. Understanding the pathophysiology underlying mental disorders and providing innovative therapeutic avenues requires the detailed study of symptomatology in experimental animal models. During the last decade, pose estimation approaches are revolutionizing animal tracking. Researchers from Columbia University have developed a state-of-the-art machine learning package, namely the Lightning Pose (LP), that tracks freely-moving animals’ pose, enabling to study behavior with unprecedented accuracy. This package provides the 3D coordinates of behaving mice body parts that can then be subjected to various sophisticated analyses of behavior, including its dissection into regressive modules, and the analysis of their transition probabilities through sequences of behavior.

Present project aims to analyze the LP-generated mouse pose data, using available (Keypoint-Moseq, VAME), currently developing (Lightning) and custom-made machine learning or mathematical- and statistical- modeling analysis pipelines. This will allow us to gain novel insights on the effects of rare mutations that are considered to be the strongest etiological factors of schizophrenia currently identified, on behavior and associate them with disease symptomatology. Additionally, this will allow us to configurate a high throughput working pipeline to assess the effects of conventional, and innovative experimental therapeutic approaches in the framework of precision psychiatry.

The student will be working with csv files containing multivariate time series of x,y coordinates for a set of mouse body parts extracted from video data via previously existing algorithms (LP). In this context, the student will use python to apply mathematical and statistical tools under the guidance of the supervisor in order to:

– Detect repetitive patterns in the mouse pose that lead to the identification of behavioral modules/motifs.

– Assess transition probability across these patterns.

– Highlight the differences among different experimental groups (i.e. mutant or drug-treated mice).

CANDIDATE REQUIREMENTS

Required Skills:

– The student is expected to have fluency in python (numpy, scipy, pandas, matplotlib, seaborn) with experience in code writing, pipeline building and debugging.

– Basic statistics, including an understanding of significance testing.

– Basic machine learning (linear regression and classification, clustering).

– Experience with deep learning is a plus (training and evaluating models on GPUs).

– Experience modeling time series is a plus (RNNs, NLP/text analyses, HMMs, Kalman Filters).

– Importantly, the student should be interested in applying their skills to psychiatric neuroscience, and to actively participate in a collaborative working environment.

Student Eligibility: Master’s, Senior

International Students on F1 or J1 Student Visa: Yes, eligible
School: Columbia University Irving Medical Center

Department: Department of Biomedical Informatics

Funding Note: This is a grant funded project. Exact amount of funding will depend on hours completed.

Project Overview:

Electronic health records (EHR) provide a population-scale resource to improve the diagnoses of rare diseases, which go unrecognized by most providers due to lack of familiarity. This project aims to leverage cutting-edge biomedical informatics and data science methodology to develop, validate, and demonstrate the clinical utility of an EHR-driven approach for rare diseases clinical decision support systems. Support for diagnosis of rare diseases will enable patients and providers to move efficiently beyond diagnoses to treatments and support for their condition. The types of tasks include training early diagnostic models using EHR data, optimizing the model to overcome any potential bias across different genetic ancestries, and developing visualization tools to provide an explanatory dashboard for clinical decision support. In this project, our aim is to develop a methodology to efficiently identify potential rare disease candidates from large EHR pools. The identified dataset will subsequently undergo manual review and labeling to serve as a training dataset for other supervised learning tasks. The end goal includes manuscript submission and a reproducible pipeline that can be generalized to other external institutions.

CANDIDATE REQUIREMENTS

Required Skills:

– Proficiency in programming languages such as R and Python is essential. Familiar with packages such as pandas or dplyr. The student should be able to write clean, efficient code to extract insights from data. The student with experience in working with diverse datasets, including longitudinal data, and structured/unstructured sources, is highly valuable. The student should possess the ability to clean, preprocess, and integrate data effectively. Skills in data visualization tools and libraries (e.g., Matplotlib, Seaborn, ggplot2) are a big plus.

– A strong foundation in machine learning and statistical analysis is necessary for building predictive models, conducting hypothesis testing, and extracting meaningful patterns from data.

– Skills with front-end app development (React) or experience with Javascript will be a big plus

– Experience with natural language processing and large language model will be a big plus

Student Eligibility: Master’s, Senior

International Students on F1 or J1 Student Visa: Yes, eligible

Spring 2024

School: School of Nursing

Department: School of Nursing

Average Hours per Week: Approximately 10

Stipend Amount: $3,000

Project Overview: The DSI Scholar will work on cleaning and analyzing data for two projects focused on the influence of daily discrimination on cardiovascular disease risk. The first project is our DSI Seed Grant, which was a 30-day daily diary study that investigated the impact of daily discrimination on sleep health in a sample of Black and Latinx LGBTQ+ Adults. Specifically, we plan to use unsupervised machine learning to identify sleep phenotypes and their associations with daily discrimination. The next project, which was recently funded by the National Heart, Lung, and Blood Institute, is a 1-week daily diary study that investigates the influence of anticipated and vicarious discrimination on home blood pressure. The DSI Scholar will assist our team with developing and maintaining our study database, completing ongoing data cleaning, tracking data collection, developing code for data analysis, and addressing any data concerns (as needed).

CANDIDATE REQUIREMENTS

Required Skills: The DSI Scholar should have fluency in R/Python and an interest in health disparities research. Familiarity with machine learning and multilevel modeling is preferred but not necessary. We have longitudinal sensor and daily diary data that the DSI Scholar will help analyze but prior experience is not needed.

Student Eligibility: Master’s

International Students on F1 or J1 Student Visa: Yes, eligible
School: Climate School

Department: Lamont-Doherty Earth Observatory

Average Hours per Week: Approximately 10

Stipend Amount: $3000

Project Overview: Our group has recently developed a new method for computing the flow of light through the earth’s atmosphere (https://doi.org/10.1029/2023MS003819) – as task that’s key to climate projections and weather forecasting. The method relies on data-driven optimization: one defines a set of states over which to optimize, makes detailed, computationally expensive reference calculations based on those states, then identifies a very small optimal subset of the reference calculations that can be used as a proxy for the fully detailed calculations. The method is appealing in part because it’s flexible – it can be applied to arbitrary conditions with arbitrary cost functions for optimization.

We’d like to make it easier for people to use this idea for their own purposes, starting with using the tools ourselves to do a more complete and complicated version of the idealized problem we first took on. One task will be taking the original set of (clean, modular!) Python scripts and Jupyter notebooks and developing these into a fully general Python package that can be distributed via PyPi and Conda for wider use. During the course of this development we’ll apply the tools to the complete range of greenhouse gasses in the atmosphere, which may require identifying or developing smarter ways of allowing many small contributors to vary at once.

If successful the project stands to have an immediate impact – the group has collaborators at both weather forecasting and climate modeling centers who are interested in using a mature version of this technique.

CANDIDATE REQUIREMENTS

Required Skills: The project requires fluency in scientific Python, the ability to refactor code from scripts into Python modules, and the willingness to develop automated testing, packaging, and distribution. Ability and willingness to discuss the underlying physical science would be an advantage.

Student Eligibility: Master’s, Senior

International Students on F1 or J1 Student Visa: Yes, eligible
School: Vagelos Physicians & Surgeons

Department: Medicine

Average Hours per Week: Approximately 10

Stipend Amount: $3,000

Project Overview: Diagnostic errors affect up to 12 million adults per year and result in serious harm or death. Incorrectly ordered imaging tests are a major cause of missed diagnoses; however, little is known about why these errors occur. Current methods measuring imaging order errors are limited by reporting bias and the need for chart review. To address these gaps, I propose applying an innovative, systematic approach, the Retract-and-Reorder (RAR) method, to develop automated measures to identify imaging order errors. Electronic health record data (EHR) will be queried to identify imaging RAR events, defined as imaging orders placed, retracted, and subsequently reordered for the same patient with an element of the order changed. We aim use the RAR method to detect imaging order errors with a high accuracy. I aim to develop the first automated wrong-imaging order error measures to 1) examine the epidemiology of imaging order errors in a large healthcare system and 2) provide reliable outcome data for studies to trial system-level interventions to reduce these types of errors, to improve diagnostic safety and accuracy. Specific tasks will include working with a preexisting relational database in a server from the department of biomedical informatics. This database will have robust EHR clinical and log data. From this database will use data-driven methods to design the queries for the measures to identify diagnostic imaging order errors. We will use quantitative and qualitative analyses in a mixed-methods research approach to inform query specifications to identify these types of errors with high accuracy.

CANDIDATE REQUIREMENTS

Required Skills: Fluency in SQL Server Management Studio is preferred, but not necessary. Fluency in SQL, Python, or R is also preferred, but not necessary.

Student Eligibility: Master’s

International Students on F1 or J1 Student Visa: Not eligible
School: Climate School

Department: Center for International Earth Science Information Network (CIESIN)

Average Hours per Week: Approximately 10

Stipend Amount: $3,000

Project Overview: CIESIN is interested in identifying open plastic dumps that are potentially vulnerable to climate change. The health and environmental risks as well as social justice issues posed by open plastic dumps can be compounded by climate change events.

A DSI Scholar would provide coding and other technical support within the context of this global plastics project through two parallel work streams.The first is to extract values for land use disturbance, flooding, changes in rainfall and temperature extremes, and demographic information from large datasets and assign these to the polygons delineating plastic dumps boundaries over time. The resulting dataset will be explored to identify the climate risks associated with individual open dumps and the populations that could be impacted. The expected platform to be used is Google Earth Engine and coding in python or Java. The second workstream is to locate and link plastic trade related import and export data to the relevant countries and potentially the actual open plastic dumps.

CANDIDATE REQUIREMENTS

Required Skills: Fluency in scripting languages for data analysis experience with import export data preferred

Student Eligibility: Master’s, Senior, Junior, Sophomore, Freshman

International Students on F1 or J1 Student Visa: Yes, eligible
School: Teachers College

Department: Human Development

Average Hours per Week: Approximately 10

Stipend Amount: $3,000

Project Overview: Generative AI has shown great promise for education, but who it might actually benefit in practice is a serious equity concern. This project aims to shed light on this dilemma by examining systematic disparities in public responses to generate AI in education, including 1) institutional academic policies; 2) students’ online discussions; and 3) relationships between these responses and institutional characteristics. Project tasks may include: 1) acquiring and cleaning large-scale text and administrative data via web scraping or APIs; 2) performing NLP tasks such as sentiment analysis and topic modeling, potentially using LLMs; and 3) statistical analyses, reporting, and data visualization, including geospatial mapping. The findings will provide solid empirical evidence on digital inequalities in the emergence of generative AI and inform best practices to improve educational equity through these technologies.

CANDIDATE REQUIREMENTS

Required Skills:Qualified students should be skilled in NLP (with Pytorch, Hugging Face, etc.), statistical methods (with R), and have a strong interest in computational social science and a passion for social good. The scholar will work with the research team to contribute to all aspects of the project and lead additional analyses. Students who intend to pursue a doctoral degree in the future is a plus.

Student Eligibility: Master’s, Senior

International Students on F1 or J1 Student Visa: Yes, eligible
School: Vagelos College of Physicians & Surgeons

Department: Genetics & Development (in Systems Biology)

Average Hours per Week: Approximately 10

Stipend Amount: $3,000

Project Overview: We are seeking an enthusiastic and motivated undergraduate student to join our research team as an intern, focusing on the analysis of microscopy data to study chromosome rearrangement and loss of heterozygosity (LOH) after DNA damage. LOH is a principle driver of cancer progression and understanding how it is generated after DNA damage has implications for cancer biology. This internship provides a unique opportunity to contribute to cutting-edge research in genetics. The selected candidate will work closely with experienced researchers and gain valuable skills in data analysis and scientific research techniques.

Key Responsibilities:Microscopy Data Analysis: Analyze microscopy images to study chromosome structure and organization after DNA damage. This includes writing scripts for specialized software to quantify chromosomal aberrations, measure distances between specific chromosomal regions, and assess the overall impact of DNA damage on chromosome rearrangement.

Data Interpretation: Interpret and document the results of microscopy analyses, identifying patterns and trends related to chromosome rearrangement. Identify data features for development of machine learning protocols to classify recombination outcomes. Collaborate with colleagues in Systems Biology to implement the algorithm to draw meaningful conclusions from the data and contribute to scientific discussions.

Literature Review: Stay up-to-date with relevant scientific literature on mitotic recombination, LOH, chromosome rearrangement and DNA damage. Summarize and present key findings to the research team.

Documentation: Maintain detailed records of analysis methods, results, and conclusions. Prepare comprehensive documentation and reports for inclusion in scientific publications.

Visualization: Generate clear and informative visual representations of the analyzed data, including graphs, charts, and figures, to facilitate data interpretation and presentation.”

CANDIDATE REQUIREMENTS

Required Skills: Strong interest in genomics, DNA damage response, and chromosome biology and the desire to help develop large scale data analysis for a microscopy problem. Basic understanding of microscopy techniques, image analysis and familiarity with data analysis software and programming languages (such as Python, R, or ImageJ) would be a plus. Excellent attention to detail, analytical skills, and ability to work independently. Strong communication skills and ability to work effectively in a team-oriented environment.

Student Eligibility: Master’s, Senior, Junior, Sophomore, Freshman

International Students on F1 or J1 Student Visa: Yes, eligible
School: Arts and Sciences

Department: Columbia Justice Lab

Average Hours per Week: Approximately 10

Stipend Amount: $3,000

Project Overview: The Probation and Parole Reform Project (PPRP), housed in the Columbia Justice Lab, conducts actionable research that challenges the way probation and parole operate in the U.S. We envision a world where probation and parole are smaller, less punitive, equitable, and helpful, and where resources are invested directly to communities in ways that advance collective efficacy, opportunity, and racial equity. As a key part of this work, we seek to understand and publicize the full carceral impact of probation and parole policies, also known as community supervision – a key area of concern is jail detention for technical supervision violations.

While probation and parole were designed to divert people away from incarceration, community supervision is often attached to fees, curfews, and employment or programming mandates. When someone is unable to fulfill these conditions they become at risk of arrest or incarceration due to a technical violation of supervision requirements. Community supervision casts a wide net, surveilling three times as many people as there are in prisons. However, the number of people being incarcerated due to community supervision violations is not captured in current data or policy analysis.

The DSI Scholar will leverage a recently-available jail data to better capture the larger footprint of community supervision, and to identify inequalities in incarceration due to probation and parole across time and space. The dataset contains individual level arrest data for probation and parole violations scraped daily from over 1000 publicly available jail rosters in the U.S. since 2019. The end goal would be to use this data to highlight and better understand the full scope of incarcerations due to technical violations, and design empirically grounded policy recommendations on how to minimize incarceration and reduce racial inequalities within community supervision.

CANDIDATE REQUIREMENTS

Required Skills: The scholar must be proficient in R and experience with Python is a plus. They should also have experience with web-scraping and database management for large, longitudinal datasets. Experience with data visualization is also essential, including graphical presentations of longitudinal data as well as experience working with and presenting spatial data.

We are also interested in linking administrative datasets. For example, linking jail rosters to voter registration data. For this, the ability to automate data cleaning processes is also highly encouraged. For example, designing algorithms to match individuals across multiple arrest records even when their name is misspelled in a subset of observations.

Student Eligibility: Master’s, Senior

International Students on F1 or J1 Student Visa: Yes, eligible
School: School of Engineering and Applied Science

Department: Civil Engineering & Engineering Mechanics

Average Hours per Week: Approximately 10

Stipend Amount: $3,000

Project Overview: In recent years, Large Language Models (LLMs), such as GPT-3, GPT-4, and LLama 2, are algorithms trained on extensive datasets, exhibiting exceptional zero-shot learning capabilities across numerous unlabelled tasks. Building on this notion, in-context learning involves conditioning LLMs on specific linguistic instructions or task demonstrations, subsequently enabling them to tackle analogous tasks through sequence predictions. In the field of Travel Mode Analysis, a significant volume of unlabeled data exists. Of particular interest are the unlabelled tweets generated by commuters, which offer insights into evolving travel patterns, especially in the context of events like a pandemic. By harnessing the strengths of LLMs and in-context learning, there exists potential to extract valuable insights from unlabelled data.

CANDIDATE REQUIREMENTS

Required Skills: Experience in coding in Python. Experience in NLP and PyTorch is preferred.

Student Eligibility: Master’s

International Students on F1 or J1 Student Visa: Yes, eligible
School: Columbia Climate School

Department: Advanced Consortium on Cooperation, Conflict and Complexity (AC4)

Average Hours per Week: Approximately 10

Stipend Amount: $3,000

Project Overview: “Hate Speech” is a term used by peacebuilders, content moderators, policy-makers, and others, to label and categorize language, especially as it shows up in digital media. It is associated with inciting conflict and violence, and it may reflect the conditions of social relations among people across nations. Yet, while hate speech continues, so do other forms of speech that may reflect prosocial behaviors among people around the world as well. What are the properties of this “Peace Speech” that may lead to better outcomes and support continued and sustainable peaceful conditions in nations throughout the world?

Our interdisciplinary team in the Advanced Consortium on Cooperation, Conflict and Complexity (AC4) at the Columbia Climate School includes researchers in psychology, social psychology, environmental sustainability, natural resource governance, and applied anthropology. Together, our team is working to identify linguistic differences from peaceful and less peaceful societies, and the features of “Peace Speech”, that may reflect and support social processes underlying sustainably peaceful conditions. Using 3 data bases, we have already identified individual words that machine learning models use to best classify nations as lower or higher peace. See for example, https://arxiv.org/abs/2305.12537 We now want to cluster those words into topics to identify which topics are most important in differentiating lower and higher peace countries, so that we can gain insight into the social processes that those topics represent.

CANDIDATE REQUIREMENTS

Required Skills: Fluency in Python, natural language processing (cleaning text, NLTK, spaCy, Google’s BERT, HuggingFace XLnet), longitudinal analysis (time series), clustering analysis (k-means, word2vec, cosine similarity, ChatGTP), machine learning (logistic regression, random forest, XGBoost, support vector machines, neural networks, deep learning). The short term goal is to identify the topics in news and social media that best classifies lower and higher peace countries, topics such as governance, politics, international relations, work, everyday life activities, economics, arts, personal preferences, hobbies, etc. The longer term goal is to use machine learning and AI to identify the social processes that underlie “Peace Speech”.

Student Eligibility: Master’s, Senior, Junior, Sophomore, Freshman

International Students on F1 or J1 Student Visa: Yes, eligible
School: Climate School

Department: International Research Institute for Climate and Society (IRI) and Department of Earth and Environmental Sciences (DEES)

Average Hours per Week: Approximately 10

Funding: This is a grant funded project. Exact amount of funding will depend on hours completed.

Project Overview: Sub-seasonal-to-seasonal (S2S; weeks to months) time-scale predictions have great potential societal benefits, such as early warnings of heavy rains, droughts, and heat waves. Reliable forecasts a few weeks ahead can provide invaluable tools for routine planning in the agriculture, water resources, public health, and humanitarian aid sectors. However, the skill of current S2S forecasts made using large physics-based climate model ensembles is limited, partly because model simulations depart from reality. Calibration of climate-model forecast probabilities is necessary to account for model deficiencies and produce reliable forecasts. State-of-the-art calibration methods in IRI’s climate predictability tools are linear, which is a limitation because the atmospheric flow is inherently non-linear, and model errors often grow exponentially.

The goal of this project is to develop machine learning/artificial intelligence (ML/AI) forecast tools that enable non-linear bias correction to meet the growing service demands on improved forecast products at S2S time scales. The intern will code and run test cases to compare the performance of different ML methods (e.g., Regression Trees, CNNs, deep learning) to improve Indian summer monsoon probabilistic forecast skill by bias-correcting/calibrating sets of S2S forecast ensembles from large physics-based climate models run at global climate forecasting centers (e.g., NCEP, ECMWF) and archived in IRI Data Library.

CANDIDATE REQUIREMENTS

Required Skills: Fluency in Python coding and libraries, Jupyter Notebooks, and use of GitHub repos. Experience with using various ML methods (e.g., Regression Trees, CNNs, deep learning) is required. Experience with large climate data and model output datasets would be an advantage, but is not required.

Student Eligibility: Master’s, Senior, Junior

International Students on F1 or J1 Student Visa: Yes, eligible
School: Medicine

Department: Pathology & Cell Biology

Average Hours per Week: Approximately 10

Funding: This is a grant funded project. Exact amount of funding will depend on hours completed.

Project Overview: The brain is the most complex organ in the body, composed of billions of neurons and trillions of connections between those neurons. Those connections are known as synapses and have been for many years the subject of intense study. What is less clear, however, is how synapses are organized at a population level throughout the brain. To start to address this, we developed a method that analyzes individual synapses using spatial and intensity metrics and scaled this approach to analyze hundreds of thousands of synapses concurrently. By doing so, we found that synapses fall into previously unknown, but functionally-relevant, subpopulations. The student project, which is a collaboration between 2 groups (the Au lab in Pathology and Cell Biology and Menon lab in Neurology) will be to help identify synaptic subpopulations under various experimental conditions and to and to analyze their spatial arrangement in the brain. This will help to reveal functional submotifs in the cortex and glean novel insights into cortical circuit organization.

CANDIDATE REQUIREMENTS

Required Skills: Fluency in python is a must. Experience with machine learning, pytorch and scanpy preferred. Experience with multidimensional image analysis ideal.

Student Eligibility: Master’s, Senior

International Students on F1 or J1 Student Visa: Not Eligible