DSI Scholars Projects

Fall 2024 Projects

The DSI Scholars Fall 2024 Student Application has closed.

Eligibility

Please note that in order to be considered for Fall 2024 projects, students must be enrolled for the Fall 2024 semester. Students who are graduating in May or October 2024 are not eligible for the Fall application.

Applying to More than One Project

Students are welcome to apply for up to 5 projects per term. You must submit a separate application for each project. If you submit more than 5 applications, we will randomly select 5 of your applications for submission.

Students interested in participating in this program are encouraged to review the list of available projects below. Before applying, please be sure to read the descriptions carefully to ensure you meet the eligibility requirements and prerequisites for each project.

For more information about the program, including the program benefits, application process and timeline, please visit the DSI Scholars Student Information Page.

Faculty interested in participating in Fall 2024 are encouraged to review the DSI Scholars Faculty Information page for details.

Important Dates:

June 3, 2024 (11:59 PM ET): Student applications due
June 17 – 28, 2024 (expected): Student interviews
July 12, 2024 (expected): Decision notification will be sent to all applicants by email
September 2024 – December 2024: Get started on your Scholars research project (Approximate duration. Exact dates will depend on the faculty supervisor)

Please note: During the summer, students should expect to spend approximately 150 hours on the project. The final number of hours and project duration will depend on the faculty supervisor. The minimum research stipend for summer will be $3,000.

Fall 2024

School: Vagelos College of Physicians and Surgeons

Department: Ophthalmology

Project Overview:

Motivated by the global prevalence of untreated vision impairment, we seek to address the need for more accurate and timely diagnosis of irreversible eye diseases. Traditionally, ophthalmologists rely on 2D optical coherence tomography (OCT) reports derived from raw 3D OCT data, but this approach can lead to errors, especially in cases of atypical ocular anatomy or imaging artifacts. Although raw 3D OCT data provides a more comprehensive view of the retina, its complexity and time-intensive analysis present significant hurdles to its practical use. This DSI Scholars project has two aims: (1) develop a deep learning (DL) model that transforms seamlessly between 2D OCT reports and the underlying 3D OCT data, and (2) visualize this 3D data for ophthalmologists through augmented reality/virtual reality (AR/VR).

Aim 1: Working closely with ophthalmologists at CUIMC, we will acquire annotations within temporal regions of 3D OCT data that relate to anatomical features of importance in 2D OCT reports, providing 3D-to-2D mapping information to train our DL model. Architectures to be explored include 3D UNets, autoencoders, or 2D Vision Transformers applied on subsets of slices from 3D OCT volumes (employing self-attention and cross-attention across patches from multiple slices).

Aim 2: With our generative 2D-to-3D transformation model from Aim 1, we will design a method for clinicians to visualize synthesized 2D and 3D data simultaneously through AR and/or VR. This will involve integrating the model developed in Aim 1 into the Unity Engine for real-time processing of inputs and visual rendering for ophthalmologist users.

By implementing these aims, our 2D-to-3D AI transformation and AR/VR visualization system will empower ophthalmologists with comprehensive insights derived from 3D OCT data. By extracting features not accessible through traditional 2D analysis alone, our approach has the potential to assist in expediting care for those suffering from vision impairment.

CANDIDATE REQUIREMENTS

Required Skills:

– Fluency in Python and PyTorch

– Awareness of/willingness to probe literature related to 3D UNets, Vision Transformers, Autoencoders

– Familiarity with Unity Engine, Augmented Reality/Virtual Reality development and workflow

– Interest in ophthalmology, ability/interest to work collaboratively in a team of engineers and physicians

Student Eligibility: Master’s

International Students on F1 or J1 Student Visa: Yes, eligible

School: Zuckerman Mind Brain Behavior Institute

Department: Biology

Project Overview:

Naturalistic animal behavior is built from simpler behavioral modules that reflect the organization and function of the underlying neural circuits.

To understand the parental behavioral differences among Peromyscus mice, we track freely moving adult mice while they are retrieving pups that were removed from their nest. We use marker-less 3D pose-estimation software based on machine learning (SLEAP and DeepLabCut) and extract meaningful parameters including velocity, acceleration, turning, orientation of the adult to the pup, and distance between the pup and the adult mouse. We already have a test dataset of ~200 individual pup retrieval sequences for which we extracted the kinematic parameters.
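
As a rough illustration of how kinematic parameters of this kind can be derived from tracked positions, the sketch below computes speed, acceleration, and adult-to-pup distance from per-frame coordinates of a single keypoint. The function name `kinematics`, the frame rate, and the sample tracks are assumptions made for this example; they are not taken from the lab's pipeline or from SLEAP or DeepLabCut output formats.

```python
import math


def kinematics(adult_xyz, pup_xyz, fps):
    """Per-frame speed, acceleration, and adult-to-pup distance.

    adult_xyz and pup_xyz are lists of coordinate tuples (2D or 3D),
    one tuple per video frame. fps is the frame rate in frames per second.
    """
    dt = 1.0 / fps
    # Speed: displacement between consecutive frames divided by the frame interval.
    speed = [math.dist(p0, p1) / dt for p0, p1 in zip(adult_xyz, adult_xyz[1:])]
    # Acceleration: change in speed between consecutive intervals.
    accel = [(s1 - s0) / dt for s0, s1 in zip(speed, speed[1:])]
    # Distance between the adult and the pup in every frame.
    distance = [math.dist(a, p) for a, p in zip(adult_xyz, pup_xyz)]
    return speed, accel, distance


# Made-up track: the adult walks to a stationary pup and stops.
adult = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (6.0, 8.0)]
pup = [(6.0, 8.0)] * 4
speed, accel, distance = kinematics(adult, pup, fps=1.0)
```

Real tracks would typically need smoothing before differentiation, because frame-to-frame jitter in pose estimates is amplified in the speed and acceleration traces.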

Next, we want to classify individual pup retrieval sequences based on these parameters using random forest or autoregressive hidden Markov model algorithms to identify behavioral modules that the adult animal is routinely performing during this task.

In the future, we want to use these trained models to predict behavioral modules on a new dataset for which we also recorded the mice’s brain activity.

CANDIDATE REQUIREMENTS

Required Skills: The ideal candidate would be comfortable with basic probability as well as multivariate calculus and linear algebra. They will have to implement models and algorithms in Python, so coding proficiency is important. No biological background is strictly needed as we will teach the candidate everything that is needed to successfully finish the project.

Student Eligibility: Master’s, Senior, Junior, Sophomore, Freshman

International Students on F1 or J1 Student Visa: Yes, eligible

School: Climate School

Department: International Research Institute for Climate and Society

Project Overview:

Sub-seasonal-to-seasonal (S2S; weeks to months) time-scale predictions have great potential societal benefits, such as early warnings of heavy rains, droughts, and heat waves. Reliable forecasts a few weeks ahead can provide invaluable tools for routine planning in the agriculture, water resources, public health, and humanitarian aid sectors. However, the skill of current S2S forecasts made using large physics-based climate model ensembles is limited, partly because model simulations depart from reality. Calibration of climate-model forecast probabilities is necessary to account for model deficiencies and produce reliable forecasts. State-of-the-art calibration methods in IRI’s climate predictability tools are linear, which is a limitation because the atmospheric flow is inherently non-linear, and model errors often grow exponentially.

The goal of this project is to develop machine learning/artificial intelligence (ML/AI) forecast tools that enable non-linear bias correction to meet the growing demand for improved forecast products at S2S time scales. The intern will code and run test cases to compare the performance of different ML methods (e.g., Regression Trees, CNNs, deep learning) to improve Indian summer monsoon probabilistic forecast skill by bias-correcting/calibrating sets of S2S forecast ensembles from large physics-based climate models run at global climate forecasting centers (e.g., NCEP, ECMWF) and archived in the IRI Data Library.

CANDIDATE REQUIREMENTS

Required Skills:

– Fluency in Python coding and libraries, Jupyter Notebooks, using GitHub repos.

– Experience with using various ML methods (e.g., Regression Trees, CNNs, deep learning) is required.

– Experience with climate data and model output would be an advantage, but not required.

Student Eligibility: Master’s, Senior, Junior, Sophomore

International Students on F1 or J1 Student Visa: Yes, eligible

School: Vagelos College of Physicians and Surgeons

Department: Radiology

Project Overview:

Mild traumatic brain injury (mTBI), also known as concussion, remains largely invisible on standard MRI images even though survivors of car accidents, falls, or domestic disputes report neurological and cognitive symptoms. Some patients recover within months, but as many as 60% remain symptomatic at 6 months, and 30% or more suffer for years, many for the rest of their lives. The lack of accepted and easily adopted clinical diagnostic tools severely limits identification of the subgroup with poor prognosis, development of interventions and treatment options, and advancement in understanding the underlying mechanisms of the injury.

We use diffusion tensor imaging (DTI), a widely available quantitative MRI technique, to detect and visualize subtle abnormalities in the microstructure of the brain's white matter in these patients. Each patient’s images are compared to those of healthy controls in a voxel-by-voxel manner to localize areas of abnormality. Computationally, this is a very CPU-intensive process, which requires tasks including image processing, image registration, and robust statistics.

The main goal of this project is to develop an AI-based algorithm to speed up identification of localized regions of abnormalities. A large data repository, including processed images from healthy controls and mTBI patients, is available within the Translational Neuroimaging Laboratory at CUIMC to develop and train the algorithm and perform its testing. The project aims to achieve the following targets for key performance indicators of the new algorithm:
1. Speed: identify abnormality in less than 10 min on an AMD Threadripper 3.9 GHz CPU with an NVIDIA Quadro T600 video card.
2. Overall quality metric: true positive 99%; true negative 99%.
3. Patient-specific quality metric when an abnormality is detected: the Dice index of the new algorithm against the existing implementation is better than the Dice index of the existing implementation against itself shifted by 1 mm in the three cardinal directions and along the main diagonals for that patient.
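
The third target can be read as a comparison between two Dice scores. The sketch below is one possible interpretation, not the laboratory's evaluation code: masks are represented as sets of voxel indices, voxels are assumed to be 1 mm isotropic (so a 1 mm shift is one voxel per axis), and the new result is required to beat the best of the shifted self-comparisons. All names and the toy masks are hypothetical.

```python
def dice(a, b):
    """Dice index between two masks given as sets of (x, y, z) voxel indices."""
    if not a and not b:
        return 1.0  # two empty masks are treated as identical
    return 2.0 * len(a & b) / (len(a) + len(b))


def shift(mask, offset):
    """Translate every voxel in the mask by an integer offset."""
    return {tuple(c + d for c, d in zip(voxel, offset)) for voxel in mask}


# Hypothetical 4x4x4 abnormal region from the existing implementation.
existing = {(x, y, z) for x in range(4) for y in range(4) for z in range(4)}
# Hypothetical output of a new algorithm: the same region minus one voxel.
candidate = existing - {(0, 0, 0)}

# Assumption: 1 mm isotropic voxels, so a 1 mm shift is one voxel per axis.
cardinal = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
diagonal = [(1, 1, 1), (1, 1, -1), (1, -1, 1), (-1, 1, 1)]

# Assumption: the new result must beat the best shifted self-comparison.
baseline = max(dice(existing, shift(existing, o)) for o in cardinal + diagonal)
score = dice(existing, candidate)
passes = score > baseline
```

With these toy masks the shifted baseline is 0.75 and the candidate scores 126/127, so the check passes. How the shifted scores should be aggregated (maximum, mean, or per-direction) is not stated in the project description and would need to be confirmed with the supervisor.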

CANDIDATE REQUIREMENTS

Required Skills: Experience working with various ML/AI models (e.g., ResNet, U-Net, Inception, VGG), documentation/organization skills, familiarity with Linux

Student Eligibility: Master’s, Senior, Junior, Sophomore, Freshman

International Students on F1 or J1 Student Visa: Yes, eligible

School: Climate School

Department: Seismology, Geology and Tectonophysics

Project Overview:

Volcanology has transformed into a highly data-driven and computationally focused field. Numerous computational models have been developed to simulate various physical phenomena during volcanic eruptions. A critical component of forecasting the behavior of volcanic eruptions is assessing the probability of different outcomes and scenarios. For this, scientists implement probabilistic modeling approaches, which provide such assessments. The DSI scholar who joins this project will be tasked with creating workflows that perform simulations and generate hazard maps with quantitative probability assessment. The workflows will be created using Jupyter Notebooks and be based on a range of eruption simulation tools written in different languages. The workflows will utilize probabilistic tools such as Markov Chain Monte Carlo (MCMC) or the Ensemble Kalman Filter (EnKF). The objective of this project is to provide a comprehensive and flexible tool for members of the volcanology community to utilize. In the long term, this tool could be expanded for use with any geophysical process that requires uncertainty quantification.

CANDIDATE REQUIREMENTS

Required Skills:

– Fluency in Python is required.
– Familiarity with statistical methods and notation is highly preferred.
– We expect the scholar to be comfortable with reading academic papers or other high-level readings to familiarize themselves with the concepts.
– Other scripting skills and programming language knowledge will also be beneficial to the work.

Student Eligibility: Master’s, Senior, Junior, Sophomore

International Students on F1 or J1 Student Visa: Yes, eligible

School: School of Nursing

Department: Nursing

Project Overview:

The DSI Scholar will work on cleaning and analyzing data for two projects focused on the influence of daily discrimination on cardiovascular disease risk. The first project is our DSI Seed Grant, which was a 30-day daily diary study that investigated the impact of daily discrimination on sleep health in a sample of Black and Latinx LGBTQ+ Adults. Although we have 3 papers under review and 2 papers in progress from the DSI Seed Grant, we have additional data that needs to be analyzed. Specifically, we plan to use unsupervised machine learning to identify sleep phenotypes and their associations with daily discrimination. The next project, which was recently funded by the National Heart, Lung, and Blood Institute, is a 1-week daily diary study that investigates the influence of anticipated and vicarious discrimination on home blood pressure. The DSI Scholar will assist our team with maintaining our study database, completing ongoing data cleaning, tracking data collection, developing code for data analysis, and addressing any data concerns (as needed).

CANDIDATE REQUIREMENTS

Required Skills:

– The DSI Scholar should have fluency in R/Python and an interest in health disparities research.
– Familiarity with machine learning and multilevel modeling is preferred but not necessary.
– We have longitudinal actigraphy and daily diary data that the DSI Scholar will help analyze but prior experience is not needed.

Student Eligibility: Master’s

International Students on F1 or J1 Student Visa: Yes, eligible

School: Mailman School of Public Health

Department: Epidemiology

Project Overview:

The goal of this project is to predict climate-related physiologic stress on participants in a study on overland migration and nutrition security. The project has two leads and mentors: M. Orjuela-Grimm and Robbie Parks (Environmental Health Sciences). Tasks to fulfill study aims include working with time-referenced and georeferenced data from reported migration trajectories of 104 Latin American overland migrants during the summer and fall of 2023, and possibly summer 2024. The data needs to be matched to date-specific climate variables (ambient air temperature, considering daily ranges, and humidity) and geographically specific elevation, and then modeled to consider changes that may act as physiologic stressors, taking into consideration trajectory (departure point) and geographic challenges. Each migrant will have daily data points from 14 to 100 days. Data sources will include ERA5 and ERA5-Land as well as other sources. Modeling strategies may include estimating heat stress with Wet Bulb Globe Temperature and Heat Index.
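
As one illustration of the heat-stress estimation mentioned above, the sketch below computes Heat Index with the Rothfusz regression used by the US National Weather Service. The function names and the sample values are assumptions made for this example and are not part of the study's pipeline. Inputs are air temperature in degrees Fahrenheit and relative humidity in percent; the regression is intended for warm, humid conditions (roughly 80 °F and above), and the NWS adjustments for very low or very high humidity are omitted here.

```python
def celsius_to_fahrenheit(temp_c):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return temp_c * 9.0 / 5.0 + 32.0


def heat_index_f(temp_f, rh_pct):
    """Heat Index in degrees Fahrenheit from the Rothfusz regression.

    temp_f: air temperature in degrees Fahrenheit.
    rh_pct: relative humidity in percent (0 to 100).
    Humidity adjustments and the cool-weather formula are not applied.
    """
    t, r = temp_f, rh_pct
    return (
        -42.379
        + 2.04901523 * t
        + 10.14333127 * r
        - 0.22475541 * t * r
        - 6.83783e-3 * t * t
        - 5.481717e-2 * r * r
        + 1.22874e-3 * t * t * r
        + 8.5282e-4 * t * r * r
        - 1.99e-6 * t * t * r * r
    )


# Invented daily value for one point along a route: 90 °F at 70% humidity.
hi_example = heat_index_f(90.0, 70.0)
```

For the invented input the result is close to 106 °F. In practice the daily temperature and humidity would come from the reanalysis fields matched to each migrant's date and location, after unit conversion.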

The data can potentially be combined / compared with indicators from the water insecurity experience scale collected from the same population.

The end goal is to create a method to approximate such stressors and model their potential impact on health-related indicators in migrants on overland migration routes. Ultimately the data will be used to help inform health-related service provision at migrant shelters in Mexico. The work would be expected to result in data that would serve for an abstract submission at the end of the fall semester, with subsequent poster presentations and potential manuscript submission. The data is from a multinational pilot study funded by the Institute of Latin American Studies.

CANDIDATE REQUIREMENTS

Required Skills:

– Skill sets include fluency in R and Python, familiarity with GitHub and project management, an interest in geographic information and climate, a working knowledge of Spanish (fluency is an advantage), and knowledge of the geography of Central America and Mexico.

Student Eligibility: Master’s, Senior

International Students on F1 or J1 Student Visa: Yes, eligible

School: Fu Foundation School of Engineering and Applied Science

Department: Biomedical Engineering

Project Overview:

Significance: Reward learning is a core cognitive function that allows humans and other animals to consistently make decisions that optimize behavior and enhance survival. Deficits in reward learning are common in neuropsychiatric disorders, impairing patients’ ability to efficiently interact with their environment. Despite extensive studies in rodents, the generalizability of reward learning mechanisms to humans remains poorly understood. This research aims to close this gap by identifying the computational principles that govern reward learning in the brain. Understanding these mechanisms is critical for identifying disruptions in reward-based processes associated with disorders, thereby improving biomarker identification and enhancing health analytics for better clinical outcomes. Furthermore, by decoding the biological principles of reward learning, this research could lead to the development of a new class of energy-efficient reinforcement learning (RL) models that employ cortical coding schemes.

Approach: This research project focuses on probing the neural codes and computational algorithms the brain uses to tackle dynamic reward-learning tasks. We are particularly interested in tasks where the optimal solution not only changes over time but is also influenced by the policies of other agents within the environment. To investigate this, we will create a simulated environment initially featuring a single agent, with a second agent introduced later. Each agent will be constructed as a biologically plausible reinforcement learning model employing distinct, time-varying learning rules. By simulating a wide range of learning rule functions for each agent, we aim to elucidate the biological reward-learning mechanisms at play both in individual scenarios and in more complex settings where an agent’s policy is affected by dynamic environmental changes, including the strategies adopted by other agents. Working alongside the PI and our interdisciplinary team of computational and systems neuroscientists, a DSI scholar will play a critical role in developing this biological multi-agent RL platform and systematically reverse-engineering the agents’ dynamics.

CANDIDATE REQUIREMENTS

Required Skills:

– ML/AI, deep learning models (prior experience in reinforcement learning (RL) models is strongly preferred), advanced programming, foundational knowledge in linear algebra, calculus, and statistics.

Student Eligibility: Master’s, Senior

International Students on F1 or J1 Student Visa: Yes, eligible

School: CUIMC

Department: Psychiatry

Project Overview:

In our lab we record fluctuations in neurotransmitter levels in the brains of mice in real time while they undergo tests of learning and decision-making. This is accomplished by measuring fluorescent signals from genetically encoded optical biosensors using fiber photometry (Simpson et al., Neuron. 2024, PMID: 38103545). Our overall goal is to understand the neurobiological basis of behaviors disrupted in psychiatric and neurological disorders. The specific aim of this project is to determine how expecting effort influences choice. Effort-based decision making is variable across healthy individuals (from “work-shy” to “workaholic”). For many psychiatric patients an exaggerated weighting of anticipated effort results in debilitating apathy and amotivation.

We collected dopamine recordings from multiple brain regions simultaneously in mice performing effort-based decision tasks in our custom automated test chambers. The DSI Scholar will work on this dataset together with the PI and the lab members that designed, collected, and pre-processed the data. The scholar will use non-linear multiple regression to determine which task events and behavioral measures (including dichotomous and continuous variables) predict the dopamine signals in each brain region. Because some segments of the behavior are self-paced, we will use dynamic time-warping to align some events. Because different physiological processes modulate dopamine release on different timescales, we will also perform dynamic regression modeling by adding lags as explanatory variables.
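
Dynamic time warping, mentioned above for aligning self-paced segments, can be sketched with the classic dynamic-programming formulation. The example below is a minimal illustration under stated assumptions: it uses the absolute difference as the local cost, no window constraint, and two made-up traces in which one is a slowed-down copy of the other. It is not the lab's implementation.

```python
def dtw(a, b):
    """Dynamic time warping between two 1D sequences.

    Returns the accumulated alignment cost and the warping path as a list
    of (index_in_a, index_in_b) pairs.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the best accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j - 1],  # advance both sequences
                cost[i - 1][j],      # advance only a
                cost[i][j - 1],      # advance only b
            )
    # Backtrack from the end to recover which samples were matched.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


# Made-up traces: the second is the first performed at half speed.
fast_trial = [0, 1, 2, 1, 0]
slow_trial = [0, 0, 1, 1, 2, 2, 1, 1, 0, 0]
total_cost, path = dtw(fast_trial, slow_trial)
```

Because the slow trace is an exact stretched copy, the accumulated cost is zero and the path maps each fast sample to its two slow counterparts. Real photometry or behavioral traces would give a nonzero cost, and a window constraint is often added to prevent implausibly large warps.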

Expected Outcomes:
- Identification of task contingencies and behavioral events that predict changes in dopamine signals across timescales.
- Quantification of relationships between dynamic dopamine signals across different brain regions.
- The information gained will inform future experiments (optogenetic manipulations of dopamine neuron activity) to test causality and direction of dopamine-behavior relationships.
- Write-up of the analysis for presentations and an original research article.
- The potential for adapting the developed regression models for other data sets (different behavioral paradigms, neurotransmitters, and brain regions).

CANDIDATE REQUIREMENTS

Required Skills:

– Python, multiple regression models, time series analysis, documentation (e.g., Jupyter notebooks), and data/code sharing platforms (e.g., OSF, GitHub).

– Neuroscience background knowledge is a plus, but basic biology will suffice (all aspects of the data collection and biological relevance will be explained).

– An interest in psychology/psychiatry related research and a desire to work collaboratively with the research team (including undergrads, grad students, postdocs, and associate research scientists).

Student Eligibility: Master’s, Senior, Junior

International Students on F1 or J1 Student Visa: Yes, eligible

Summer Projects 2024 (Closed)

School: Columbia Climate School

Department: Advanced Consortium on Cooperation, Conflict and Complexity (AC4)

Project Overview:

“Hate Speech” in on-line media can incite conflict and violence in real life. We study “Peace Speech” that leads to positive prosocial behaviors that support sustainable peaceful conditions in nations throughout the world. Our interdisciplinary team at the Advanced Consortium on Cooperation, Conflict and Complexity (AC4) at the Columbia Climate School includes researchers in psychology, social psychology, environmental sustainability, natural resource governance, applied anthropology, and data science. We have already successfully used machine learning to identify the words in on-line news media that best classify countries as lower or higher peace, published in 2023 in PLOS ONE, https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0292604.

In those and subsequent studies, we used logistic regression, random forest, XGBoost, SVM, BERT, and XLNet to analyze on-line data from both news media and social media. In this new work we will substantially extend those studies by using new and powerful methods from Artificial Intelligence (AI): 1) to further identify the linguistic differences between lower and higher peace societies, 2) to reveal the social processes that underlie those linguistic differences, and 3) to create a real-time dashboard of the levels of peace and the processes that support them. To accomplish these tasks we will use pre-trained AI systems, such as ChatGPT, Claude, and Bard, and will fine-tune those systems with additional data from studies of the social psychology of peace. Because of implicit and/or explicit bias in the data used to train those proprietary models and the “guardrails” that limit their responses, we may need to explore our own training of open source models such as Llama 2 from Meta and Mixtral from Mistral. This work will advance our scientific understanding of the social factors that enhance peace as well as provide valuable, practical insights for policy makers to support

CANDIDATE REQUIREMENTS

Required Skills: Fluency in Python, natural language processing (NLTK, spaCy, BERT, XLNet), longitudinal analysis (time series), machine learning (logistic regression, random forest, XGBoost, SVM, TensorFlow, PyTorch). The project will be centered on using AI, so familiarity with models such as OpenAI’s ChatGPT, Anthropic’s Claude, Google’s Bard, Meta’s Llama, and Mistral’s Mixtral will be very helpful, as will experience with prompt engineering for AI chat models, fine-tuning and training methods using LangChain, vector databases such as Pinecone, and models such as davinci-002 and Ada. The short-term goals are to use AI to characterize the properties of “Peace Speech” and to identify the social processes that they represent. The longer-term goal is to create a user-friendly dashboard to monitor the current levels of peace in societies for academic research and policy makers.

Student Eligibility: Master’s

International Students on F1 or J1 Student Visa: Yes, eligible

School: School of Social Work

Department: Social Work

Project Overview:

We are conducting a pilot study designed to assess the feasibility and potential promise of the large language model (LLM)-based artificial intelligence (AI) chatbot approach to assist current/future service providers working with LGBTQ+ populations in learning and utilizing the latest science-based knowledge about LGBTQ+ issues and intervention. During the June – August 2024 period, the project goals are to finalize and implement evaluation and benchmarking tools to assess the quality (i.e., validity/accuracy compared to the empirical scientific knowledge base) and utility of outputs from popular and promising existing LLM AI chatbots. The evaluation and benchmarking tools involve human- and machine-driven approaches. We seek a DSI-supported skilled student to assist and participate, particularly in the machine-driven evaluation/benchmarking, as well as in identifying prevalence and conditions that result in chatbot hallucinations. As interested and appropriate, the student could also assist with the wider study activities (e.g., human-driven evaluation/benchmarking, developing and/or training with an appropriate corpus, publication, and presentation of findings, and grant writing). We also anticipate areas where the student may contribute/develop their areas of interest/specialization outside of the current pilot study (e.g., methodological issues with experimental research with LLM AI chatbots). We note that there should be DSI Scholar appropriate work in the September – December 2024 timeframe as well.

The types of tasks that might be required to fulfill the study aims include:

– Implement machine-driven evaluation of the study’s selected chatbots

– Use appropriate statistical or data analysis tools

– Learn and contribute regarding data provenance and detailed records of research procedures

– Ensure all research activities comply with ethical, equity, and safety standards.

– Attend project team meetings and perform administrative activities as needed

– Contribute and/or lead presentations and publication of findings, implications, etc.

CANDIDATE REQUIREMENTS

Required Skills:

– Understanding of concepts and techniques used in LLM and LLM implementation.

– Skills in text processing, language modeling, and understanding the nuances of human language (natural language processing).

– Programming and Software Engineering: Proficiency in programming languages like Python and knowledge of software development practices and tools.

– Ability to work with large datasets, including data cleaning, analysis, and visualization.

– Designing robust and scalable systems to support machine learning applications.

– Knowledge of cloud services and distributed computing for training and deploying large language models.

– Skills in designing user-friendly interfaces and understanding user needs for applications like ChatGPT.

– Keeping up with the latest AI research and being able to implement or adapt new findings.

– Understanding of the ethical, equity, and safety implications of AI and developing systems responsibly.

– Working effectively in multidisciplinary teams and communicating complex concepts clearly.

– Writing and analytical skills.

Student Eligibility: Master’s, Senior, Junior

International Students on F1 or J1 Student Visa: Yes, eligible

School: Arts and Sciences / Climate School

Department: Earth and Environmental Science / LDEO

Project Overview:

The ocean significantly mitigates climate change by absorbing fossil fuel carbon from the atmosphere. Cumulatively, since preindustrial times, the ocean has absorbed 40% of emissions. Marine Carbon Dioxide Removal (mCDR) refers to proposed engineered efforts to supplement the ocean’s natural uptake of anthropogenic CO2 from the atmosphere. A major challenge for mCDR is to quantify the additional carbon removal from the atmosphere given the large natural background carbon sink. Better understanding of the natural air-sea CO2 fluxes at regional scales is therefore required before mCDR additionality can be quantified.

To understand past changes, diagnose ongoing changes, and to predict the future behavior of the ocean carbon sink, we must understand its spatial and temporal variability. However, the ocean is poorly sampled and so we cannot do this directly from in situ measurements. In the McKinley group, we have developed several data science techniques to reconstruct ocean carbon data based on association to satellite-based full-field driver data. With this project, we wish to determine how well current and future ocean carbon observations can constrain background air-sea CO2 fluxes in potential mCDR deployment regions.

In summer 2024, the DSI Scholar will begin by learning about methods and data needed for this project, such as the pCO2-Residual product (Bennington et al. 2022, JAMES, doi:10.1029/2021MS002960) and output from Earth System Models (ESMs). They will review existing code and help develop improved workflows with a strong focus on data sharing and reproducibility. We look forward to having their expertise to improve machine learning methods in order to produce pCO2 products at smaller scales, specifically in areas of potential mCDR deployment. The student will also contribute to analysis of the reconstructed ocean carbon data and be included in publications resulting from this work.

CANDIDATE REQUIREMENTS

Required Skills: Fluency in Python, experience with foundational ML

Student Eligibility: Master’s

International Students on F1 or J1 Student Visa: Yes, eligible

School: Mailman School of Public Health

Department: Epidemiology

Project Overview:

This highly innovative and significant Data Science Institute Seed Project application will use a machine learning informed natural language processing (NLP) approach to qualitatively identify patterns of and reasons for engaging in opioid-related polysubstance use, as well as narratives around overdose and HIV risk behaviors, from publicly available discussion forums on Reddit, a popular social media platform that provides a ready-made source of abundant, naturalistic, first-person narratives for understanding substance use behaviors and patterns. This work takes an interdisciplinary approach by integrating data science, substance use epidemiology, and public health to improve our understanding of polysubstance use patterns. We propose to use a human-in-the-loop machine learning approach, specifically an NLP method, to analyze unstructured Reddit comments, automatically cluster large volumes of similar unstructured text, unearth latent patterns of polysubstance use, and qualitatively explore the trends, patterns, and themes. Data collection for this project will rely on a “human-in-the-loop” or “supervised” natural language approach with the following steps:

1. retrieve data from opioid-related subreddits of interest,

2. feed the algorithm with key drug terms to develop polysubstance use topics,

3. use the algorithm-developed topics to extract the polysubstance-relevant subset of data,

4. select a random sample of the data, and

5. conduct a rapid review of the sample.

We will repeat steps two through five until the random sample consists of posts about polysubstance use, overdose, and HIV-related behaviors. Data will be analyzed using directed content analysis, using Latent Dirichlet Allocation (LDA) to infer latent substance use topics from the comments posted by redditors. Four focus groups of four to eight participants each will be recruited to ecologically validate the NLP findings and capture the lived experiences of people who engage in opioid-related polysubstance use.

CANDIDATE REQUIREMENTS

Required Skills: Fluency in R/Python, methods for Natural Language Processing, Latent Dirichlet Allocation (LDA), sentiment analysis, supervised and unsupervised machine learning, predictive modeling, etc.

Student Eligibility: Master’s

International Students on F1 or J1 Student Visa: Yes, eligible
School: School of Engineering and Applied Science

Department: Civil Engineering and Engineering Mechanics

Project Overview:

The reliability of public charging infrastructure is paramount for the successful transition to road transportation electrification. Consumers need to perceive it as dependable to consider shifting to electric vehicles (EVs) or to avoid reverting to internal combustion engines. To ensure a reliable charging infrastructure network, faulty or unusable chargers need to be swiftly identified and repaired.

While standard monitoring can detect several failures, such as those in software and the electrical system, other failures like broken connectors or physical impediments hindering drivers from successfully charging are not currently captured [1]. Addressing these issues often requires expensive physical monitoring or relies on customer reports. However, a shift from typical charging point utilization may indicate the potential presence of undetected faults.

This project aims to explore a variety of alternative unsupervised learning techniques for anomaly detection. The goal is to identify and predict anomalous EV charging point use in public charging points using publicly available charging transaction data. The project will also analyze the relationship between the occurrence of anomalies or their duration and the characteristics of the charging point, such as venue type, location, and pricing category. Additionally, normal utilization metrics will be examined to identify any patterns related to maintenance issues.
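As one very simple example of unsupervised anomaly detection on utilization data, the sketch below flags days whose session count deviates strongly from the typical level, using a median/MAD-based robust z-score. It is only an illustrative baseline: the daily counts are invented, the 3.5 threshold is a common rule of thumb rather than a project choice, and it is not one of the specific techniques the project will test.

```python
from statistics import median

def robust_z_scores(values):
    """Score each value by its distance from the median, scaled by the
    median absolute deviation (MAD)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [0.0 for _ in values]
    # 0.6745 makes the score comparable to a standard z-score
    # when the data are roughly normal.
    return [0.6745 * (v - med) / mad for v in values]

def flag_anomalies(daily_sessions, threshold=3.5):
    """Return indices of days whose session count is unusually far from typical."""
    scores = robust_z_scores(daily_sessions)
    return [i for i, s in enumerate(scores) if abs(s) > threshold]

# Invented daily session counts for one charging point; the drop to zero
# mimics a charger that is physically unusable but reports no fault.
sessions = [12, 11, 13, 12, 14, 12, 0, 0, 13, 11]
anomalous_days = flag_anomalies(sessions)
```

A baseline like this ignores weekly seasonality and trends, which is one reason sequence models are of interest for this problem.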

The project utilizes public nationwide data from the US Department of Energy (specifically the EV-WATTS datasets) as well as other datasets.

[1] Karanam, V., Tal, G. (2024) Enhancing Electric Vehicle Charger Reliability: Developing a Tool to Swiftly Detect Hidden Charger Faults, Poster Presentation, 2024 TRB Annual Meeting.

CANDIDATE REQUIREMENTS

Required Skills: The student should be proficient in Python programming and time series data modeling. LSTM autoencoder models are among the time series anomaly detection techniques that will be tested.

Student Eligibility: Master’s, Senior

International Students on F1 or J1 Student Visa: Yes, eligible
School: Vagelos College of Physicians & Surgeons

Department: Emergency Medicine

Project Overview:

In febrile infants younger than 30 days, lumbar puncture (LP) is a procedure routinely performed to evaluate for meningitis. LPs are mainly performed in the emergency setting by clinicians and trainees. However, novice success rates are historically poor, with failure rates over 60% that can lead to diagnostic uncertainty, prolonged pain, and unnecessary resource utilization. Reducing unsuccessful and traumatic LPs in infants can improve diagnostic ability and reduce patient harm. Ultrasound performed at the point of care has the potential to increase LP success rates through improved visualization of the anatomy; however, it depends on the skill of the operator to interpret findings accurately, thereby limiting its efficacy in the population of providers that most needs it.

The main purpose of this project is to use a pre-existing database of ultrasound spinal anatomy videos to develop an artificial intelligence algorithm that can identify the important anatomic structures for planning an infant lumbar puncture procedure.

We have already successfully designed a binary classification system using a limited dataset. Our next step is to work on object localization to help identify specific anatomic features of interest.

The specific aim is to design an object localizer for specific spinal anatomy using a corpus of ultrasound data and test accuracy of algorithmic feature recognition against expert labels in a hold-out set. Our secondary aim is to deploy the algorithm on a website or tablet to test real-time processing of ultrasound data.

To fulfill this aim, the team will need to achieve the following tasks:

1. Assist with object-level annotation of features

2. Use machine learning to develop an intelligent algorithm for automated feature recognition

3. Test algorithm accuracy against expert gold standard

4. Deploy algorithm on website or local tablet to test real-time processing of data

We have a labelled data-set of 1515 frames with binary classification of anatomic features and an augmented dataset of 11224 frames.

Our desired end goal is a functional algorithm that can identify key features of spinal anatomy on ultrasound at a threshold of >95% accuracy.
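One common way to score an object localizer against expert labels is intersection over union (IoU) between predicted and expert bounding boxes. The sketch below assumes that metric and a 0.5 IoU cut-off; both are illustrative assumptions, since the project description does not specify how accuracy will be measured, and the boxes are invented.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def localization_accuracy(predicted, expert, iou_threshold=0.5):
    """Fraction of frames whose predicted box overlaps the expert box enough."""
    hits = sum(1 for p, e in zip(predicted, expert) if iou(p, e) >= iou_threshold)
    return hits / len(expert)

# Invented boxes for two hold-out frames: one good prediction, one miss.
expert = [(10, 10, 50, 50), (20, 20, 60, 60)]
predicted = [(12, 12, 50, 50), (100, 100, 140, 140)]
accuracy = localization_accuracy(predicted, expert)
```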

CANDIDATE REQUIREMENTS

Required Skills: Experience working with various ML/AI models (e.g. ResNet, AlexNet, VGG), documentation/organization skills (e.g. Jupyter Notebook, GitHub), HTML (optional for parsing real-time algorithm).

Student Eligibility: Master’s

International Students on F1 or J1 Student Visa: Yes, eligible
School: Columbia College

Department: Latin American and Iberian Cultures

Project Overview: This project forms part of my book manuscript, Sorcery and the City in Post-Slavery Brazil. My project analyzes 135 witch trials that occurred during the first half of the twentieth century in Brazil to better understand why colonial anti-witchcraft made a comeback during the first decades of abolition and the first Brazilian Republic. My thesis is that witchcraft accusations were a means to uphold spatial and social divides and segregate cities like Rio de Janeiro, without the need to create racial segregation in written law. Witch hunts allowed the police to uphold state ideologies of racial and class divisions.

The main type of data I have collected are street addresses of where accused witches lived in Rio de Janeiro during the period from 1881-1942. I would like to work with a data science assistant to help me map these addresses onto old and contemporary maps of Rio de Janeiro to do a spatial-historical analysis to determine if these witch hunts did indeed reinforce spatial divides and segregation.

CANDIDATE REQUIREMENTS

Required Skills: The research assistant should have cartography/mapping skills to visualize geographical space (Rio de Janeiro city, state, and neighborhoods). The RA should be able to work with historical maps, create maps, and use contemporary maps (Google Maps).

Student Eligibility: Master’s

International Students on F1 or J1 Student Visa: Yes, eligible
School: Graduate School of Arts and Sciences

Department: Ecology, Evolution and Environmental Biology

Project Overview: The Urban Wildlife Information Network (UWIN) was created by the Urban Wildlife Institute at the Lincoln Park Zoo as an alliance of urban wildlife scientists committed to conducting research to enhance our knowledge of urban wildlife and their relationships with people. While the UWIN project spans multiple universities and other stakeholders across the world, within NYC alone we at the Eco-Epidemiology Lab at Columbia University have a transect of nearly 50 wildlife cameras placed in parks and greenspaces along an urbanization gradient from Brooklyn to the furthest reaches of Nassau County. Our intent is to measure the effects of human occupancy and degrees of urbanization on the species richness and abundance of wildlife and disease vectors.

A study of this scale comes with an ever-increasing amount of data, and in our case, this data comes in the form of hundreds of thousands of images of NYC’s local wildlife! While processing this information is traditionally done by staff, students, and volunteers poring over these images and identifying the number and species of wildlife in each image, we are modernizing our approach with machine-learning AI technologies (such as MegaDetector) to automatically detect and identify the species and quantity of wildlife present in these images, then attach this information to the image’s metadata and upload it to the larger inter-city UWIN database. While we plan for this project to continue for many years, we are looking for students now to help create and implement a machine-learning model to identify and catalog our current and future sets of raw images by training said model on our 200,000+ already manually processed images, as well as to develop a pipeline that automates re-training of the model on future sets of images.
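The step of turning raw model detections into per-image species counts could look roughly like the sketch below. The record layout (`file`, `detections`, `species`, `confidence`), the file names, and the 0.8 confidence cut-off are hypothetical and chosen only for illustration; they are not the output format of any particular detector or the UWIN database schema.

```python
from collections import Counter

def summarize_detections(detections, min_confidence=0.8):
    """Count confident detections per species for one image."""
    counts = Counter(
        d["species"] for d in detections if d["confidence"] >= min_confidence
    )
    return dict(counts)

def tag_images(results, min_confidence=0.8):
    """Map each image file name to its per-species counts."""
    return {
        r["file"]: summarize_detections(r["detections"], min_confidence)
        for r in results
    }

# Invented detector output for two camera images.
results = [
    {"file": "cam01_0001.jpg", "detections": [
        {"species": "raccoon", "confidence": 0.93},
        {"species": "raccoon", "confidence": 0.88},
        {"species": "squirrel", "confidence": 0.41},
    ]},
    {"file": "cam01_0002.jpg", "detections": []},
]
tags = tag_images(results)
```

Low-confidence detections, such as the squirrel here, are natural candidates for manual review rather than silent removal.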

CANDIDATE REQUIREMENTS

Required Skills: Some Python coding experience is required, and anything beyond is a plus. Previous experience with machine-learning and/or image analysis is preferred, but not necessary. No previous knowledge of wildlife identification or ecological principles is needed, but an interest in the natural sciences and local wildlife is highly encouraged.

Student Eligibility: Master’s, Senior, Junior, Sophomore, Freshman

International Students on F1 or J1 Student Visa: Yes, eligible
School: Vagelos College of Physicians and Surgeons

Department: Pediatrics

Project Overview:

We are looking for a student who will join our studies on the impact of the prenatal environment on brain development. We have developed and are studying a unique mouse model for placental dysfunction that has autism-like behaviors, particularly in male offspring (Vacher et al., Nat Neurosci, 2021). We have RNA sequencing data from multiple brain regions from both mice that had placental insufficiency and matched controls across development. We have examined some of these data sets already but we now aim to analyze the RNA sequencing data specifically from the hippocampus, a critical brain region involved in memory and mood regulation.

The student will utilize bioinformatics tools to analyze RNA sequencing data from the mouse hippocampus. They will identify genes and pathways that are differentially expressed and associated with placental dysfunction and autism. This analysis will be conducted at different developmental stages to identify any deviations in the developmental trajectory of the hippocampus in our autism model compared to neurotypical brains. The project will also investigate the influence of biological sex, a significant factor in autism. Furthermore, the student will perform statistical analyses to determine the significance of the findings, taking into account variables such as genotype, sex, and age.

Expected Outcomes:

– Identification of differentially expressed genes and pathways associated with placental dysfunction and autism in the hippocampus

– Insights into the molecular mechanisms underlying the link between placental dysfunction and autism

– Contribution to scientific knowledge through research publications and presentations

CANDIDATE REQUIREMENTS

Required Skills: The student should be proficient in R. Familiarity with R packages for RNAseq analysis such as DESeq2, ggplot, and GSEA, as well as visual presentation of sequencing data is a plus. Interest in developmental biology, neuroscience or medicine would be advantageous.

Student Eligibility: Master’s

International Students on F1 or J1 Student Visa: Yes, eligible
School: Vagelos College of Physicians and Surgeons

Department: Zuckerman Mind Brain Behavior Institute | SNF Center for Precision Psychiatry & Mental Health

Project Overview:

The nervous system gives rise to behavior, and behavior reflects pathological brain function. Understanding the pathophysiology underlying mental disorders and providing innovative therapeutic avenues requires the detailed study of symptomatology in experimental animal models. During the last decade, pose estimation approaches have revolutionized animal tracking. Researchers from Columbia University have developed a state-of-the-art machine learning package, Lightning Pose (LP), that tracks the pose of freely moving animals, enabling the study of behavior with unprecedented accuracy. This package provides the 3D coordinates of the body parts of behaving mice, which can then be subjected to various sophisticated analyses of behavior, including its dissection into regressive modules, and the analysis of their transition probabilities through sequences of behavior.

The present project aims to analyze the LP-generated mouse pose data using available (Keypoint-Moseq, VAME), in-development (Lightning), and custom-made machine learning or mathematical and statistical modeling analysis pipelines. This will allow us to gain novel insights into the effects on behavior of rare mutations that are considered to be the strongest etiological factors of schizophrenia currently identified, and to associate them with disease symptomatology. Additionally, this will allow us to configure a high-throughput working pipeline to assess the effects of conventional and innovative experimental therapeutic approaches in the framework of precision psychiatry.

The student will be working with CSV files containing multivariate time series of x,y coordinates for a set of mouse body parts extracted from video data via previously existing algorithms (LP). In this context, the student will use Python to apply mathematical and statistical tools under the guidance of the supervisor in order to:

– Detect repetitive patterns in the mouse pose that lead to the identification of behavioral modules/motifs.

– Assess transition probability across these patterns.

– Highlight the differences among different experimental groups (i.e. mutant or drug-treated mice).
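To make the second task concrete, here is a minimal sketch of estimating transition probabilities from a sequence of behavioral motif labels. It assumes the motifs have already been identified and collapsed to one label per bout; the motif names and the sequence are invented for illustration.

```python
from collections import defaultdict

def transition_probabilities(labels):
    """Estimate P(next motif | current motif) from consecutive label pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for current, nxt in zip(labels, labels[1:]):
        counts[current][nxt] += 1
    probs = {}
    for state, nxt_counts in counts.items():
        total = sum(nxt_counts.values())
        probs[state] = {n: c / total for n, c in nxt_counts.items()}
    return probs

# Invented motif sequence for one recording session.
motifs = ["groom", "walk", "rear", "walk", "groom", "walk", "rear"]
probs = transition_probabilities(motifs)
```

Computing one such table per animal and comparing the tables between groups (for example, mutant versus control) is one simple way to approach the third task.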

CANDIDATE REQUIREMENTS

Required Skills:

– The student is expected to have fluency in Python (numpy, scipy, pandas, matplotlib, seaborn) with experience in code writing, pipeline building and debugging.

– Basic statistics, including an understanding of significance testing.

– Basic machine learning (linear regression and classification, clustering).

– Experience with deep learning is a plus (training and evaluating models on GPUs).

– Experience modeling time series is a plus (RNNs, NLP/text analyses, HMMs, Kalman Filters).

– Importantly, the student should be interested in applying their skills to psychiatric neuroscience, and to actively participate in a collaborative working environment.

Student Eligibility: Master’s, Senior

International Students on F1 or J1 Student Visa: Yes, eligible
School: Columbia University Irving Medical Center

Department: Department of Biomedical Informatics

Funding Note: This is a grant funded project. Exact amount of funding will depend on hours completed.

Project Overview:

Electronic health records (EHR) provide a population-scale resource to improve the diagnoses of rare diseases, which go unrecognized by most providers due to lack of familiarity. This project aims to leverage cutting-edge biomedical informatics and data science methodology to develop, validate, and demonstrate the clinical utility of an EHR-driven approach for rare diseases clinical decision support systems. Support for diagnosis of rare diseases will enable patients and providers to move efficiently beyond diagnoses to treatments and support for their condition. The types of tasks include training early diagnostic models using EHR data, optimizing the model to overcome any potential bias across different genetic ancestries, and developing visualization tools to provide an explanatory dashboard for clinical decision support. In this project, our aim is to develop a methodology to efficiently identify potential rare disease candidates from large EHR pools. The identified dataset will subsequently undergo manual review and labeling to serve as a training dataset for other supervised learning tasks. The end goal includes manuscript submission and a reproducible pipeline that can be generalized to other external institutions.

CANDIDATE REQUIREMENTS

Required Skills:

– Proficiency in programming languages such as R and Python is essential, along with familiarity with packages such as pandas or dplyr. The student should be able to write clean, efficient code to extract insights from data. Experience working with diverse datasets, including longitudinal data and structured/unstructured sources, is highly valuable. The student should possess the ability to clean, preprocess, and integrate data effectively. Skills in data visualization tools and libraries (e.g., Matplotlib, Seaborn, ggplot2) are a big plus.

– A strong foundation in machine learning and statistical analysis is necessary for building predictive models, conducting hypothesis testing, and extracting meaningful patterns from data.

– Skills with front-end app development (React) or experience with JavaScript will be a big plus.

– Experience with natural language processing and large language models will be a big plus.

Student Eligibility: Master’s, Senior

International Students on F1 or J1 Student Visa: Yes, eligible

Spring Projects 2024 (Closed)

School: School of Nursing

Department: School of Nursing

Average Hours per Week: Approximately 10

Stipend Amount: $3,000

Project Overview: The DSI Scholar will work on cleaning and analyzing data for two projects focused on the influence of daily discrimination on cardiovascular disease risk. The first project is our DSI Seed Grant, which was a 30-day daily diary study that investigated the impact of daily discrimination on sleep health in a sample of Black and Latinx LGBTQ+ Adults. Specifically, we plan to use unsupervised machine learning to identify sleep phenotypes and their associations with daily discrimination. The next project, which was recently funded by the National Heart, Lung, and Blood Institute, is a 1-week daily diary study that investigates the influence of anticipated and vicarious discrimination on home blood pressure. The DSI Scholar will assist our team with developing and maintaining our study database, completing ongoing data cleaning, tracking data collection, developing code for data analysis, and addressing any data concerns (as needed).

CANDIDATE REQUIREMENTS

Required Skills: The DSI Scholar should have fluency in R/Python and an interest in health disparities research. Familiarity with machine learning and multilevel modeling is preferred but not necessary. We have longitudinal sensor and daily diary data that the DSI Scholar will help analyze but prior experience is not needed.

Student Eligibility: Master’s

International Students on F1 or J1 Student Visa: Yes, eligible
School: Climate School

Department: Lamont-Doherty Earth Observatory

Average Hours per Week: Approximately 10

Stipend Amount: $3000

Project Overview: Our group has recently developed a new method for computing the flow of light through the earth’s atmosphere (https://doi.org/10.1029/2023MS003819) – a task that’s key to climate projections and weather forecasting. The method relies on data-driven optimization: one defines a set of states over which to optimize, makes detailed, computationally expensive reference calculations based on those states, then identifies a very small optimal subset of the reference calculations that can be used as a proxy for the fully detailed calculations. The method is appealing in part because it’s flexible – it can be applied to arbitrary conditions with arbitrary cost functions for optimization.

We’d like to make it easier for people to use this idea for their own purposes, starting with using the tools ourselves to do a more complete and complicated version of the idealized problem we first took on. One task will be taking the original set of (clean, modular!) Python scripts and Jupyter notebooks and developing these into a fully general Python package that can be distributed via PyPI and Conda for wider use. During the course of this development we’ll apply the tools to the complete range of greenhouse gasses in the atmosphere, which may require identifying or developing smarter ways of allowing many small contributors to vary at once.
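The general idea of choosing a small subset of reference calculations as a proxy for the full set can be illustrated with a toy greedy search. This is only a sketch under simplifying assumptions and not the published algorithm: the cost function (mean squared error between the full average and the subset average), the greedy strategy, and the numbers are all invented for illustration.

```python
def cost(reference, subset):
    """Mean squared error between the full average and the subset average,
    taken over all training states."""
    total = 0.0
    for row in reference:
        full = sum(row) / len(row)
        proxy = sum(row[j] for j in subset) / len(subset)
        total += (full - proxy) ** 2
    return total / len(reference)

def greedy_subset(reference, k):
    """Grow the subset one point at a time, always adding the point
    that lowers the cost the most."""
    n_points = len(reference[0])
    chosen = []
    for _ in range(k):
        candidates = [j for j in range(n_points) if j not in chosen]
        best = min(candidates, key=lambda j: cost(reference, chosen + [j]))
        chosen.append(best)
    return chosen

# Rows are states, columns are reference calculation points (invented numbers).
reference = [
    [1.0, 2.0, 3.0, 4.0, 10.0],
    [2.0, 2.5, 3.0, 3.5, 9.0],
    [0.5, 1.5, 4.0, 5.0, 9.0],
]
subset = greedy_subset(reference, 2)
subset_cost = cost(reference, subset)
```

Because `cost` is an ordinary function argument away from being swappable, a sketch like this also hints at why the approach adapts easily to different optimization targets.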

If successful the project stands to have an immediate impact – the group has collaborators at both weather forecasting and climate modeling centers who are interested in using a mature version of this technique.

CANDIDATE REQUIREMENTS

Required Skills: The project requires fluency in scientific Python, the ability to refactor code from scripts into Python modules, and the willingness to develop automated testing, packaging, and distribution. Ability and willingness to discuss the underlying physical science would be an advantage.

Student Eligibility: Master’s, Senior

International Students on F1 or J1 Student Visa: Yes, eligible
School: Vagelos College of Physicians & Surgeons

Department: Medicine

Average Hours per Week: Approximately 10

Stipend Amount: $3,000

Project Overview: Diagnostic errors affect up to 12 million adults per year and result in serious harm or death. Incorrectly ordered imaging tests are a major cause of missed diagnoses; however, little is known about why these errors occur. Current methods measuring imaging order errors are limited by reporting bias and the need for chart review. To address these gaps, I propose applying an innovative, systematic approach, the Retract-and-Reorder (RAR) method, to develop automated measures to identify imaging order errors. Electronic health record (EHR) data will be queried to identify imaging RAR events, defined as imaging orders placed, retracted, and subsequently reordered for the same patient with an element of the order changed. We aim to use the RAR method to detect imaging order errors with high accuracy. I aim to develop the first automated wrong-imaging order error measures to 1) examine the epidemiology of imaging order errors in a large healthcare system and 2) provide reliable outcome data for studies to trial system-level interventions to reduce these types of errors, to improve diagnostic safety and accuracy. Specific tasks will include working with a preexisting relational database on a server in the Department of Biomedical Informatics. This database will have robust EHR clinical and log data. From this database, we will use data-driven methods to design the queries for the measures to identify diagnostic imaging order errors. We will use quantitative and qualitative analyses in a mixed-methods research approach to inform query specifications to identify these types of errors with high accuracy.
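The RAR definition above (an order placed, retracted, then reordered for the same patient with an element changed) can be sketched in Python as follows. The field names, the compared order elements, and the 10-minute window between retraction and reorder are hypothetical choices made for illustration; the actual query specifications are what the project sets out to design.

```python
from datetime import datetime, timedelta

def find_rar_events(orders, window=timedelta(minutes=10)):
    """Pair each retracted order with the next order for the same patient,
    placed within `window` of the retraction, that differs in some field."""
    compared_fields = ("modality", "body_part", "contrast")  # assumed elements
    events = []
    ordered = sorted(orders, key=lambda o: o["placed_at"])
    for first in ordered:
        if first["retracted_at"] is None:
            continue
        for second in ordered:
            same_patient = second["patient_id"] == first["patient_id"]
            delay = second["placed_at"] - first["retracted_at"]
            in_window = timedelta(0) <= delay <= window
            changed = any(first[f] != second[f] for f in compared_fields)
            if second is not first and same_patient and in_window and changed:
                events.append((first["order_id"], second["order_id"]))
                break
    return events

# Invented order log: patient A's head CT is retracted and reordered as abdomen CT.
orders = [
    {"order_id": 1, "patient_id": "A", "modality": "CT", "body_part": "head",
     "contrast": False, "placed_at": datetime(2024, 1, 5, 9, 0),
     "retracted_at": datetime(2024, 1, 5, 9, 2)},
    {"order_id": 2, "patient_id": "A", "modality": "CT", "body_part": "abdomen",
     "contrast": False, "placed_at": datetime(2024, 1, 5, 9, 4),
     "retracted_at": None},
    {"order_id": 3, "patient_id": "B", "modality": "MRI", "body_part": "knee",
     "contrast": False, "placed_at": datetime(2024, 1, 5, 9, 5),
     "retracted_at": None},
]
rar_events = find_rar_events(orders)
```

In the project setting the same logic would more likely be expressed as SQL against the relational database, with the window and compared fields tuned through chart review.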

CANDIDATE REQUIREMENTS

Required Skills: Fluency in SQL Server Management Studio is preferred, but not necessary. Fluency in SQL, Python, or R is also preferred, but not necessary.

Student Eligibility: Master’s

International Students on F1 or J1 Student Visa: Not eligible
School: Climate School

Department: Center for International Earth Science Information Network (CIESIN)

Average Hours per Week: Approximately 10

Stipend Amount: $3,000

Project Overview: CIESIN is interested in identifying open plastic dumps that are potentially vulnerable to climate change. The health and environmental risks as well as social justice issues posed by open plastic dumps can be compounded by climate change events.

A DSI Scholar would provide coding and other technical support within the context of this global plastics project through two parallel workstreams. The first is to extract values for land use disturbance, flooding, changes in rainfall and temperature extremes, and demographic information from large datasets and assign these to the polygons delineating plastic dump boundaries over time. The resulting dataset will be explored to identify the climate risks associated with individual open dumps and the populations that could be impacted. The expected platform to be used is Google Earth Engine, with coding in Python or Java. The second workstream is to locate and link plastic trade-related import and export data to the relevant countries and potentially the actual open plastic dumps.

CANDIDATE REQUIREMENTS

Required Skills: Fluency in scripting languages for data analysis; experience with import/export data preferred.

Student Eligibility: Master’s, Senior, Junior, Sophomore, Freshman

International Students on F1 or J1 Student Visa: Yes, eligible
School: Teachers College

Department: Human Development

Average Hours per Week: Approximately 10

Stipend Amount: $3,000

Project Overview: Generative AI has shown great promise for education, but who it might actually benefit in practice is a serious equity concern. This project aims to shed light on this dilemma by examining systematic disparities in public responses to generative AI in education, including 1) institutional academic policies; 2) students’ online discussions; and 3) relationships between these responses and institutional characteristics. Project tasks may include: 1) acquiring and cleaning large-scale text and administrative data via web scraping or APIs; 2) performing NLP tasks such as sentiment analysis and topic modeling, potentially using LLMs; and 3) statistical analyses, reporting, and data visualization, including geospatial mapping. The findings will provide solid empirical evidence on digital inequalities in the emergence of generative AI and inform best practices to improve educational equity through these technologies.

CANDIDATE REQUIREMENTS

Required Skills: Qualified students should be skilled in NLP (with PyTorch, Hugging Face, etc.), statistical methods (with R), and have a strong interest in computational social science and a passion for social good. The scholar will work with the research team to contribute to all aspects of the project and lead additional analyses. An intention to pursue a doctoral degree in the future is a plus.

Student Eligibility: Master’s, Senior

International Students on F1 or J1 Student Visa: Yes, eligible
School: Vagelos College of Physicians & Surgeons

Department: Genetics & Development (in Systems Biology)

Average Hours per Week: Approximately 10

Stipend Amount: $3,000

Project Overview: We are seeking an enthusiastic and motivated undergraduate student to join our research team as an intern, focusing on the analysis of microscopy data to study chromosome rearrangement and loss of heterozygosity (LOH) after DNA damage. LOH is a principal driver of cancer progression, and understanding how it is generated after DNA damage has implications for cancer biology. This internship provides a unique opportunity to contribute to cutting-edge research in genetics. The selected candidate will work closely with experienced researchers and gain valuable skills in data analysis and scientific research techniques.

Key Responsibilities:

Microscopy Data Analysis: Analyze microscopy images to study chromosome structure and organization after DNA damage. This includes writing scripts for specialized software to quantify chromosomal aberrations, measure distances between specific chromosomal regions, and assess the overall impact of DNA damage on chromosome rearrangement.

Data Interpretation: Interpret and document the results of microscopy analyses, identifying patterns and trends related to chromosome rearrangement. Identify data features for development of machine learning protocols to classify recombination outcomes. Collaborate with colleagues in Systems Biology to implement the algorithm to draw meaningful conclusions from the data and contribute to scientific discussions.

Literature Review: Stay up-to-date with relevant scientific literature on mitotic recombination, LOH, chromosome rearrangement and DNA damage. Summarize and present key findings to the research team.

Documentation: Maintain detailed records of analysis methods, results, and conclusions. Prepare comprehensive documentation and reports for inclusion in scientific publications.

Visualization: Generate clear and informative visual representations of the analyzed data, including graphs, charts, and figures, to facilitate data interpretation and presentation.

CANDIDATE REQUIREMENTS

Required Skills: Strong interest in genomics, DNA damage response, and chromosome biology and the desire to help develop large scale data analysis for a microscopy problem. Basic understanding of microscopy techniques, image analysis and familiarity with data analysis software and programming languages (such as Python, R, or ImageJ) would be a plus. Excellent attention to detail, analytical skills, and ability to work independently. Strong communication skills and ability to work effectively in a team-oriented environment.

Student Eligibility: Master’s, Senior, Junior, Sophomore, Freshman

International Students on F1 or J1 Student Visa: Yes, eligible
School: Arts and Sciences

Department: Columbia Justice Lab

Average Hours per Week: Approximately 10

Stipend Amount: $3,000

Project Overview: The Probation and Parole Reform Project (PPRP), housed in the Columbia Justice Lab, conducts actionable research that challenges the way probation and parole operate in the U.S. We envision a world where probation and parole are smaller, less punitive, equitable, and helpful, and where resources are invested directly to communities in ways that advance collective efficacy, opportunity, and racial equity. As a key part of this work, we seek to understand and publicize the full carceral impact of probation and parole policies, also known as community supervision – a key area of concern is jail detention for technical supervision violations.

While probation and parole were designed to divert people away from incarceration, community supervision is often attached to fees, curfews, and employment or programming mandates. When someone is unable to fulfill these conditions they become at risk of arrest or incarceration due to a technical violation of supervision requirements. Community supervision casts a wide net, surveilling three times as many people as there are in prisons. However, the number of people being incarcerated due to community supervision violations is not captured in current data or policy analysis.

The DSI Scholar will leverage recently available jail data to better capture the larger footprint of community supervision, and to identify inequalities in incarceration due to probation and parole across time and space. The dataset contains individual-level arrest data for probation and parole violations scraped daily from over 1000 publicly available jail rosters in the U.S. since 2019. The end goal would be to use this data to highlight and better understand the full scope of incarcerations due to technical violations, and design empirically grounded policy recommendations on how to minimize incarceration and reduce racial inequalities within community supervision.

CANDIDATE REQUIREMENTS

Required Skills: The scholar must be proficient in R and experience with Python is a plus. They should also have experience with web-scraping and database management for large, longitudinal datasets. Experience with data visualization is also essential, including graphical presentations of longitudinal data as well as experience working with and presenting spatial data.

We are also interested in linking administrative datasets. For example, linking jail rosters to voter registration data. For this, the ability to automate data cleaning processes is also highly encouraged. For example, designing algorithms to match individuals across multiple arrest records even when their name is misspelled in a subset of observations.
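Matching individuals whose names are misspelled in some records can be approached with approximate string comparison. The sketch below uses Python's standard-library `difflib.SequenceMatcher`; the record fields (`name`, `dob`), the 0.85 similarity threshold, and the rule of requiring an exact date-of-birth match are assumptions made for illustration, not a description of the project's data or method.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, drop commas, and collapse whitespace before comparing."""
    return " ".join(name.lower().replace(",", " ").split())

def name_similarity(a, b):
    """Similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def same_person(rec_a, rec_b, threshold=0.85):
    """Treat two records as one person when birth dates agree and names are close."""
    return (
        rec_a["dob"] == rec_b["dob"]
        and name_similarity(rec_a["name"], rec_b["name"]) >= threshold
    )

# Invented records: the first two differ by a one-letter misspelling.
a = {"name": "Jonathan Smith", "dob": "1990-04-12"}
b = {"name": "Jonathon Smith", "dob": "1990-04-12"}
c = {"name": "Maria Lopez", "dob": "1990-04-12"}
match_ab = same_person(a, b)
match_ac = same_person(a, c)
```

At the scale of daily scrapes from over 1000 rosters, pairwise comparison would need blocking (for example, comparing only records that share a birth year) to stay tractable.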

Student Eligibility: Master’s, Senior

International Students on F1 or J1 Student Visa: Yes, eligible
School: School of Engineering and Applied Science

Department: Civil Engineering & Engineering Mechanics

Average Hours per Week: Approximately 10

Stipend Amount: $3,000

Project Overview: Large Language Models (LLMs), such as GPT-3, GPT-4, and Llama 2, are algorithms trained on extensive datasets that have, in recent years, exhibited exceptional zero-shot learning capabilities across numerous unlabelled tasks. Building on this notion, in-context learning involves conditioning LLMs on specific linguistic instructions or task demonstrations, subsequently enabling them to tackle analogous tasks through sequence predictions. In the field of Travel Mode Analysis, a significant volume of unlabelled data exists. Of particular interest are the unlabelled tweets generated by commuters, which offer insights into evolving travel patterns, especially in the context of events like a pandemic. By harnessing the strengths of LLMs and in-context learning, there exists potential to extract valuable insights from unlabelled data.
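In-context learning amounts to placing an instruction and a few labelled demonstrations in front of the unlabelled input and letting the model continue the sequence. A minimal sketch of building such a prompt is shown below; the label set, the demonstration tweets, and the prompt wording are invented for illustration, and no particular model or API is assumed.

```python
def build_prompt(demonstrations, tweet, labels):
    """Assemble an instruction, labelled examples, and the unlabelled tweet."""
    lines = [
        "Classify the travel mode mentioned in each tweet.",
        "Possible labels: " + ", ".join(labels) + ".",
        "",
    ]
    for text, label in demonstrations:
        lines.append(f"Tweet: {text}")
        lines.append(f"Travel mode: {label}")
        lines.append("")
    # The final entry is left unlabelled for the model to complete.
    lines.append(f"Tweet: {tweet}")
    lines.append("Travel mode:")
    return "\n".join(lines)

labels = ["subway", "bus", "bike", "car"]
demos = [
    ("Stuck on the platform again, trains delayed 20 min", "subway"),
    ("Cycled to work today, streets are empty", "bike"),
]
prompt = build_prompt(demos, "Traffic on the bridge is brutal this morning", labels)
```

The resulting string would be sent to an LLM, and the text it generates after the final "Travel mode:" would be read as the predicted label.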

CANDIDATE REQUIREMENTS

Required Skills: Experience coding in Python. Experience with NLP and PyTorch is preferred.

Student Eligibility: Master’s

International Students on F1 or J1 Student Visa: Yes, eligible

School: Columbia Climate School

Department: Advanced Consortium on Cooperation, Conflict and Complexity (AC4)

Average Hours per Week: Approximately 10

Stipend Amount: $3,000

Project Overview: “Hate Speech” is a term used by peacebuilders, content moderators, policy-makers, and others, to label and categorize language, especially as it shows up in digital media. It is associated with inciting conflict and violence, and it may reflect the conditions of social relations among people across nations. Yet, while hate speech continues, so do other forms of speech that may reflect prosocial behaviors among people around the world as well. What are the properties of this “Peace Speech” that may lead to better outcomes and support continued and sustainable peaceful conditions in nations throughout the world?

Our interdisciplinary team in the Advanced Consortium on Cooperation, Conflict and Complexity (AC4) at the Columbia Climate School includes researchers in psychology, social psychology, environmental sustainability, natural resource governance, and applied anthropology. Together, our team is working to identify linguistic differences between peaceful and less peaceful societies, and the features of “Peace Speech” that may reflect and support the social processes underlying sustainably peaceful conditions. Using three databases, we have already identified individual words that machine learning models use to best classify nations as lower or higher peace. See, for example, https://arxiv.org/abs/2305.12537. We now want to cluster those words into topics to identify which topics are most important in differentiating lower and higher peace countries, so that we can gain insight into the social processes that those topics represent.
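One simple way to picture clustering words into topics is to group words whose embedding vectors point in similar directions. The sketch below uses a greedy cosine-similarity grouping as a stand-in for a full clustering method such as k-means. The two-dimensional vectors are made-up toy values, not real word embeddings, and the 0.8 threshold is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def group_words(vectors, threshold=0.8):
    """Greedily assign each word to the first topic whose seed word is similar.

    vectors   : dict mapping word -> embedding vector
    threshold : minimum cosine similarity to join a topic (assumed value)
    """
    topics = []
    for word, vec in vectors.items():
        for topic in topics:
            seed = topic[0]
            if cosine(vec, vectors[seed]) >= threshold:
                topic.append(word)
                break
        else:
            # No existing topic is close enough, so this word seeds a new one.
            topics.append([word])
    return topics


# Toy vectors for illustration only; real embeddings would come from a
# model such as word2vec or BERT.
toy_vectors = {
    "vote": (1.0, 0.1),
    "election": (0.9, 0.2),
    "music": (0.1, 1.0),
    "art": (0.2, 0.9),
}
topics = group_words(toy_vectors)
```

The result of a greedy pass depends on the order in which words are visited, which is one reason a method like k-means may be preferred in practice.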

CANDIDATE REQUIREMENTS

Required Skills: Fluency in Python, natural language processing (cleaning text, NLTK, spaCy, Google’s BERT, Hugging Face XLNet), longitudinal analysis (time series), clustering analysis (k-means, word2vec, cosine similarity, ChatGPT), machine learning (logistic regression, random forest, XGBoost, support vector machines, neural networks, deep learning). The short-term goal is to identify the topics in news and social media that best classify lower and higher peace countries, topics such as governance, politics, international relations, work, everyday life activities, economics, arts, personal preferences, hobbies, etc. The longer-term goal is to use machine learning and AI to identify the social processes that underlie “Peace Speech”.

Student Eligibility: Master’s, Senior, Junior, Sophomore, Freshman

International Students on F1 or J1 Student Visa: Yes, eligible

School: Climate School

Department: International Research Institute for Climate and Society (IRI) and Department of Earth and Environmental Sciences (DEES)

Average Hours per Week: Approximately 10

Funding: This is a grant-funded project. Exact amount of funding will depend on hours completed.

Project Overview: Sub-seasonal-to-seasonal (S2S; weeks to months) time-scale predictions have great potential societal benefits, such as early warnings of heavy rains, droughts, and heat waves. Reliable forecasts a few weeks ahead can provide invaluable tools for routine planning in the agriculture, water resources, public health, and humanitarian aid sectors. However, the skill of current S2S forecasts made using large physics-based climate model ensembles is limited, partly because model simulations depart from reality. Calibration of climate-model forecast probabilities is necessary to account for model deficiencies and produce reliable forecasts. State-of-the-art calibration methods in IRI’s climate predictability tools are linear, which is a limitation because the atmospheric flow is inherently non-linear, and model errors often grow exponentially.

The goal of this project is to develop machine learning/artificial intelligence (ML/AI) forecast tools that enable non-linear bias correction, to meet the growing demand for improved forecast products at S2S time scales. The intern will code and run test cases to compare the performance of different ML methods (e.g., Regression Trees, CNNs, deep learning) in improving Indian summer monsoon probabilistic forecast skill by bias-correcting/calibrating sets of S2S forecast ensembles from large physics-based climate models run at global climate forecasting centers (e.g., NCEP, ECMWF) and archived in the IRI Data Library.

CANDIDATE REQUIREMENTS

Required Skills: Fluency in Python coding and libraries, Jupyter Notebooks, and use of GitHub repos. Experience with using various ML methods (e.g., Regression Trees, CNNs, deep learning) is required. Experience with large climate data and model output datasets would be an advantage, but is not required.

Student Eligibility: Master’s, Senior, Junior

International Students on F1 or J1 Student Visa: Yes, eligible

School: Medicine

Department: Pathology & Cell Biology

Average Hours per Week: Approximately 10

Funding: This is a grant-funded project. Exact amount of funding will depend on hours completed.

Project Overview: The brain is the most complex organ in the body, composed of billions of neurons and trillions of connections between those neurons. Those connections are known as synapses and have for many years been the subject of intense study. What is less clear, however, is how synapses are organized at a population level throughout the brain. To start to address this, we developed a method that analyzes individual synapses using spatial and intensity metrics and scaled this approach to analyze hundreds of thousands of synapses concurrently. By doing so, we found that synapses fall into previously unknown, but functionally relevant, subpopulations. The student project, which is a collaboration between two groups (the Au lab in Pathology and Cell Biology and the Menon lab in Neurology), will be to help identify synaptic subpopulations under various experimental conditions and to analyze their spatial arrangement in the brain. This will help to reveal functional submotifs in the cortex and glean novel insights into cortical circuit organization.

CANDIDATE REQUIREMENTS

Required Skills: Fluency in Python is a must. Experience with machine learning, PyTorch, and Scanpy is preferred. Experience with multidimensional image analysis is ideal.

Student Eligibility: Master’s, Senior

International Students on F1 or J1 Student Visa: Not Eligible