DSI Scholars Projects
Spring 2024 Projects
The DSI Scholars Spring 2024 Student Application is now available. Students interested in participating in this program are encouraged to review the list of available projects below. Before applying, please be sure to read the descriptions carefully to ensure you meet the eligibility requirements and prerequisites for each project.
Please note that you are welcome to apply to as many projects as you are interested in.
For more information about the program, including the program benefits, application process and timeline, please visit the DSI Scholars Student Information Page.
Faculty interested in participating in Summer or Fall 2024 are encouraged to review the DSI Scholars Faculty Information page for details.
Important Dates:
- November 6, 2023 (11:59 PM ET): Student applications due
- November 27 – December 8, 2023 (expected): Student interviews
- December 20, 2023 (expected): Decision notification will be sent to all applicants by email.
- January 2024 – May 2024: Get started on your Scholars research project
Spring Projects 2024
-
School: School of Nursing
Department: School of Nursing
Average Hours per Week: Approximately 10
Stipend Amount: $3,000
Project Overview: The DSI Scholar will work on cleaning and analyzing data for two projects focused on the influence of daily discrimination on cardiovascular disease risk. The first project is our DSI Seed Grant, which was a 30-day daily diary study that investigated the impact of daily discrimination on sleep health in a sample of Black and Latinx LGBTQ+ Adults. Specifically, we plan to use unsupervised machine learning to identify sleep phenotypes and their associations with daily discrimination. The next project, which was recently funded by the National Heart, Lung, and Blood Institute, is a 1-week daily diary study that investigates the influence of anticipated and vicarious discrimination on home blood pressure. The DSI Scholar will assist our team with developing and maintaining our study database, completing ongoing data cleaning, tracking data collection, developing code for data analysis, and addressing any data concerns (as needed).
CANDIDATE REQUIREMENTS
Required Skills: The DSI Scholar should have fluency in R/Python and an interest in health disparities research. Familiarity with machine learning and multilevel modeling is preferred but not necessary. We have longitudinal sensor and daily diary data that the DSI Scholar will help analyze but prior experience is not needed.
Student Eligibility: Master’s
International Students on F1 or J1 Student Visa: Eligible
-
School: Climate School
Department: Lamont-Doherty Earth Observatory
Average Hours per Week: Approximately 10
Stipend Amount: $3,000
Project Overview: Our group has recently developed a new method for computing the flow of light through the earth’s atmosphere (https://doi.org/10.1029/2023MS003819) – a task that’s key to climate projections and weather forecasting. The method relies on data-driven optimization: one defines a set of states over which to optimize, makes detailed, computationally expensive reference calculations based on those states, then identifies a very small optimal subset of the reference calculations that can be used as a proxy for the fully detailed calculations. The method is appealing in part because it’s flexible – it can be applied to arbitrary conditions with arbitrary cost functions for optimization.
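The subset-selection idea described above can be illustrated with a minimal sketch. This is not the method from the linked paper: it swaps in a plain greedy search with equal weights, and the function names, the cost function, and the toy numbers are all assumptions made up for illustration.

```python
def full_mean(row):
    """Stand-in for the 'fully detailed' result: the mean over all reference points."""
    return sum(row) / len(row)


def rms_error(reference, subset):
    """Root-mean-square error, across states, of the subset mean versus the full mean."""
    total = 0.0
    for row in reference:
        approx = sum(row[j] for j in subset) / len(subset)
        total += (approx - full_mean(row)) ** 2
    return (total / len(reference)) ** 0.5


def greedy_subset(reference, k, cost=rms_error):
    """Pick k column indices one at a time, each time adding the column
    that gives the lowest cost. Any cost function with the same signature
    can be passed in."""
    n_points = len(reference[0])
    chosen = []
    for _ in range(k):
        candidates = [j for j in range(n_points) if j not in chosen]
        best = min(candidates, key=lambda j: cost(reference, chosen + [j]))
        chosen.append(best)
    return chosen


# Toy reference calculations (made-up numbers): one row per state,
# one column per expensive reference point.
reference = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [0.0, 5.0, 10.0],
]
best_single = greedy_subset(reference, 1)
```

In this toy case a single column reproduces the full mean for every state, so the greedy search finds it on the first step. Real problems would need weights, a better search strategy, and a physically meaningful cost function.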
We’d like to make it easier for people to use this idea for their own purposes, starting with using the tools ourselves to do a more complete and complicated version of the idealized problem we first took on. One task will be taking the original set of (clean, modular!) Python scripts and Jupyter notebooks and developing these into a fully general Python package that can be distributed via PyPI and Conda for wider use. During the course of this development we’ll apply the tools to the complete range of greenhouse gases in the atmosphere, which may require identifying or developing smarter ways of allowing many small contributors to vary at once.
If successful, the project stands to have an immediate impact – the group has collaborators at both weather forecasting and climate modeling centers who are interested in using a mature version of this technique.
CANDIDATE REQUIREMENTS
Required Skills: The project requires fluency in scientific Python, the ability to refactor code from scripts into Python modules, and the willingness to develop automated testing, packaging, and distribution. Ability and willingness to discuss the underlying physical science would be an advantage.
Student Eligibility: Master’s; Senior
International Students on F1 or J1 Student Visa: Eligible
-
School: Vagelos College of Physicians & Surgeons
Department: Medicine
Average Hours per Week: Approximately 10
Stipend Amount: $3,000
Project Overview: Diagnostic errors affect up to 12 million adults per year and result in serious harm or death. Incorrectly ordered imaging tests are a major cause of missed diagnoses; however, little is known about why these errors occur. Current methods measuring imaging order errors are limited by reporting bias and the need for chart review. To address these gaps, I propose applying an innovative, systematic approach, the Retract-and-Reorder (RAR) method, to develop automated measures to identify imaging order errors. Electronic health record (EHR) data will be queried to identify imaging RAR events, defined as imaging orders placed, retracted, and subsequently reordered for the same patient with an element of the order changed. We aim to use the RAR method to detect imaging order errors with high accuracy. I aim to develop the first automated wrong-imaging order error measures to 1) examine the epidemiology of imaging order errors in a large healthcare system and 2) provide reliable outcome data for studies to trial system-level interventions to reduce these types of errors, to improve diagnostic safety and accuracy. Specific tasks will include working with a preexisting relational database on a server from the Department of Biomedical Informatics. This database will have robust EHR clinical and log data. From this database, we will use data-driven methods to design the queries for the measures to identify diagnostic imaging order errors. We will use quantitative and qualitative analyses in a mixed-methods research approach to inform query specifications to identify these types of errors with high accuracy.
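The RAR event definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual query logic: the record fields, the 10-minute reorder window, and the choice of modality and body part as the "element of the order" being compared are all hypothetical, and the sample orders are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ImagingOrder:
    # Hypothetical fields; a real EHR schema will differ.
    order_id: str
    patient_id: str
    modality: str
    body_part: str
    placed_at: datetime
    retracted_at: Optional[datetime] = None


def find_rar_events(orders, reorder_window=timedelta(minutes=10)):
    """Return (retracted_order_id, reorder_id) pairs: an order that was
    retracted, followed within the window by a new order for the same
    patient in which the modality or body part changed."""
    by_patient = {}
    for order in sorted(orders, key=lambda o: o.placed_at):
        by_patient.setdefault(order.patient_id, []).append(order)

    events = []
    for patient_orders in by_patient.values():
        for i, first in enumerate(patient_orders):
            if first.retracted_at is None:
                continue  # never retracted, so it cannot start a RAR event
            for later in patient_orders[i + 1:]:
                gap = later.placed_at - first.retracted_at
                if gap < timedelta(0) or gap > reorder_window:
                    continue  # placed before the retraction or too late
                changed = (later.modality, later.body_part) != (
                    first.modality, first.body_part)
                if changed:
                    events.append((first.order_id, later.order_id))
                    break
    return events


t0 = datetime(2024, 1, 15, 10, 0)
orders = [
    ImagingOrder("A", "p1", "CT", "head", t0, t0 + timedelta(minutes=2)),
    ImagingOrder("B", "p1", "CT", "chest", t0 + timedelta(minutes=5)),
    ImagingOrder("C", "p2", "MRI", "knee", t0),
    ImagingOrder("D", "p1", "XR", "chest", t0 + timedelta(hours=2)),
]
events = find_rar_events(orders)
```

Here order A is retracted and replaced by order B for the same patient with a different body part, so the pair is flagged. Order D falls outside the assumed window and order C was never retracted.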
CANDIDATE REQUIREMENTS
Required Skills: Fluency in SQL Server Management Studio is preferred, but not necessary. Fluency in SQL, Python, or R is also preferred, but not necessary.
Student Eligibility: Master’s
International Students on F1 or J1 Student Visa: Not eligible
-
School: Climate School
Department: Center for International Earth Science Information Network (CIESIN)
Average Hours per Week: Approximately 10
Stipend Amount: $3,000
Project Overview: CIESIN is interested in identifying open plastic dumps that are potentially vulnerable to climate change. The health and environmental risks as well as social justice issues posed by open plastic dumps can be compounded by climate change events.
A DSI Scholar would provide coding and other technical support within the context of this global plastics project through two parallel workstreams. The first is to extract values for land use disturbance, flooding, changes in rainfall and temperature extremes, and demographic information from large datasets and assign these to the polygons delineating plastic dump boundaries over time. The resulting dataset will be explored to identify the climate risks associated with individual open dumps and the populations that could be impacted. The expected platform to be used is Google Earth Engine, with coding in Python or Java. The second workstream is to locate and link plastic-trade-related import and export data to the relevant countries and potentially the actual open plastic dumps.
CANDIDATE REQUIREMENTS
Required Skills: Fluency in scripting languages for data analysis; experience with import/export data preferred.
Student Eligibility: Master’s, Senior, Junior, Sophomore, Freshman
International Students on F1 or J1 Student Visa: Eligible
-
School: Teachers College
Department: Human Development
Average Hours per Week: Approximately 10
Stipend Amount: $3,000
Project Overview: Generative AI has shown great promise for education, but who it might actually benefit in practice is a serious equity concern. This project aims to shed light on this dilemma by examining systematic disparities in public responses to generative AI in education, including 1) institutional academic policies; 2) students’ online discussions; and 3) relationships between these responses and institutional characteristics. Project tasks may include: 1) acquiring and cleaning large-scale text and administrative data via web scraping or APIs; 2) performing NLP tasks such as sentiment analysis and topic modeling, potentially using LLMs; and 3) statistical analyses, reporting, and data visualization, including geospatial mapping. The findings will provide solid empirical evidence on digital inequalities in the emergence of generative AI and inform best practices to improve educational equity through these technologies.
CANDIDATE REQUIREMENTS
Required Skills: Qualified students should be skilled in NLP (with PyTorch, Hugging Face, etc.) and statistical methods (with R), and should have a strong interest in computational social science and a passion for social good. The scholar will work with the research team to contribute to all aspects of the project and lead additional analyses. An intention to pursue a doctoral degree in the future is a plus.
Student Eligibility: Master’s, Senior
International Students on F1 or J1 Student Visa: Eligible
-
School: Vagelos College of Physicians & Surgeons
Department: Genetics & Development (in Systems Biology)
Average Hours per Week: Approximately 10
Stipend Amount: $3,000
Project Overview: We are seeking an enthusiastic and motivated undergraduate student to join our research team as an intern, focusing on the analysis of microscopy data to study chromosome rearrangement and loss of heterozygosity (LOH) after DNA damage. LOH is a principal driver of cancer progression, and understanding how it is generated after DNA damage has implications for cancer biology. This internship provides a unique opportunity to contribute to cutting-edge research in genetics. The selected candidate will work closely with experienced researchers and gain valuable skills in data analysis and scientific research techniques.
Key Responsibilities:
Microscopy Data Analysis: Analyze microscopy images to study chromosome structure and organization after DNA damage. This includes writing scripts for specialized software to quantify chromosomal aberrations, measure distances between specific chromosomal regions, and assess the overall impact of DNA damage on chromosome rearrangement.
Data Interpretation: Interpret and document the results of microscopy analyses, identifying patterns and trends related to chromosome rearrangement. Identify data features for development of machine learning protocols to classify recombination outcomes. Collaborate with colleagues in Systems Biology to implement the algorithm to draw meaningful conclusions from the data and contribute to scientific discussions.
Literature Review: Stay up-to-date with relevant scientific literature on mitotic recombination, LOH, chromosome rearrangement and DNA damage. Summarize and present key findings to the research team.
Documentation: Maintain detailed records of analysis methods, results, and conclusions. Prepare comprehensive documentation and reports for inclusion in scientific publications.
Visualization: Generate clear and informative visual representations of the analyzed data, including graphs, charts, and figures, to facilitate data interpretation and presentation.
CANDIDATE REQUIREMENTS
Required Skills: Strong interest in genomics, DNA damage response, and chromosome biology and the desire to help develop large scale data analysis for a microscopy problem. Basic understanding of microscopy techniques, image analysis and familiarity with data analysis software and programming languages (such as Python, R, or ImageJ) would be a plus. Excellent attention to detail, analytical skills, and ability to work independently. Strong communication skills and ability to work effectively in a team-oriented environment.
Student Eligibility: Master’s, Senior, Junior, Sophomore, Freshman
International Students on F1 or J1 Student Visa: Eligible
-
School: Arts and Sciences
Department: Columbia Justice Lab
Average Hours per Week: Approximately 10
Stipend Amount: $3,000
Project Overview: The Probation and Parole Reform Project (PPRP), housed in the Columbia Justice Lab, conducts actionable research that challenges the way probation and parole operate in the U.S. We envision a world where probation and parole are smaller, less punitive, equitable, and helpful, and where resources are invested directly in communities in ways that advance collective efficacy, opportunity, and racial equity. As a key part of this work, we seek to understand and publicize the full carceral impact of probation and parole (also known as community supervision) policies. A key area of concern is jail detention for technical supervision violations.
While probation and parole were designed to divert people away from incarceration, community supervision is often attached to fees, curfews, and employment or programming mandates. When someone is unable to fulfill these conditions, they become at risk of arrest or incarceration due to a technical violation of supervision requirements. Community supervision casts a wide net, surveilling three times as many people as there are in prisons. However, the number of people being incarcerated due to community supervision violations is not captured in current data or policy analysis.
The DSI Scholar will leverage recently available jail data to better capture the larger footprint of community supervision, and to identify inequalities in incarceration due to probation and parole across time and space. The dataset contains individual-level arrest data for probation and parole violations scraped daily from over 1,000 publicly available jail rosters in the U.S. since 2019. The end goal would be to use this data to highlight and better understand the full scope of incarcerations due to technical violations, and design empirically grounded policy recommendations on how to minimize incarceration and reduce racial inequalities within community supervision.
CANDIDATE REQUIREMENTS
Required Skills: The scholar must be proficient in R and experience with Python is a plus. They should also have experience with web-scraping and database management for large, longitudinal datasets. Experience with data visualization is also essential, including graphical presentations of longitudinal data as well as experience working with and presenting spatial data.
We are also interested in linking administrative datasets, for example linking jail rosters to voter registration data. For this, the ability to automate data cleaning processes is highly desirable, for example by designing algorithms that match individuals across multiple arrest records even when their name is misspelled in a subset of observations.
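As a rough illustration of the kind of matching described above, the sketch below groups records whose names are nearly identical. It is a minimal example, not the project's approach: it uses Python's standard-library `difflib.SequenceMatcher` as a stand-in similarity measure, and the `name` and `dob` fields, the 0.85 threshold, and the sample records are all hypothetical.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity in [0, 1] between two names, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def link_records(records, threshold=0.85):
    """Group records that share a date of birth and have similar names.

    Each record is compared only with the first record of each existing
    group, which keeps the sketch short but is cruder than real record
    linkage methods."""
    groups = []
    for rec in records:
        for group in groups:
            rep = group[0]
            if (rec["dob"] == rep["dob"]
                    and name_similarity(rec["name"], rep["name"]) >= threshold):
                group.append(rec)
                break
        else:
            groups.append([rec])  # no similar group found, start a new one
    return groups


# Made-up records for illustration only.
records = [
    {"name": "Jonathan Smith", "dob": "1990-01-01"},
    {"name": "Jonathon Smith", "dob": "1990-01-01"},
    {"name": "Maria Garcia", "dob": "1985-06-30"},
]
groups = link_records(records)
```

The misspelled "Jonathon Smith" lands in the same group as "Jonathan Smith" because the names are close and the dates of birth agree, while the third record stays separate.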
Student Eligibility: Master’s, Senior
International Students on F1 or J1 Student Visa: Eligible
-
School: School of Engineering and Applied Science
Department: Civil Engineering & Engineering Mechanics
Average Hours per Week: Approximately 10
Stipend Amount: $3,000
Project Overview: Large Language Models (LLMs), such as GPT-3, GPT-4, and Llama 2, are algorithms trained on extensive datasets. In recent years they have exhibited exceptional zero-shot learning capabilities across numerous unlabeled tasks. Building on this notion, in-context learning involves conditioning LLMs on specific linguistic instructions or task demonstrations, subsequently enabling them to tackle analogous tasks through sequence predictions. In the field of Travel Mode Analysis, a significant volume of unlabeled data exists. Of particular interest are the unlabeled tweets generated by commuters, which offer insights into evolving travel patterns, especially in the context of events like a pandemic. By harnessing the strengths of LLMs and in-context learning, there is potential to extract valuable insights from unlabeled data.
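To make the in-context learning idea concrete, the sketch below assembles a few-shot prompt from an instruction and a handful of task demonstrations. It only builds the text; no model is called. The prompt layout, the travel-mode labels, and the example tweets are assumptions invented for illustration, not the project's actual setup.

```python
def build_prompt(instruction, demonstrations, query):
    """Combine an instruction, labeled demonstrations, and an unlabeled
    query into one prompt that an LLM would be asked to complete."""
    lines = [instruction, ""]
    for text, label in demonstrations:
        lines.append(f"Tweet: {text}")
        lines.append(f"Travel mode: {label}")
        lines.append("")
    lines.append(f"Tweet: {query}")
    lines.append("Travel mode:")  # the model's continuation is the prediction
    return "\n".join(lines)


# Made-up demonstrations and labels.
demonstrations = [
    ("Stuck on the platform again, trains delayed 20 minutes.", "subway"),
    ("Biked to the office today, the new lane is great.", "bicycle"),
]
prompt = build_prompt(
    "Classify the travel mode mentioned in each tweet.",
    demonstrations,
    "Traffic on the bridge was terrible this morning.",
)
```

With an empty list of demonstrations the same function yields a zero-shot prompt that contains only the instruction and the query.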
CANDIDATE REQUIREMENTS
Required Skills: Experience in coding in Python. Experience in NLP and PyTorch is preferred.
Student Eligibility: Master’s
International Students on F1 or J1 Student Visa: Eligible
-
School: Columbia Climate School
Department: Advanced Consortium on Cooperation, Conflict and Complexity (AC4)
Average Hours per Week: Approximately 10
Stipend Amount: $3,000
Project Overview: “Hate Speech” is a term used by peacebuilders, content moderators, policy-makers, and others, to label and categorize language, especially as it shows up in digital media. It is associated with inciting conflict and violence, and it may reflect the conditions of social relations among people across nations. Yet, while hate speech continues, so do other forms of speech that may reflect prosocial behaviors among people around the world as well. What are the properties of this “Peace Speech” that may lead to better outcomes and support continued and sustainable peaceful conditions in nations throughout the world?
Our interdisciplinary team in the Advanced Consortium on Cooperation, Conflict and Complexity (AC4) at the Columbia Climate School includes researchers in psychology, social psychology, environmental sustainability, natural resource governance, and applied anthropology. Together, our team is working to identify linguistic differences between peaceful and less peaceful societies, and the features of “Peace Speech”, that may reflect and support social processes underlying sustainably peaceful conditions. Using three databases, we have already identified individual words that machine learning models use to best classify nations as lower or higher peace. See, for example, https://arxiv.org/abs/2305.12537. We now want to cluster those words into topics to identify which topics are most important in differentiating lower and higher peace countries, so that we can gain insight into the social processes that those topics represent.
CANDIDATE REQUIREMENTS
Required Skills: Fluency in Python, natural language processing (cleaning text, NLTK, spaCy, Google’s BERT, Hugging Face XLNet), longitudinal analysis (time series), clustering analysis (k-means, word2vec, cosine similarity, ChatGPT), machine learning (logistic regression, random forest, XGBoost, support vector machines, neural networks, deep learning). The short-term goal is to identify the topics in news and social media that best classify lower and higher peace countries, topics such as governance, politics, international relations, work, everyday life activities, economics, arts, personal preferences, hobbies, etc. The longer-term goal is to use machine learning and AI to identify the social processes that underlie “Peace Speech”.
Student Eligibility: Master’s, Senior, Junior, Sophomore, Freshman
International Students on F1 or J1 Student Visa: Eligible
-
School: Climate School
Department: International Research Institute for Climate and Society (IRI) and Department of Earth and Environmental Sciences (DEES)
Average Hours per Week: Approximately 10
Funding: This is a grant funded project. Exact amount of funding will depend on hours completed.
Project Overview: Sub-seasonal-to-seasonal (S2S; weeks to months) time-scale predictions have great potential societal benefits, such as early warnings of heavy rains, droughts, and heat waves. Reliable forecasts a few weeks ahead can provide invaluable tools for routine planning in the agriculture, water resources, public health, and humanitarian aid sectors. However, the skill of current S2S forecasts made using large physics-based climate model ensembles is limited, partly because model simulations depart from reality. Calibration of climate-model forecast probabilities is necessary to account for model deficiencies and produce reliable forecasts. State-of-the-art calibration methods in IRI’s climate predictability tools are linear, which is a limitation because the atmospheric flow is inherently non-linear, and model errors often grow exponentially.
The goal of this project is to develop machine learning/artificial intelligence (ML/AI) forecast tools that enable non-linear bias correction to meet the growing demand for improved forecast products at S2S time scales. The intern will code and run test cases to compare the performance of different ML methods (e.g., Regression Trees, CNNs, deep learning) to improve Indian summer monsoon probabilistic forecast skill by bias-correcting/calibrating sets of S2S forecast ensembles from large physics-based climate models run at global climate forecasting centers (e.g., NCEP, ECMWF) and archived in the IRI Data Library.
CANDIDATE REQUIREMENTS
Required Skills: Fluency in Python coding and libraries, Jupyter Notebooks, and use of GitHub repos. Experience with using various ML methods (e.g., Regression Trees, CNNs, deep learning) is required. Experience with large climate data and model output datasets would be an advantage, but is not required.
Student Eligibility: Master’s, Senior, Junior
International Students on F1 or J1 Student Visa: Eligible
-
School: Medicine
Department: Pathology & Cell Biology
Average Hours per Week: Approximately 10
Funding: This is a grant funded project. Exact amount of funding will depend on hours completed.
Project Overview: The brain is the most complex organ in the body, composed of billions of neurons and trillions of connections between those neurons. Those connections are known as synapses and have been for many years the subject of intense study. What is less clear, however, is how synapses are organized at a population level throughout the brain. To start to address this, we developed a method that analyzes individual synapses using spatial and intensity metrics and scaled this approach to analyze hundreds of thousands of synapses concurrently. By doing so, we found that synapses fall into previously unknown, but functionally relevant, subpopulations. The student project, which is a collaboration between two groups (the Au lab in Pathology and Cell Biology and the Menon lab in Neurology), will be to help identify synaptic subpopulations under various experimental conditions and to analyze their spatial arrangement in the brain. This will help to reveal functional submotifs in the cortex and glean novel insights into cortical circuit organization.
CANDIDATE REQUIREMENTS
Required Skills: Fluency in Python is a must. Experience with machine learning, PyTorch, and Scanpy is preferred. Experience with multidimensional image analysis is ideal.
Student Eligibility: Master’s, Senior
International Students on F1 or J1 Student Visa: Not Eligible