DSI Scholars Projects
Spring 2025 Projects
An additional project has just been added for the DSI Scholars Spring 2025 Cohort
Spring 2025: Special Application – Project 14: Apply here
Deadline: Sunday, November 10, 2024 at 11:59 PM
Eligibility
Please note that in order to be considered for Spring 2025 projects, students must be enrolled for the Spring 2025 semester. Students who are graduating in December are not eligible for the Spring application.
Applying to More than One Project
Students are welcome to apply for up to 5 projects per term. You must submit a separate application for each project. If you submit more than 5 applications, we will randomly select 5 of your applications for submission.
Students interested in participating in this program are encouraged to review the list of available projects below. Before applying, please be sure to read the descriptions carefully to ensure you meet the eligibility requirements and prerequisites for each project.
For more information about the program, including the program benefits, application process and timeline, please visit the DSI Scholars Student Information Page.
Faculty interested in participating in Spring 2025 are encouraged to review the DSI Scholars Faculty Information page for details.
Important Dates:
- August 19, 2024: Faculty application opens
- September 15, 2024 (11:59 PM ET): Faculty applications are due
- October 4, 2024: Student application opens
- October 20, 2024 (11:59 PM ET): Student applications due
- October 28 – November 8, 2024 (expected): Student interviews
- November 22, 2024 (expected): Decision notification will be sent to all applicants by email
- January 2025 – May 2025: Get started on your Scholars research project (Approximate duration. Exact dates will depend on the faculty supervisor)
Spring 2025 Projects
-
School: Vagelos College of Physicians & Surgeons
Department: Psychiatry
Project Overview: Decoding behavioral signifiers for brain states and decisions can have far-reaching implications for understanding the neural basis of actions and identifying disease. We are using high-resolution video recordings of mice as they navigate mazes but have access to very few pre-determined behavioral signifiers. Computer vision can be used to extract a variety of previously unreachable aspects of behavioral analysis, including animal pose estimation and distinguishable internal states. These descriptions allow for the identification and characterization of behavioral dynamics, which determine decision making. Applying such computational approaches to mice during exploration, and in the context of behaviors that have been validated to measure choice and memory, can reveal dimensions of behavior that predict or even determine psychological constructs like vigilance, arousal, and memory. We are also obtaining neural signal data, which can be aligned with the behavioral signifiers.
DSI scholars would use pose estimation analysis to evaluate behavioral signifiers for choice and memory and relate them to our real-time concurrent measures of neural activity and transmitter release. The students would also have the opportunity to examine how disease models known to impair performance on our tasks affect any identified signifiers.
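As an illustrative sketch (not the lab's actual pipeline), aligning a pose-derived behavioral signifier with a neural signal can begin with a simple lagged cross-correlation; all data below are simulated:

```python
import numpy as np

def best_lag(behavior, neural, max_lag=50):
    """Find the lag (in frames) at which a behavioral signifier time series
    best correlates with a neural signal."""
    def corr_at(lag):
        if lag >= 0:
            b, n = behavior[lag:], neural[:len(neural) - lag]
        else:
            b, n = behavior[:lag], neural[-lag:]
        return np.corrcoef(b, n)[0, 1]
    return max(range(-max_lag, max_lag + 1), key=corr_at)

# Simulated signals: the "behavior" trails the "neural" signal by 7 frames
rng = np.random.default_rng(5)
neural = rng.normal(size=1000)
behavior = np.roll(neural, 7) + 0.1 * rng.normal(size=1000)
```

On this simulated pair, `best_lag(behavior, neural)` should recover a lag of 7 frames; on real data the recovered lag would suggest how far the behavioral signifier leads or follows the neural measure.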
CANDIDATE REQUIREMENTS
Required Skills: MATLAB, Python, familiarity with statistics
Student Eligibility: Master’s, Senior, Junior, Sophomore, Freshman
International Students on F1 or J1 Student Visa: Yes, eligible
-
School: Fu Foundation School of Engineering and Applied Science
Department: Earth and Environmental Engineering
Project Overview: The Western United States is facing intensifying regional droughts and escalating wildfire risks, both of which are projected to worsen under climate change. In response, cloud seeding has gained renewed interest as a potential tool for augmenting water supplies and mitigating wildfire risks. Currently, ten states and provinces in Western North America operate cloud seeding programs. However, the overall efficacy of cloud seeding remains contentious, largely due to the challenges of distinguishing its effects from natural meteorological variability and due to concerns about the effects of pollutants on human populations. Moreover, operational strategies for cloud seeding are hampered by limitations in our fundamental understanding of cloud microphysics and the difficulty of simulating these processes under realistic atmospheric conditions.
Since 1972, the Weather Modification Reporting Act has mandated the documentation of weather modification activities in the United States. This project aims to compile and analyze these historical records to provide a comprehensive overview of weather modification efforts over the past five decades. The study will utilize large language models (LLMs) to extract and synthesize key information from reports, creating a unique dataset that tracks the prevalence and context of weather modification technologies. This dataset will be cross-referenced with historical climate data to examine the meteorological conditions under which cloud seeding has been deployed, offering insights into its potential efficacy.
To further refine the analysis, an Invariant Causal Prediction Framework will be employed to identify consistent patterns in the use of weather modification technology in relation to climatic drivers. By integrating historical records, climate data, and causal inference methods, this project will provide a nuanced understanding of the role weather modification has played in managing water resources and mitigating climate risks in the Western United States.
DSI Scholar Responsibilities include:
1. Compile historical records on the usage of Weather Modification in the US since 1972.
2. Use LLMs to synthesize information from 1026 past historical records of Weather Modification usage into a dataset that captures the locations, dates, materials, and purposes of Weather Modification activities.
3. Cross-reference locations and dates with historical climate data sets to understand the context under which Weather Modification has previously been used.
4. Investigate an Invariant Causal Prediction framework to identify consistent patterns in the use of Weather Modification Technology.
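As a rough illustration of the invariant-causal-prediction idea behind step 4 (using made-up data, not project data): a predictor with a truly causal, invariant relationship to the outcome should show the same regression slope in every environment:

```python
import numpy as np

def slope_by_environment(envs):
    """Least-squares slope of y on x within each environment. Under invariant
    causal prediction, a causal predictor yields the same slope everywhere."""
    return {name: np.polyfit(x, y, 1)[0] for name, (x, y) in envs.items()}

rng = np.random.default_rng(0)
# Toy data: the same mechanism y = 2x + noise operates in both "environments"
x1 = rng.normal(size=500)
x2 = rng.normal(loc=1.0, size=500)
envs = {
    "dry_years": (x1, 2.0 * x1 + rng.normal(scale=0.1, size=500)),
    "wet_years": (x2, 2.0 * x2 + rng.normal(scale=0.1, size=500)),
}
slopes = slope_by_environment(envs)
```

A predictor whose slope shifts between environments would fail the invariance check and be flagged as non-causal; the actual framework adds formal statistical tests on top of this comparison.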
CANDIDATE REQUIREMENTS
Required Skills:
– Fluency in Python or R (preference for python)
– Experience with LLM’s
Student Eligibility: Master’s, Senior, Junior
International Students on F1 or J1 Student Visa: Yes, eligible
-
School: Fu Foundation School of Engineering and Applied Science
Department: Department of Industrial Engineering and Operations Research
Project Overview: The objective of this project is to develop a comprehensive framework for preemptively assessing the safety of self-driving cars in a new urban environment prior to their deployment. The project will start with database construction, processing heterogeneous raw data ranging from police reports in natural language to street views in satellite images. Building on that, we will develop innovative transfer learning methods for the evaluation of existing driving algorithms, and construct a traffic simulator to analyze future algorithms. Leveraging counterfactual analysis, we aim to inform the regulatory decisions surrounding the introduction of self-driving cars. Additionally, we will explore post-entry safety assessment mechanisms for ongoing monitoring and improvement.
The DSI Scholar will:
– Integrate multiple datasets from different sources into a database for convenient querying.
– Preprocess data using large language models and image recognition techniques.
– Build risk models to estimate the accident rate of autonomous driving vehicles in several environments.
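A minimal sketch of the kind of environment-specific risk estimate described above (all numbers hypothetical):

```python
# Hypothetical accident counts and exposure (miles driven) per environment
data = {
    "urban_dense": {"accidents": 12, "miles": 1_200_000},
    "suburban":    {"accidents": 4,  "miles": 2_000_000},
}

def rate_per_million_miles(d):
    """Accident rate normalized by exposure, the basic quantity a risk model
    would estimate (and then adjust for environment covariates)."""
    return {env: v["accidents"] / (v["miles"] / 1e6) for env, v in d.items()}

rates = rate_per_million_miles(data)
```

The actual project would go well beyond raw rates, e.g. modeling counts with covariates and transferring estimates to a new city, but exposure-normalized rates are the starting point.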
CANDIDATE REQUIREMENTS
Required Skills: Proficiency in Python, especially its common machine learning libraries. Experience with natural language processing.
Student Eligibility: Master’s, Senior, Junior
International Students on F1 or J1 Student Visa: Yes, eligible
-
School: Columbia Business School
Department: Finance
Project Overview: This project aims to explore the implications of U.S. labor mobility on employers’ decisions regarding health insurance plans for their employees. According to Census data, employers cover over 86% of insurance plans for private employees in the U.S. These employer-based insurance plans are vital for the well-being of employees and their families, yet they also represent a significant operational cost for companies. As labor mobility and job turnover increase, employers may face reduced incentives to offer robust risk-sharing in health plans, as the benefits of investing in healthier workers can be easily lost to competitor firms through poaching. This dynamic is contributing to the rise in high-deductible health plans, which shift more cost and risk onto employees.
The project leverages a comprehensive regulatory index, which I have already collected and constructed, detailing how each state regulates and enforces non-compete agreements in labor contracts. This index will help us understand how changes in state regulations, which influence labor mobility, ultimately affect the health insurance coverage provided by employers.
The project utilizes raw data from IRS Form 5500s, which provides insights into firms’ insurance choice decisions, and a proprietary commercial insurance claim dataset covering more than 40 million enrollees in employer-based plans, offering granular data on individual enrollment and health expenditures. Our goal is to establish causal evidence that links changes in labor mobility, induced by state policies, to firms’ insurance supply decisions and individual medical utilization. To achieve this, we will employ statistical methods such as econometrics and causal inference. Additionally, the DSI Scholar will be tasked with applying Natural Language Processing (NLP) algorithms to process and analyze the raw Form 5500 data, ensuring that we extract meaningful insights to inform our study.
The DSI Scholar will play a crucial role in the successful execution of this project, focusing on data processing, analysis, and methodological application. Their responsibilities will include:
1. Pre-processing the IRS data
a) Utilize Natural Language Processing (NLP) techniques to extract and structure relevant information from raw IRS Form 5500 data. This will involve parsing text data to identify and categorize insurance plan details and related variables.
b) Link the cleaned Form 5500 data to external databases such as S&P Compustat
2. Analysis of Commercial Insurance Plan Data:
a) Identify the insurance plans utilized by individual enrollees in the commercial insurance claim dataset. Explore the features of these insurance plans, such as deductibles, co-pays, and coverage options, and investigate any possible plan switches among enrollees over time.
b) Analyze medical expenditures for inpatient and outpatient visits, identifying trends and patterns that may inform the broader study on labor mobility and insurance choices.
3. Data Analysis
a) Conduct exploratory data analysis to uncover initial patterns, trends, and insights within the datasets. The DSI Scholar will generate descriptive statistics and visualizations to provide a clear understanding of the data landscape.
b) Support the application of econometric models and causal inference techniques to assess the impact of labor mobility on employer health insurance decisions. This may include running regression analyses, propensity score matching, or instrumental variable approaches under supervision.
4. Documentation and Reporting
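For illustration, one of the matching techniques mentioned above, nearest-neighbor propensity score matching, can be sketched as follows (scores are made up):

```python
import numpy as np

def nearest_neighbor_match(ps_treated, ps_control):
    """For each treated unit, return the index of the control unit with the
    closest propensity score (1-NN matching with replacement)."""
    ps_control = np.asarray(ps_control)
    return [int(np.argmin(np.abs(ps_control - p))) for p in ps_treated]

treated_scores = [0.8, 0.3]           # hypothetical propensity scores
control_scores = [0.25, 0.5, 0.82]
matches = nearest_neighbor_match(treated_scores, control_scores)
```

Here the treated unit with score 0.8 is matched to the control with score 0.82, and 0.3 to 0.25; production work would add calipers, balance diagnostics, and standard errors that account for the matching step.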
CANDIDATE REQUIREMENTS
Required Skills:
– Proficiency in Python for data processing and analysis with skills in Natural Language Processing (NLP) for text data extraction and analysis.
– Strong foundation in econometrics and causal inference techniques with experience with longitudinal data analysis. Proficiency in at least one of the following statistical software: Stata, R, and SAS.
– Effective communication skills for documenting processes and presenting results.
Student Eligibility: Master’s, Senior, Junior, Sophomore
International Students on F1 or J1 Student Visa: Yes, eligible
-
School: Vagelos College of Physicians & Surgeons
Department: Systems Biology
Project Overview: Background: We have recently developed ‘SCRuB’, a machine learning model that removes contamination from microbiome samples by analyzing datasets of collected DNA to infer their true microbial components (Austin et al., Nat Biotechnology 2023). We showed that this method, through a unique expectation maximization framework, improves the power of microbiome research, allowing for stronger clinical applications ranging from cancer to preterm birth. Despite SCRuB’s success, we know there is room for further improvement.
Project: The aim of this project is to extend our existing SCRuB method by incorporating even more biological structures into its expectation maximization model. While the original method effectively incorporates microbiome compositions, the method would improve by developing statistical frameworks that would allow it to utilize other biological data commonly available in microbiome research.
This project will be conducted in three steps:
1. designing the machine learning methodology that allows SCRuB to effectively use additional biological data points;
2. implementing the software using a programming language of your choice;
3. evaluating how your implementation can strengthen the power of a microbiome analysis.
All steps will involve close collaboration with members of the lab. Upon the successful culmination of the project, the student will be encouraged to publish their findings as a peer-reviewed manuscript as well as present at a scientific conference.
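To illustrate the expectation-maximization structure that SCRuB's model builds on (this toy example is not the SCRuB algorithm itself), here is EM for a simple two-component Gaussian mixture:

```python
import numpy as np

def em_two_gaussians(x, n_iter=50):
    """Toy expectation-maximization for a two-component, unit-variance
    Gaussian mixture: alternate computing posterior responsibilities (E-step)
    with re-estimating means and mixing weight (M-step)."""
    mu = np.array([x.min(), x.max()])       # crude initialization
    pi = 0.5                                # mixing weight of component 1
    for _ in range(n_iter):
        # E-step: responsibility of component 1 for each data point
        l0 = (1 - pi) * np.exp(-0.5 * (x - mu[0]) ** 2)
        l1 = pi * np.exp(-0.5 * (x - mu[1]) ** 2)
        r = l1 / (l0 + l1)
        # M-step: responsibility-weighted updates
        mu = np.array([np.average(x, weights=1 - r), np.average(x, weights=r)])
        pi = r.mean()
    return mu, pi

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])
mu, pi = em_two_gaussians(x)
```

SCRuB replaces the Gaussian components with models of microbial composition and contamination, but the same E-step/M-step alternation drives the inference.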
CANDIDATE REQUIREMENTS
Required Skills: Preferred backgrounds would include: Data Science; Machine Learning; Expectation Maximization; Python, R or other languages used for data science. Experience with biology is not a strict requirement.
Student Eligibility: Master’s, Senior, Junior
International Students on F1 or J1 Student Visa: Yes, eligible
-
School: CUIMC
Department: Psychiatry
Project Overview: Cognitive flexibility is an executive function that is necessary to flexibly adapt previously learned behaviors to changing environmental demands. This cognitive function enables an individual to “look at things differently” and to adapt to one’s environment, instead of engaging in perseverative thinking that can lead to rumination and mental rigidity. However, the neurobiological mechanisms underlying cognitive flexibility in healthy and disease-relevant conditions are largely unknown.
The goal of this project is to employ state-of-the-art machine learning analysis of mouse neuroimaging and behavior data to understand cellular and neural circuit mechanisms regulating cognitive flexibility. In our experiments, real-time single cell neural activity data was recorded with head-mounted miniature microscopes from a large population of neurons in freely moving mice while they were trained to perform a complex decision-making task. In this task, mice had to learn that a set of features (odor, texture, and location) was associated with a hidden food reward. Upon learning the initial feature-reward association over 30 trials, the reward predicting features were changed and the mice had to learn that a different set of features was now associated with reward. Using machine learning techniques, some of which were developed in our lab, we want to understand how neural representations of feature-reward associations emerge in the brain and how the dynamic evolution of these representations during trial-and-error experience impacts decision-making behavior.
This exciting data science project utilizes highly innovative in vivo Ca2+ imaging data sets of neural activity from freely behaving mice with and without in vivo neural circuit manipulations, providing students the opportunity to apply computational analysis techniques to gain unprecedented insight into how the brain controls behavior.
The student will work closely with other Ph.D. students and postdocs who will provide hands-on training, and will be mentored by the PI through regular meetings. The main analysis techniques include Representational Similarity Analysis (RSA) to determine how the brain represents information by comparing the similarity of neural response patterns across different stimuli or conditions. Representational Evolution Analysis (REA) utilizes support vector machines (SVMs) and linear classifiers to determine trial-based neural and behavioral response patterns, followed by cosine similarity analyses to determine changes in the neural coding axis over the course of learning and reversal learning. All analysis pipelines and scripts are available in the lab for the student to use.
The complexity of the data set will provide ample opportunities for the student to learn, develop, and apply different types of computational analyses. We have clear hypotheses for the student to test with our established analysis pipeline. In addition, the complexity of the data also provides opportunities for the student to develop their own new questions to ask from the data set. The data set is clearly defined so the student can get started on the analyses without delay. The student will acquire mentorship from a lab with leading experience in analyzing scientific data and a successful track record in supervising students.
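As a conceptual sketch of comparing neural coding axes with cosine similarity (the lab's REA pipeline uses SVM weight vectors; here a class-mean difference on simulated data serves as a stand-in):

```python
import numpy as np

def coding_axis(X, y):
    """Difference of class-mean population activity: a simple estimate of the
    neural axis separating two task conditions. X is trials x neurons, y is
    a vector of +1/-1 condition labels."""
    d = X[y == 1].mean(axis=0) - X[y == -1].mean(axis=0)
    return d / np.linalg.norm(d)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(2)
n_trials, n_neurons = 200, 50
axis_true = rng.normal(size=n_neurons)
axis_true /= np.linalg.norm(axis_true)
y = np.where(rng.random(n_trials) < 0.5, 1, -1)
# Simulated activity in which the condition is encoded along axis_true
X = np.outer(y, axis_true) + 0.2 * rng.normal(size=(n_trials, n_neurons))

# Axes estimated from "early" and "late" trials should align when the
# representation is stable; a drop in cosine similarity signals reorganization
w_early = coding_axis(X[:100], y[:100])
w_late = coding_axis(X[100:], y[100:])
sim = cosine(w_early, w_late)
```

In the real REA analysis, tracking this similarity across learning and reversal learning reveals how (and how fast) the coding axis rotates.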
CANDIDATE REQUIREMENTS
Required Skills:
Some experience with Python programming, especially with the sklearn library, would be beneficial. Linear classifiers and representational similarity analysis (RSA) are the main tools we use, and we have a pipeline for a new analysis that was developed in our lab, which we named “Representational Evolution Analysis (REA)”. This analysis leverages support vector machines (SVMs) and linear classifiers to determine how neural representations dynamically evolve as a function of learning, and how they flexibly adapt during reversal learning/cognitive flexibility. Training on this new analysis pipeline will be provided through hands-on training by the student mentor.
Student Eligibility: Master’s, Senior
International Students on F1 or J1 Student Visa: Yes, eligible
-
School: Arts & Sciences
Department: Department of Earth & Environmental Sciences
Project Overview: Climate and oceanographic observations provide us with a valuable view of a changing world, yet they are limited to little more than a single human lifespan. In order to consider these observations in the broader context of the passage of time, proxy data from a range of archives can be utilized. These data have provided powerful evidence of abrupt climate changes in the past, and implicate the important role of the ocean in these changes. Although proxy data are immensely useful and remain the only way to assess natural variability in the climate system, they are often scattered in space and discontinuous in time, presenting a barrier to their full utilization.
This project involves the compilation and visualization of climatic and oceanographic datasets from the last time the Earth was as warm as today. These data represent important characteristics and processes, including sea-surface temperature, continental and sea-ice, ocean currents, and deep ocean carbon storage, initially recorded in deep-sea sediments and subsequently analyzed in paleoclimate laboratories around the world. The project will involve compiling these data from online repositories and other sources, and then using interpolation schemes to generate a series of visualizations in the form of maps and cross sections during different intervals (“time-slices”) through the past warm interval that will render the existing information more accessible to climate scientists, oceanographers, policy-makers and the general public.
The individual visualizations will be useful as stand-alone time-slices through the progression of a warm climatic interval that was analogous to the modern, but without human intervention. The sequence of visualizations may also be combined into video animations that portray the previous natural evolution of the ocean and climate during a warm interval that can be compared directly to ongoing changes.
With guidance from the PI, the DSI scholar will initially compile the data from online sources. They will then assemble them spatially and temporally in order to generate maps and oceanographic cross sections. These visualizations will require the development and application of interpolation schemes to turn the scattered data into continuous views that provide a state-of-the-art estimate of oceanographic and climatic conditions from each of ten intervals of time from the previous warm interval. This is likely the main and central accomplishment of the project, although additional steps may include generating animated visualizations with interpolations through time as well as space, and the comparison of maps and ocean sections to the modern equivalents in order to evaluate the anomalies associated with human influence on the climate system. Through the course of the project, the DSI scholar will have the opportunity to interact with other members of our research group, including undergraduate and graduate students, and will have the option to spend time at the Lamont-Doherty Earth Observatory campus.
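As one simple example of the kind of interpolation scheme the scholar might develop (inverse distance weighting, shown here with toy proxy data):

```python
import numpy as np

def idw_grid(lons, lats, vals, grid_lon, grid_lat, power=2.0):
    """Inverse-distance-weighted interpolation of scattered proxy values onto
    a regular grid -- one simple scheme among the many the project could use."""
    glon, glat = np.meshgrid(grid_lon, grid_lat)
    out = np.empty(glon.shape)
    for i in np.ndindex(glon.shape):
        d = np.hypot(lons - glon[i], lats - glat[i])
        if d.min() < 1e-9:              # grid node coincides with a data point
            out[i] = vals[np.argmin(d)]
        else:
            w = 1.0 / d ** power
            out[i] = np.sum(w * vals) / np.sum(w)
    return out

# Two toy sea-surface-temperature proxy points at (0, 0) and (10, 0) degrees
lons, lats = np.array([0.0, 10.0]), np.array([0.0, 0.0])
vals = np.array([20.0, 30.0])
grid = idw_grid(lons, lats, vals, np.array([5.0]), np.array([0.0]))
```

A grid node midway between the two equal-weight points interpolates to 25 °C; real paleoceanographic work would also need distance metrics on the sphere and schemes that respect ocean basins and water-mass boundaries.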
CANDIDATE REQUIREMENTS
Required Skills: Fluency in Python or data analysis packages such as MATLAB will be helpful, although not required. Similarly, experience with data mining techniques may be advantageous, but will not be necessary.
Student Eligibility: Master’s, Senior, Junior, Sophomore
International Students on F1 or J1 Student Visa: Yes, eligible
-
School: Vagelos College of Physicians & Surgeons
Department: Radiation Oncology
Project Overview: Radiotherapy is a cornerstone of cancer treatment that utilizes ionizing radiation to destroy malignant cells. By accurately delineating, or segmenting, the tumor target and surrounding organs-at-risk (OARs), the treatment planning process guides the treatment machine to deliver a precise radiation dose to the tumor while sparing healthy surrounding tissues to minimize side effects. Despite the advances of end-to-end deep learning models in automated medical image segmentation, the inherent challenges of cone beam computed tomography (CBCT), such as low soft-tissue contrast and limited image quality, mean that current fully automated segmentation methods usually fail to consistently achieve results satisfactory for clinical use. As a result, their outputs may require significant manual adjustments, which has become a bottleneck in time-sensitive practices such as online adaptive radiotherapy (oART).
The main purpose of this DSI scholar project is to develop an artificial intelligence (AI)-driven interactive tool for image segmentation in adaptive radiotherapy using visual-prompted foundation models and reinforcement learning. This tool will be developed through two specific aims: 1) Development of a web-based interface: Create a user-friendly web interface that accepts user inputs – such as clicks, scribbles and bounding boxes – to guide the interactive segmentation process. The segmentation will be powered by visual prompt-based foundation models that are adapted for CT images. 2) Optimization of interactive contour refinement: Optimize the dynamic process of contour refinement through reinforcement learning, aiming to achieve the desired segmentation with the fewest possible iterations.
We have identified the roadmap to expand our web-based automated image segmentation system to an interactive tool. Our desired end goal is that this interactive tool can significantly shorten oART treatment time. This will reduce the risk of patient movement during treatment, offering potentially more effective treatment options for cancer patients.
The DSI Scholar will:
1. Implement a web-based interface that takes user’s inputs (prompt) to guide interactive segmentation.
2. Finetune Segment Anything Model (SAM) 2 using LoRA for 3D CT images, and integrate it with the interface developed in step 1.
3. Assist in investigating approaches to optimize the dynamic process of contour refinement with the initial results obtained from automated segmentation algorithms for efficient and effective contour refinement.
4. Present results to the group and prepare for potential publication or further development.
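For orientation, the LoRA idea in responsibility 2 can be sketched conceptually: the pretrained weight matrix stays frozen while a small low-rank update is trained. This NumPy sketch is illustrative only, not SAM 2 code:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Forward pass through a LoRA-adapted linear layer: the frozen weight W
    is augmented by a trainable low-rank update B @ A, so only A and B (far
    fewer parameters than W) are finetuned."""
    return x @ (W + alpha * (B @ A)).T

rng = np.random.default_rng(3)
d_in, d_out, r = 16, 8, 2                 # rank r much smaller than d_in, d_out
W = rng.normal(size=(d_out, d_in))        # frozen pretrained weights
A = 0.01 * rng.normal(size=(r, d_in))     # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection, zero-init
x = rng.normal(size=(1, d_in))
# With B initialized to zero, the adapted layer equals the original layer
assert np.allclose(lora_forward(x, W, A, B), x @ W.T)
```

The zero initialization of B means finetuning starts exactly at the pretrained model, and only A and B (here 48 parameters versus 128 in W) receive gradients; in practice this is applied to the attention projections of SAM 2 via a library such as PEFT.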
CANDIDATE REQUIREMENTS
Required Skills:
1. Familiarity with web programming using JavaScript/HTML/CSS and WebGL
2. Fluency in Python and PyTorch
3. Experience with medical image analysis using packages such as ITK and MONAI
4. Experience with reinforcement learning is desired
Student Eligibility: Master’s, Senior
International Students on F1 or J1 Student Visa: Yes, eligible
-
School: Columbia Climate School
Department: Lamont Doherty Earth Observatory
Project Overview: Phytoplankton are tiny photosynthetic organisms that live in the sunlit areas of oceans and freshwater bodies. They play a crucial role in converting CO2 dissolved in water into organic compounds that sustain nearly all marine life, while producing over half of the oxygen in our atmosphere. Due to their ability to fix CO2, phytoplankton are vital for understanding carbon sequestration, climate regulation, and supporting fisheries. With around 5,000 species, studying them is essential for monitoring the health of aquatic ecosystems and life on Earth.
Traditionally, microscopy has been used to study phytoplankton, but it is slow, costly, and labor-intensive. While newer imaging technologies have sped up this process, they still require manual handling and expert classification. At Lamont-Doherty Earth Observatory, we modified a commercially available imaging system to automate the imaging of particles and plankton in water samples. This system can continuously capture phytoplankton images while a ship is moving, allowing data collection across large areas and over time. In the last two years, we have field-tested this system, amassing millions of images from oceans, coastal areas, and rivers.
However, the slow manual classification process is still a challenge. Our goal is to overcome this by developing a Computer-Assisted Automated Phytoplankton Classification System (CAPCS) using advanced computer vision and deep learning techniques. This will enable rapid, accurate identification of phytoplankton species based on unique features, transforming data collection.
This innovation is critical for NASA’s hyperspectral ocean color sensors, like PACE, EMIT, and GLIMR, which aim to detect major phytoplankton groups from space. Overcoming these challenges will revolutionize the sciences of water quality, marine pollution, climate change, and fisheries, meeting the growing demand for high-resolution data from both field and satellite observations.
DSI Scholar Responsibilities
1. Develop AI Models:
– Design and implement deep learning models for phytoplankton image classification.
– Apply computer vision techniques to improve accuracy and efficiency.
2. Data Management:
– Clean and preprocess large phytoplankton datasets.
– Use data augmentation to enhance model robustness.
3. Optimize Algorithms:
– Test and refine AI algorithms to address limitations and improve performance.
– Stay updated with advancements in AI and machine learning.
4. Interdisciplinary Collaboration:
– Work with Goes and other researchers to integrate AI with ecological and environmental sciences.
– Bridge computer science, statistics, and environmental science in research efforts.
5. Evaluate Models:
– Assess model performance through rigorous validation and cross-validation.
– Ensure accuracy and robustness of AI solutions.
6. Documentation and Reporting:
– Document methodologies and results thoroughly.
– Prepare reports and presentations on the progress of the work.
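As a small example of the augmentation step in responsibility 2 (label-preserving geometric transforms, sketched with a toy array in place of a real plankton image):

```python
import numpy as np

def augment(img):
    """Label-preserving geometric augmentations for plankton images: flips and
    90-degree rotations (orientation in the water column is arbitrary, so the
    species label is unchanged)."""
    return [img, np.fliplr(img), np.flipud(img),
            np.rot90(img, 1), np.rot90(img, 2), np.rot90(img, 3)]

img = np.arange(16).reshape(4, 4)   # stand-in for a grayscale plankton image
views = augment(img)                # six training views from one image
```

Multiplying each labeled image into several geometric views is one of the cheapest ways to improve robustness of the classifier to the arbitrary orientations in which plankton are imaged.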
CANDIDATE REQUIREMENTS
Required Skills:
1. Machine Learning & Deep Learning:
– Proficiency in implementing machine learning algorithms, especially Convolutional Neural Networks (CNNs), and advanced deep learning methods like Recurrent Neural Networks (RNNs) and Transformers.
2. Computer Vision:
– Strong understanding of image processing, object detection, and segmentation for analyzing phytoplankton and microplastic images.
3. Feature Selection & Dimensionality Reduction:
– Knowledge of methods to manage and optimize high-dimensional data.
4. Statistical Analysis:
– Foundation in statistical methods, including spatial statistics, for robust data interpretation.
5. Programming Skills:
– Proficiency in Python or R
6. Model Evaluation & Optimization:
– Skills in evaluating and optimizing machine learning models for enhanced performance.
Student Eligibility: Master’s, Senior
International Students on F1 or J1 Student Visa: Yes, eligible
-
School: Columbia Climate School
Department: Lamont Doherty Earth Observatory
Funding Note: This is a grant funded project. Exact amount of funding will depend on hours completed.
Project Overview: Ocean color remote sensing has long been used to map phytoplankton functional types (PFTs) in the upper ocean, traditionally relying on the ratios of photosynthetic pigments chlorophyll-a and accessory (non-photosynthetic) pigments like chlorophyll-b and carotenoids. However, these methods often fall short in distinguishing complex PFT compositions due to overlapping pigment absorption peaks and the limited spectral resolution of traditional multi-spectral ocean color sensors.
The advent of hyperspectral remote sensing, notably through NASA’s PACE mission and the upcoming GLIMR mission, offers continuous spectral coverage from the ultraviolet to near-infrared wavelengths, significantly enhancing the ability to differentiate between various phytoplankton pigments. Hyperspectral data capture detailed spectral features, which are critical for accurate pigment identification and PFT classification that is important for fisheries, carbon sequestration and climate change studies.
Recent advancements incorporate Artificial Intelligence (AI) techniques such as Linear spectral unmixing, Independent Component Analysis, Gaussian Mixture Models, Finite Mixture of Skewed Components (FMSC), etc., to overcome limitations of traditional algorithms. Traditional methods decompose pigment absorption spectra into Gaussian components, but these often face challenges with overlapping absorption peaks and limited spectral resolution. The FMSC algorithm, however, encodes spectral shapes in a finite metric space, providing a more nuanced representation of spectral data and improving the accuracy of pigment retrieval.
This study will utilize HPLC pigment data obtained from the field together with hyperspectral ocean color data to:
1. Develop AI and other complex statistical methods to improve the accuracy of distinguishing between complex mixtures of pigments.
2. Use field pigment datasets to evaluate the performance of various algorithms against conventional spectral decomposition techniques.
3. Apply the algorithms developed to satellite data for improved global monitoring and analysis of PFTs from space.
The Scholar will:
1. Data Acquisition: Assist in obtaining and managing field hyperspectral optical data and HPLC pigment data from the NASA SEABASS database.
2. Data Cleaning and Preprocessing: Prepare hyperspectral datasets for analysis by removing noise and normalizing data.
3. Algorithm Development and Implementation: AI Algorithm Development: Implement and train AI models for pigment retrieval using hyperspectral data. This involves coding, testing, and optimizing machine learning algorithms.
4. Algorithm Integration, Analysis and Validation: Apply and refine various statistical methods on hyperspectral data sets for accurate pigment and PFT identification. Analyze spectral data to extract relevant features and validate the accuracy of the FMSC and AI algorithms by comparing the performance of the various AI approaches with traditional Gaussian curve-fitting methods.
5. Data Interpretation and Reporting: Translate algorithmic pigment outputs into meaningful insights about phytoplankton communities and their spatial distributions.
6. Data Management and Documentation: Refine code and prepare a detailed workflow for testing by other ocean color scientists. Prepare reports, be willing to give a presentation at a NASA meeting, and contribute to resulting publications.
7. Satellite Application: Apply the algorithms to satellite fields of hyperspectral ocean color data from PACE to generate regional and global maps of PFTs.
CANDIDATE REQUIREMENTS
Required Skills:
– Fluency in R and/or Python, and experience working with large data files, in particular NetCDF-format files.
– Capable of querying databases and extracting and pairing datasets for algorithm development and performance evaluation.
– Knowledge of AI-based statistical approaches for extracting pigment information from hyperspectral datasets.
– Experience using such algorithms to map phytoplankton functional types from satellite data.
Student Eligibility: Master’s, Senior
International Students on F1 or J1 Student Visa: Yes, eligible
-
School: Fu Foundation School of Engineering and Applied Science
Department: Computer Science
Project Overview: Existing quantum platforms, such as IBM’s Qiskit, allow off-site users to access the hardware. Long-term, we are interested in using minor architecture discrepancies to identify the specific machine a computation has been performed on, much like fingerprinting or PUFs (physically unclonable functions) for classical systems. More specifically, current quantum hardware requires high levels of error correction to maintain the states of a computation. Our approach is to pick simple computations, run them on various machines, and observe statistical differences in the syndromes used to indicate how to perform error correction. Currently, we are using support vector machines to perform the inference, but we would like to consider alternative classification strategies.
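A minimal sketch of the classification step described above, using synthetic syndrome-frequency features in place of real hardware data; the two devices, their error rates, and the feature construction are all hypothetical stand-ins.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

def syndrome_features(p_flip, n_runs=200, n_checks=5):
    """Per-run syndrome-frequency vectors for a device with error rate p_flip.

    Each of n_checks stabilizer checks fires as a binomial count over 50 shots,
    normalized to a frequency -- a crude stand-in for real syndrome statistics.
    """
    return rng.binomial(50, p_flip, size=(n_runs, n_checks)) / 50.0

# Two hypothetical devices with slightly different error statistics
X = np.vstack([syndrome_features(0.05), syndrome_features(0.08)])
y = np.array([0] * 200 + [1] * 200)       # which device produced each run

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)   # the SVM inference step
acc = clf.score(X_te, y_te)
```

Swapping `SVC` for another scikit-learn classifier (decision trees, small neural nets) is a one-line change, which is one way to explore the alternative classification strategies mentioned above.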
CANDIDATE REQUIREMENTS
Required Skills: Familiarity with various ML methods (SVMs, Decision trees, neural nets, transformers) and/or familiarity with interfaces to systems/packages that can apply these methods to collect results.
Student Eligibility: Master’s, Senior, Junior
International Students on F1 or J1 Student Visa: Yes, eligible
-
School: Arts & Sciences
Department: Earth and Environmental Sciences
Project Overview: The ocean carbon sink absorbs roughly 25% of annual anthropogenic CO2 emissions. To understand past changes, diagnose ongoing changes, and predict the future behavior of the ocean carbon sink, we must understand its spatial and temporal variability. In particular, estimates of air-sea CO2 fluxes across the globe are needed to monitor year-to-year changes in this key climate service. However, the ocean is poorly sampled, and the sparsity of measurements in space and time makes estimating such fluxes challenging. In the McKinley group, we have developed several Machine Learning (ML) techniques to reconstruct the ocean carbon field based on its association with satellite-based full-field driver data. These machine learning algorithms interpolate sparse surface ocean pCO2 observations to global coverage.
Understanding the value of different data sources to these ML algorithms is an active area of ML research. The spatio-temporal nature of the observed data makes it difficult to understand the impact of specific observations on the performance of the ML estimation. This DSI Scholar will develop approaches to quantify the contribution of individual pCO2 observations to ML interpolation algorithms using Explainable ML methods.
More specifically, with the Data Shapley framework (Ghorbani and Zou, 2019), we plan to assign a specific value, or score, to each data point in the available database. We will also quantitatively evaluate how alternative sampling patterns would change algorithmic skill. To do this, we will use a multi-model, multi-ensemble ‘testbed’, as we have in a range of previous studies.
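A toy sketch of the Monte Carlo flavor of Data Shapley (Ghorbani and Zou, 2019): each training point's value is its average marginal contribution to validation skill over random permutations. The dataset, model, and chance-level baseline below are illustrative stand-ins for the pCO2 setting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Tiny synthetic dataset standing in for the observational database
X_train = rng.normal(size=(20, 2)) + np.repeat([[0, 0], [2, 2]], 10, axis=0)
y_train = np.repeat([0, 1], 10)
X_val = rng.normal(size=(40, 2)) + np.repeat([[0, 0], [2, 2]], 20, axis=0)
y_val = np.repeat([0, 1], 20)

def utility(idx):
    """Validation accuracy of a model trained on the subset idx."""
    if len(set(y_train[idx])) < 2:         # need both classes to fit
        return 0.5                         # chance-level baseline
    model = LogisticRegression().fit(X_train[idx], y_train[idx])
    return model.score(X_val, y_val)

# Monte Carlo estimate of Data Shapley values
n = len(y_train)
shapley = np.zeros(n)
n_perms = 50
for _ in range(n_perms):
    perm = rng.permutation(n)
    prev = 0.5                             # utility of the empty set
    for k in range(n):
        cur = utility(perm[: k + 1])
        shapley[perm[k]] += cur - prev     # marginal contribution
        prev = cur
shapley /= n_perms
```

By construction the values sum to the full-dataset skill minus the baseline, so high-value points are the ones the interpolation most depends on, which is the quantity of interest for evaluating alternative sampling patterns.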
In Spring 2025, the DSI Scholar will begin by learning about the methods and data needed for this project, such as the pCO2-Residual product (Bennington et al. 2022, JAMES, doi:10.1029/2021MS002960) and output from Earth System Models (ESMs) used for the testbed. They will review existing code and help develop improved workflows with a strong focus on data sharing and reproducibility. They will then work with us to begin to implement Data Shapley for data valuation. The student will also contribute to analysis of the reconstructed ocean carbon field and be included in publications resulting from this work.
CANDIDATE REQUIREMENTS
Required Skills: Strong Python and ML skills are required – please discuss both in your application.
Student Eligibility: Master’s
International Students on F1 or J1 Student Visa: Yes, eligible
-
School: School of International and Public Affairs/Arts & Sciences
Department: Economics
Project Overview: We have a number of projects looking at the diffusion of legal ideas in the United States and Canada. One project involves learning the structure of citations and the diffusion of ideas in the US federal judiciary. We have the text of all Federal cases back to 1800, along with the network of citations cases make to each other. We want to look at “breakthrough” federal cases that replace all future citations to the things they cite, or whose embedding distance is far from the things they cite but close to the cases that cite them. Then we will use this to rank influential legal cases in US history, and we will ask our GPT-4o-based summarizer to translate them into accessible language.
The other project uses an existing corpus of collective bargaining contracts in Canada, and the DSI scholar will scrape the universe of judicial labor arbitration cases. The idea is that the language and concepts articulated in judicial decisions will diffuse into the text of collective bargaining agreements, as lawyers coordinate on the judicial language. We will look at embedding distances between contracts and arbitration opinions.
The DSI scholar will (a) process the judicial opinions dataset and implement the two breakthrough measures, and (b) scrape text data from the CanLII database. We will use Sentence-BERT (s-bert) to measure embeddings.
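A minimal sketch of the embedding-distance comparison, assuming document embeddings produced by a Sentence-BERT model; the vectors below are illustrative placeholders rather than real model output.

```python
import numpy as np

def cosine_distance(a, b):
    """1 minus the cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# In practice the vectors would come from a Sentence-BERT model, e.g.
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   contract_vec, opinion_vec = model.encode([contract_text, opinion_text])
# Here we use small illustrative vectors instead.
contract_vec = np.array([0.2, 0.9, 0.1])
opinion_vec = np.array([0.25, 0.85, 0.15])
d = cosine_distance(contract_vec, opinion_vec)
```

A falling distance between a contract and an earlier arbitration opinion over successive contract rounds would be evidence of the diffusion the project hypothesizes.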
CANDIDATE REQUIREMENTS
Required Skills: Python, and specifically expertise with networks and embeddings would be helpful.
Student Eligibility: Master’s, Senior
International Students on F1 or J1 Student Visa: Yes, eligible
-
School: Climate School
Department: Seismology, Geology and Tectonophysics
Project Overview: Volcanology has transformed into a highly data-driven and computationally focused field. Numerous computational models have been developed to simulate various physical phenomena during volcanic eruptions. A critical component of forecasting the behavior of volcanic eruptions is assessing the probability of different outcomes and scenarios. For this, scientists implement probabilistic modeling approaches, which provide such assessments. The DSI scholar who joins this project will be tasked with creating workflows that perform simulations and generate hazard maps with quantitative probability assessments. The workflows will be created using Jupyter Notebooks and be based on a range of eruption simulation tools written in different languages. The workflows will utilize probabilistic tools such as Markov Chain Monte Carlo (MCMC) or the Ensemble Kalman Filter (EnKF). The objective of this project is to provide a comprehensive and flexible tool for members of the volcanology community to utilize. In the long term, this tool could be expanded for use in any geophysical process that requires uncertainty quantification.
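As a small illustration of the MCMC component mentioned above, here is a minimal Metropolis-Hastings sampler on a toy calibration problem; the eruption-parameter framing and the synthetic observations are invented for illustration, not part of the project's actual models.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: infer a scalar simulation input mu (say, a source parameter)
# from noisy synthetic observations with known measurement error sigma = 2.
obs = rng.normal(10.0, 2.0, size=30)

def log_posterior(mu):
    """Flat prior on mu; Gaussian likelihood with sigma = 2."""
    return -0.5 * np.sum((obs - mu) ** 2) / 2.0 ** 2

# Random-walk Metropolis-Hastings sampler
samples = []
mu = 5.0                                   # arbitrary starting point
lp = log_posterior(mu)
for _ in range(5000):
    prop = mu + rng.normal(0, 0.5)         # symmetric proposal
    lp_prop = log_posterior(prop)
    if np.log(rng.uniform()) < lp_prop - lp:
        mu, lp = prop, lp_prop             # accept the move
    samples.append(mu)
posterior = np.array(samples[1000:])       # discard burn-in
```

The resulting `posterior` array is exactly the kind of sample-based uncertainty estimate that would feed a probabilistic hazard map; in the real workflows, the cheap Gaussian likelihood here would be replaced by a call to an eruption simulation.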
CANDIDATE REQUIREMENTS
Required Skills:
– Fluency in Python is required.
– Familiarity with statistical methods and notation is highly preferred.
– We expect the scholar to be comfortable reading academic papers or other high-level material to familiarize themselves with the concepts.
– Other scripting skills and programming language knowledge will also be beneficial.
Student Eligibility: Master’s, Senior, Junior
International Students on F1 or J1 Student Visa: Yes, eligible