Curriculum

Innovative and Cutting-Edge Curriculum

Designed around both theoretical foundations and practical applications, our courses reflect the latest trends and technologies in data science, including machine learning, natural language processing, and applied deep learning, along with many other courses at the frontiers of the field, all taught by world-class Columbia faculty.

Program Structure

The M.S. in Data Science requires students to complete 21 credits of core coursework and a minimum of 9 elective credits, providing both depth and breadth across key data science disciplines.

Core Courses

Core courses build a strong foundation in algorithms, statistical inference, machine learning, data analysis, and scalable data systems.

Please Note: Students with prior academic preparation in specific core areas may be eligible to waive or test out of certain core courses, allowing them to take additional electives. Waivers are reviewed individually, based on previous coursework and instructor approval.

Electives

Electives allow students to explore advanced or specialized topics and pursue interdisciplinary interests across the university. In addition to Data Science Institute (DSI) courses, students are encouraged to take approved electives in departments such as Computer Science, Statistics, Engineering, Business, Public Health, Economics, and more.

Academic advisors work closely with students prior to registration to determine course relevance, eligibility, and fit. Electives must be 4000-level or above and letter-graded.

Capstone Project

The Capstone serves as the culminating academic experience of the program. In this semester-long, mentored project, students apply data science methods to solve complex, real-world problems in collaboration with faculty or industry partners.
Learn more about Capstone Projects.

Computer Science

Prerequisites: Students are expected to have solid programming experience in Python or an equivalent programming language. This class is intended to be accessible to students who do not necessarily have a background in databases, operating systems, or distributed systems. The goal of this class is to give data scientists and engineers who work with big data a better understanding of how the systems they will be using are built. It will also give them a better understanding of the real-world performance, availability, and scalability challenges of using and deploying these systems at scale. The course covers foundational ideas in designing these systems, while focusing on specific popular systems that students are likely to encounter at work or when doing research.

Spring Semester: 3 credits
COMS W4721 is a graduate-level introduction to machine learning. The course covers basic statistical principles of supervised machine learning, as well as some common algorithmic paradigms. Additional topics, such as representation learning and online learning, may be covered if time permits.

Spring Semester: 3 credits
Prerequisites: Basic knowledge of programming (e.g., at the level of COMS W1007) and a basic grounding in calculus and linear algebra.

Methods for organizing data, e.g., hashing, trees, queues, lists, priority queues. Streaming algorithms for computing statistics on the data. Sorting and searching. Basic graph models and algorithms for searching, shortest paths, and matching. Dynamic programming. Linear and convex programming. Floating point arithmetic and stability of numerical algorithms. Eigenvalues, singular values, PCA, gradient descent, stochastic gradient descent, and block coordinate descent. Conjugate gradient, Newton, and quasi-Newton methods. Large-scale applications from signal processing, collaborative filtering, recommendation systems, etc.

Fall Semester: 3 credits

Engineering

Prerequisites: CSOR W4246 Algorithms for Data Science, STAT W4105 Probability, COMS W4121 Computer Systems for Data Science, or equivalent as approved by faculty advisor. Co-requisites (to be completed alongside or after): STAT W4702 Statistical Inference and Modeling, COMS W4721 Machine Learning for Data Science, STAT W4701 Exploratory Data Analysis and Visualization, or equivalent as approved by faculty advisor.

This course provides a unique opportunity for students in the M.S. in Data Science program to apply their knowledge of the foundations, theory, and methods of data science to data science problems in industry, government, and the non-profit sector. The course activities focus on a semester-length data science project sponsored by a faculty member or local organization. The project synthesizes the statistical, computational, and engineering challenges and the social issues involved in solving complex real-world problems.

Fall and Spring Semesters: 3 credits

Statistics

Prerequisite: Calculus.

This course covers the fundamentals of probability theory and statistical inference used in data science: probabilistic models, random variables, useful distributions, expectations, the law of large numbers, and the central limit theorem; and statistical inference, including point and confidence interval estimation, hypothesis tests, and linear regression.

Fall Semester: 3 credits
Prerequisite: Programming.

Fundamentals of data visualization, layered grammar of graphics, perception of discrete and continuous variables, introduction to Mondrian, mosaic plots, parallel coordinate plots, introduction to GGobi, linked plots, brushing, dynamic graphics, model visualization, clustering and classification.

Fall Semester: 3 credits
Prerequisites: Working knowledge of calculus and linear algebra (vectors and matrices) and STAT GR5203 or equivalent.

This course covers the fundamentals of statistical inference and testing, and gives an introduction to statistical modeling. The first half of the course focuses on inference and testing, covering topics such as maximum likelihood estimates, hypothesis testing, the likelihood ratio test, and Bayesian inference. The second half provides an introduction to statistical modeling via introductory lectures on linear regression models, generalized linear regression models, nonparametric regression, and statistical computing. Throughout the course, real-data examples will be used in lecture discussion and homework problems.

Fall and Spring Semesters: 3 credits

Electives

This course introduces students to real-world applications of data science in the banking and insurance industries. Using datasets such as loan portfolios, credit card transactions, and FEMA flood claims, students learn to model risk, detect fraud, and evaluate the impact of climate on financial services. Hands-on work with Python, R, and industry tools is combined with discussion of ethics, regulation, and emerging trends in FinTech and InsurTech, preparing students to apply data science to complex challenges in financial services.
This course explores how Large Language Models (LLMs) are transforming business strategy, operations, and innovation. Students will learn to apply LLMs to market research, customer engagement, content generation, data analysis, and entrepreneurship through hands-on workshops and case studies. Emphasizing both technical applications and ethical considerations, the course equips students to design and implement AI-driven solutions to real-world business challenges. A capstone project allows students to develop and present a unique LLM-based business application by the end of the semester.
Computational approaches to the analysis, understanding, and generation of natural language text at scale. Emphasis on machine learning techniques for NLP, including deep learning and large language models. Applications may include information extraction, sentiment analysis, question answering, summarization, machine translation, and conversational AI. Discussion of datasets, benchmarking and evaluation, interpretability, and ethical considerations.
One of the key ingredients in the current success of AI is the ability to perform computations on vast amounts of training data. Today, applying high-performance computing (HPC) techniques to AI algorithms is a fundamental driver of progress in artificial intelligence. In this course, you will learn HPC techniques typically applied to supercomputing software and how they are used to obtain maximum performance from AI algorithms. You will also learn about techniques for building efficient AI systems. This is becoming especially critical in the era of large foundation models such as GPT and Llama, which require massive amounts of computational power and energy.
This applied Natural Language Processing course will focus on computational methods for extracting social and interactional meaning from large volumes of text and speech (both traditional media and social media). Topics will include:
- Sentiment Analysis: automatic detection of people’s sentiment towards a topic, event, product, or persons. Practical applications in various domains will be discussed (e.g., predicting stock market prices, or presidential elections)
- Emotion and Mood Analysis: automatic detection of people’s emotions (angry, sad, happy) by analyzing various media such as books, emails, lyrics, online discussion forums. Practical applications in various domains (such as predicting depression, categorization of songs)
- Belief Analysis and Hedging: automatic detection of people’s beliefs (committed belief and non-committed beliefs) from social media. Analysis of the use of hedging as a communicative device in various media: online discussions, scientific writing or legal discussions.
- Deception Detection (e.g., detecting fake reviews online, or deceptive speech in court proceedings)
- Argumentation Mining: automatic detection of arguments from text, such as online discussions or persuasive essays. Practical applications in various domains, e.g., political, legal, or educational (such as improving students’ skills in writing persuasive essays)
- Social Power: automatic detection of power structure in organizations by analyzing people’s communications such as emails.
- Extracting Social Networks from text, such as networks of characters from novels, or networks from social media (e.g., people holding particular opinions, or network of friends).
- Personality and Interpersonal Stance
The world is full of noise and uncertainty. To make sense of it, we collect data and ask questions. Is there a tumor in this x-ray scan? What affects the quality of my manufacturing plant? How old is this planet I see through the telescope? Does this drug actually work? To pose and answer such questions, data scientists must iterate through a cycle: probabilistically model a system, infer hidden patterns from data, and evaluate how well our model describes reality. By the end of this course, you will learn how to use probabilistic programming to effectively iterate through this cycle. Specifically, you will master modeling real-world phenomena using probability models, using advanced algorithms to infer hidden patterns from data, and evaluating the effectiveness of your analysis. You will learn to use (and perhaps even contribute to) Edward throughout this course.
This class offers a hands-on approach to machine learning and data science. The class discusses the application of machine learning methods such as SVMs, Random Forests, Gradient Boosting, and neural networks to real-world datasets, including data preparation, model selection, and evaluation. This class complements COMS W4721 in that it relies entirely on available open-source implementations in scikit-learn and TensorFlow. Apart from applying models, we will also discuss software development tools and practices relevant to productionizing machine learning models.
This course provides a practical, hands-on introduction to Deep Learning. We aim to help students understand the fundamentals of neural networks (DNNs, CNNs, and RNNs), and prepare students to successfully apply them in practice. This course will be taught using open-source software, including TensorFlow 2.0. In addition to covering the fundamental methods, we will discuss the rapidly developing space of frameworks and applications, including deep learning on the web. This course includes an emphasis on fairness and testing, and teaches best practices with these in mind.
Data scientists often have to answer questions that will lead to decisions about actions a company might take. Often, they will be able to run an experiment and see the effect a decision might have by testing it first. Other times, they will only have observational data at their disposal. In both cases, they need to infer the causal effect of an action on some outcomes of interest. Causal inference is an essential skill for a data scientist. Without a proper understanding of it, estimates can go badly wrong: biases as large as 1000% have been observed in practice! This course will cover the basics of the potential outcomes framework, the Pearlian framework, and a collection of methods for observational and experimental causal inference. We’ll use examples from industry applications throughout the course, especially focused on web applications.
This course is designed as an introduction to the elements that constitute the skill set of a data scientist. The course will focus on the utility of these elements in common tasks of a data scientist, rather than their theoretical formulation and properties. The course provides a foundation of basic theory and methodology, with applied examples, for analyzing large engineering, business, and social datasets. Hands-on experiments with R or Python will be emphasized.
The vast proliferation of data and increasing technological complexities continue to transform the way industries operate and compete. Ninety percent of the data in the world has been created over the last two years, at a rate of 2.5 quintillion bytes per day. Commonly referred to as big data, this rapid growth in data and storage creates opportunities for the collection, processing, and analysis of structured and unstructured data. Financial services, in particular, have widely adopted big data analytics to inform better investment decisions with consistent returns. In conjunction with big data, algorithmic trading uses vast historical data with complex mathematical models to maximize portfolio returns. The continued adoption of big data will inevitably transform the landscape of financial services. However, along with its apparent benefits, significant challenges remain with regard to big data’s ability to capture the mounting volume of data. The increasing volume of market data poses a big challenge for financial institutions. Along with vast historical data, banking and capital markets need to actively manage ticker data. Likewise, investment banks and asset management firms use voluminous data to make sound investment decisions. Insurance and retirement firms can access past policy and claims information for active risk management. The course will be a mix of theory and practice, with real big data cases in finance. Guest lecturers will be invited, mostly to present real big data applications in finance. Examples will be given in MATLAB, R, or Python.
The course focuses on translating technical expertise into workplace solutions by teaching students to: (1) identify relevant shortfalls in traditional processes; (2) precisely match datasets and machine learning features to overcome these shortfalls; (3) narrowly define value to fit workplace processes, analytical frameworks, and the bottom line. Each class will be structured as an actual end-to-end workplace project and use concrete examples to teach students to design, build, and deliver solutions that integrate these considerations. A combination of assignments, a presentation, and a research paper will be used to evaluate students’ progress in bridging technical and applied solutions, with evaluation criteria matching those of a workplace project.
Computer vision is a subfield of artificial intelligence (AI) and computer science, which enables machines to interpret and understand visual data such as digital images, videos, and other visual inputs. This course will discuss how machine learning algorithms including deep learning are used in the field of computer vision, which will cover a range of topics including biometrics (e.g., iris and face recognition), medical image analysis (e.g., rheumatoid arthritis and brain images), object recognition, handwritten digit, and street view house number recognition. Students will learn basic knowledge of image processing and analysis, then how to apply machine learning methods, including deep learning, to the field of computer vision. This course aims to prepare students with basic knowledge and skills to explore opportunities using machine learning in the field of computer vision.
This applied course bridges statistical theory and real-world experimentation in the technology industry. Students learn how large-scale online platforms such as Uber, Reddit, and Google design, implement, and interpret experiments to drive data-informed decisions. Focusing on the practical application of causal science, the course explores key topics including principles of experimental design and diagnostics; methods to improve experiment sensitivity and reliability; techniques for measuring long-term and heterogeneous treatment effects; and applications of causal inference to observational data. While this course is not coding-intensive, students are expected to have basic proficiency in Python or R and a solid foundation in probability and statistical inference.
Images are everywhere. How to deal with image data, especially at big data scale, is an urgent problem for data analysts. Machine learning has proven to be a powerful technology for processing and analyzing such data. The course will discuss how machine learning methods are used in the field of image analysis, including biometrics (iris and face recognition), natural images (object identification/recognition), brain images (encoding and decoding), and handwritten digit recognition. Students will learn how to use traditional machine learning methods in image data processing and analysis, and develop techniques to improve these methods. The aim of this course is to prepare students with the basic knowledge and skills to explore opportunities using machine learning in the field of image analysis.