The following is a list of data science-related courses. Please refer to the Directory of Courses for the most current course offerings and information.
Cross-Registration Instructions for Non-DSI Students
Please note that DSI students have priority registration, so enrollment will depend on the space available after our students have registered. Non-DSI students will be able to register or join a waitlist via SSOL during the Change of Program registration period. Please be sure to obtain your program advisor's approval before enrolling. We encourage students to attend the first class to get the syllabus and a feel for the course. The spring 2019 Change of Program period is Tuesday, January 22 - Friday, February 1.
Statistics & Computer Science
STCS GR5705 (formerly STAT W4242)
Introduction to Data Science
Data Science is a dynamic and fast-growing field at the interface of Statistics and Computer Science. The emergence of massive datasets containing millions or even billions of observations provides the primary impetus for the field. Such datasets arise, for instance, in large-scale retailing, telecommunications, astronomy, and internet social media. This course will emphasize practical techniques for working with large-scale data. Specific topics covered will include statistical modeling and machine learning, data pipelines, programming languages, "big data" tools, and real-world topics and case studies. The use of statistical and data manipulation software will be required. The course is intended for students in non-quantitative graduate-level disciplines. This course will not count towards degree requirements for graduate programs such as Statistics, Computer Science, or Data Science. Students should inquire with their respective programs to determine whether the course counts towards minimum degree requirements. This course does not fulfill any major requirements for undergraduate degree programs offered by Computer Science.
Fall Semester: 3 credits
Computer Systems for Data Science
Prerequisites: Background in Computer System Organization and good working knowledge of C/C++. Corequisites: CSOR W4246 Algorithms for Data Science, STAT W4203 Probability Theory, or equivalent as approved by faculty advisor.
An introduction to computer architecture and distributed systems with an emphasis on warehouse-scale computing systems. Topics will include fundamental tradeoffs in computer systems, hardware and software techniques for exploiting instruction-level parallelism, data-level parallelism, and task-level parallelism, scheduling, caching, prefetching, network and memory architecture, latency and throughput optimizations, specialization, and an introduction to programming data center computers.
Spring Semester: 3 credits
Machine Learning for Data Science
Daniel Hsu (Syllabus)
Prerequisites: Background in linear algebra and probability and statistics.
COMS 4721 is a graduate-level introduction to machine learning. The course covers basic statistical principles of supervised machine learning, as well as some common algorithmic paradigms. Additional topics, such as representation learning and online learning, may be covered if time permits.
Spring Semester: 3 credits
Algorithms for Data Science
Prerequisites: Basic knowledge of programming (e.g., at the level of COMS W1007) and a basic grounding in calculus and linear algebra.
Methods for organizing data, e.g. hashing, trees, queues, lists, priority queues. Streaming algorithms for computing statistics on the data. Sorting and searching. Basic graph models and algorithms for searching, shortest paths, and matching. Dynamic programming. Linear and convex programming. Floating point arithmetic, stability of numerical algorithms, eigenvalues, singular values, PCA, gradient descent, stochastic gradient descent, and block coordinate descent. Conjugate gradient, Newton and quasi-Newton methods. Large-scale applications from signal processing, collaborative filtering, recommendation systems, etc.
Fall Semester: 3 credits
Prerequisites: MATH V1101 Calculus I and V1102 Calculus II or the equivalent.
A calculus-based introduction to probability theory. Topics covered include random variables, conditional probability, expectation, independence, Bayes' rule, important distributions, joint distributions, moment generating functions, central limit theorem, laws of large numbers and Markov's inequality.
Probability & Statistics for Data Science
Instructor: Vince Dorie
This course covers the following topics: fundamentals of probability theory and statistical inference used in data science; probabilistic models, random variables, useful distributions, expectations, law of large numbers, central limit theorem; statistical inference, including point and confidence interval estimation, hypothesis tests, and linear regression.
Fall Semester: 3 credits
Exploratory Data Analysis & Visualization
Instructor: Joyce Robbins
Fundamentals of data visualization, layered grammar of graphics, perception of discrete and continuous variables, introduction to Mondrian, mosaic plots, parallel coordinate plots, introduction to GGobi, linked plots, brushing, dynamic graphics, model visualization, clustering and classification.
Statistical Inference & Modeling
Instructor: Marco Avella
Prerequisites: Working knowledge of calculus and linear algebra (vectors and matrices), and STAT GR5203 or equivalent.
This course covers fundamentals of statistical inference and testing, and gives an introduction to statistical modeling. The first half of the course will focus on inference and testing, covering topics such as maximum likelihood estimates, hypothesis testing, likelihood ratio tests, and Bayesian inference. The second half of the course will provide an introduction to statistical modeling via introductory lectures on linear regression models, generalized linear regression models, nonparametric regression, and statistical computing. Throughout the course, real-data examples will be used in lecture discussion and homework problems.
MS students are encouraged to explore courses offered across the university and take advantage of the expertise in a wide range of disciplines at Columbia. Prior to registration, students receive advisement to determine if a course of interest is relevant and meets the criteria of a 4000-level or higher, technical course completed for a letter grade. You're welcome to explore the CU Directory of Classes for possible courses: http://www.columbia.edu/cu/bulletin/uwb/
Please note that many departments, including DSI, give registration priority to their students. Space permitting, courses are then opened up to students outside the department.
The following courses are examples of classes that MS students have used for elective credit. Course offerings are dependent on faculty availability and may vary each semester.
Prerequisites: Familiarity with programming in either Python or R. Basic Probability.
Methods in biomedical data science (i.e. translational bioinformatics) for graduate students and upperclassmen. Students study the statistical and computational algorithms to evaluate large biomedical data, including sequence analysis, application of supervised and unsupervised machine learning, graph theoretic models and network analysis, and chemical informatics. They study how to apply these algorithms to biomedical domains in non-human genetics, human genetics, pharmacology, and public health. Successful completion of the course readies the student for graduate level research in translational bioinformatics.
Topics in Computer Science: Applied Machine Learning
This class offers a hands-on approach to machine learning and data science. The class discusses the application of machine learning methods like SVMs, Random Forests, Gradient Boosting and neural networks to real-world datasets, including data preparation, model selection and evaluation. This class complements COMS W4721 in that it relies entirely on available open source implementations in scikit-learn and TensorFlow. Apart from applying models, we will also discuss software development tools and practices relevant to productionizing machine learning models.
Topics in Computer Science: Causal Inference for Data Science
Instructor: Adam Kelleher (Syllabus)
Topics in Computer Science: Elements of Data Science: A First Course
Instructors: Bryan Gibson, Haiyuan Wang
This course is designed as an introduction to the elements that constitute the skill set of a data scientist. The course will focus on the utility of these elements in common tasks of a data scientist, rather than their theoretical formulation and properties. The course provides a foundation of basic theory and methodology with applied examples to analyze large engineering, business, and social data for data science problems. Hands-on experiments with R or Python will be emphasized.
Topics in Computer Science: Machine Learning Products for Data Science
Instructor: Adam Kelleher
NLP: Computational Models of Social Meaning
This applied Natural Language Processing course will focus on computational methods for extracting social and interactional meaning from large volumes of text and speech (both traditional media and social media). Topics will include:
- Sentiment Analysis: automatic detection of people’s sentiment towards a topic, event, product, or persons. Practical applications in various domains will be discussed (e.g., predicting stock market prices, or presidential elections)
- Emotion and Mood Analysis: automatic detection of people’s emotions (angry, sad, happy) by analyzing various media such as books, emails, lyrics, online discussion forums. Practical applications in various domains (such as predicting depression, categorization of songs)
- Belief Analysis and Hedging: automatic detection of people’s beliefs (committed belief and non-committed beliefs) from social media. Analysis of the use of hedging as a communicative device in various media: online discussions, scientific writing or legal discussions.
- Deception Detection (e.g., detecting fake reviews online, or deceptive speech in court proceedings)
- Argumentation Mining: automatic detection of arguments from text, such as online discussion or persuasive essays. Practical applications in various domains (e.g., political, legal, or educational, such as improving students’ skills in writing persuasive essays)
- Social Power: automatic detection of power structure in organizations by analyzing people’s communications such as emails.
- Extracting Social Networks from text, such as networks of characters from novels, or networks from social media (e.g., people holding particular opinions, or network of friends).
- Personality and Interpersonal Stance
Topics in Computer Science: Projects in Data Science: A First Course
Instructors: Patrick Houlihan, David Shilane
This course will introduce students to the practice of Data Science. Across many practice areas, Data Science is applied to generate knowledge and improve the quality of products and services. The form of this work can incorporate a wide variety of information sources, analytical methods, technological systems, and reporting formats. This course will provide exposure to the tools and techniques of Data Science while exploring a number of representative Practice Areas, Methods, and Software Skills.
Topics in Information Processing: Big Data Analytics
With the advance of IT storage, processing, computation, and sensing technologies, Big Data has become a new norm of life. Only recently have computers been able to capture and analyze all sorts of large-scale data from all kinds of fields: people, behavior, information, devices, sensors, biological signals, finance, vehicles, astronomy, neurology, etc. Almost all industries are bracing for the challenge of Big Data and want to dig out valuable information to gain insight into their problems. This course will provide the fundamental knowledge to equip students to handle those challenges. This discipline inherently involves many fields. Because of its importance and broad impact, new software and hardware tools and algorithms are quickly emerging. A data scientist needs to keep up with these ever-changing trends to be able to create state-of-the-art solutions for real-world challenges.
Topics in Information Processing: Deep Learning for Computer Vision, Speech, and Language
Topics in Quantitative Finance: Big Data in Finance
Professor Miquel Noguer Alonso (Syllabus)
The vast proliferation of data and increasing technological complexities continue to transform the way industries operate and compete. Ninety percent of the data in the world has been created over the last two years, at a rate of 2.5 quintillion bytes of data per day. Commonly referred to as big data, this rapid growth and storage create opportunities for collection, processing and analysis of structured and unstructured data. Financial services, in particular, have widely adopted big data analytics to inform better investment decisions with consistent returns. In conjunction with big data, algorithmic trading uses vast historical data with complex mathematical models to maximize portfolio returns. The continued adoption of big data will inevitably transform the landscape of financial services. However, along with its apparent benefits, significant challenges remain with regard to big data’s ability to capture the mounting volume of data. The increasing volume of market data poses a big challenge for financial institutions. Along with vast historical data, banking and capital markets need to actively manage ticker data. Likewise, investment banks and asset management firms use voluminous data to make sound investment decisions. Insurance and retirement firms can access past policy and claims information for active risk management. The course will be a mix of theory and practice with real big data cases in finance. We will invite guest lecturers, mostly for real Big Data finance applications. We will give MATLAB, R or Python examples.
STATS GR5293, Sec 002
Topics in Modern Statistics: Applied Machine Learning for Financial Modeling and Forecasting
Instructor: Michel Leonard
The course focuses on translating technical expertise into workplace solutions by teaching students to: (1) identify relevant shortfalls in traditional processes; (2) precisely match datasets and machine learning features to overcome these shortfalls; (3) narrowly define value to fit workplace processes, analytical framework, and bottom line. Each class will be structured as an actual end-to-end workplace project and use concrete examples to teach students to design, build and deliver solutions that integrate these considerations. A combination of assignments, a presentation, and a research paper will be used to evaluate students' progress in bridging technical and applied solutions, with evaluation criteria matching those of a workplace project.
STATS GR5293, Sec 003
Topics in Modern Statistics: Applied Machine Learning for Image Analysis
Instructor: Xiaofu He
Images are everywhere. How to deal with image data, especially at big data scale, is an urgent problem for data analysts. Machine learning has proven to be a powerful technology to process and analyze such big data. The course will discuss how machine learning methods are used in the field of image analysis, including biometrics (iris and face recognition), natural images (object identification/recognition), brain images (encoding and decoding), and handwritten digit recognition. Students will learn how to use traditional machine learning methods in image data processing and analysis, and develop techniques to improve these methods. The aim of this course is to prepare students with the basic knowledge and skills to explore opportunities using machine learning in the field of image analysis.
Sustainability Technology and the Evolution of Smart Cities
This course is offered through the School of Continuing Education. The progress of sustainability in recent years has almost entirely been a result of the evolution of smart, sustainable technology solutions. This course examines opportunities to drive sustainability through technology applications with the end goal of fitting all of the pieces together to envision an intelligent city. Companies are increasingly turning to technology to fulfill their sustainability goals because many technologies provide off-the-shelf, cost-effective and immediate savings compared to operationally invasive, resource-heavy sustainability transformation programs. Sustainability technology ranges from intelligent infrastructure to mobile applications that help to drive the "sharing economy". The course will provide an overview of the sustainability technologies that large corporations are actively pursuing and delve into the project management and integration strategies required to implement these solutions. Successful sustainability practitioners must not only have a strong understanding of the values and methodologies of sustainable operations, but also the tools and technologies available to drive sustainability throughout their organization. Upon completion of the class, students will have a sufficient level of understanding to discuss these solutions and relevant case studies with potential employers. This course will benefit anyone interested in a career in sustainability or in smart cities, as it will provide them with the skills and analytical capabilities to determine which sustainability technologies are a good fit for their company's sustainability and growth strategy.
Data Science Capstone & Ethics
Instructor: Smaranda Muresan
Prerequisites: CSOR W4246 Algorithms for Data Science, STAT W4105 Probability, COMS W4121 Computer Systems for Data Science, or equivalent as approved by faculty advisor. Corequisites: to be completed alongside or after: STAT W4702 Statistical Inference and Modeling, COMS W4721 Machine Learning for Data Science, STAT W4701 Exploratory Data Analysis and Visualization, or equivalent as approved by faculty advisor.
This course provides a unique opportunity for students in the M.S. in Data Science program to apply their knowledge of the foundations, theory and methods of data science to address data science problems in industry, government and the non-profit sector. The course activities focus on a semester-length data science project sponsored by a faculty member or local organization. The project synthesizes the statistical, computational, and engineering challenges and the social issues involved in solving complex real-world problems.
Fall and Spring Semesters: 3 credits