Foundations of Data Science
About the Focus Area
Although data science builds on knowledge from computer science, mathematics, statistics, and other disciplines, data science is a unique field.
“Data science focuses on exploiting the modern deluge of data for prediction, exploration, understanding, and intervention. It emphasizes the value and necessity of approximation and simplification; it values effective communication of the results of a data analysis and of the understanding about the world and data that we glean from it; it prioritizes an understanding of the optimization algorithms and transparently managing the inevitable tradeoff between accuracy and speed; it promotes domain-specific analyses, where data scientists and domain experts work together to balance appropriate assumptions with computationally efficient methods.” – David Blei and Padhraic Smyth, “Science and Data Science,” Proceedings of the National Academy of Sciences, vol. 114, no. 33, August 2017, pp. 8689-8692.
What makes data science data science?
What is/are the driving deep question(s) of data science?
What is the role of the domain in the field of data science?
What new kinds of problems will data science be able to solve?
What new techniques will be invented that would not have come into existence if not for the marriage of computer science and statistics?
What will the field of data science look like, or be like, when the child of computer science and statistics enters its adulthood?
- Xi Chen, Computer Science
- Xuan (Sharon) Di, Civil Engineering and Engineering Mechanics
- Qiang Du, Applied Physics and Applied Mathematics
- Eric Talley, Law
This project is developing a fundamental framework that uses a game-theoretic approach to model the strategic interactions between human-driven vehicles and autonomous and/or connected vehicles. Beyond its technical advances, this project also addresses the Trolley Problem (i.e., ethical sense development) in autonomous vehicle (AV) algorithm design.
Lydia Liu, East Asian Languages and Cultures
Smaranda Muresan, Data Science Institute
This course introduces a dual view on language diversity: 1) a typology of language vitality and endangerment, and 2) a resource-centric typology (low-resource vs. high-resource) regarding the availability of data resources to develop computational models for language analysis. This course also addresses the challenge of scaling natural language processing technologies developed mostly for English to the rich diversity of human languages. The course brings data and computational literacy about multilingual technologies to humanities students, while also exposing computer science and data science students to ethical, cultural, business, and policy issues within the context of multilingual technologies.
Mentor: Pierre Gentine
Team Members: Aashna Kanuga, Akshata Patell, Arjun Dhillon, Dhananjay Deshpande, Hrishikesh Telang, Somendra Tripathi
Current DNN-based climate models suffer from generalization issues due to the inherent stochasticity of clouds and their highly sparse distribution. This team explored methods such as CVAEs, GANs, and LSTMs to alleviate these issues. They determined that GANs need to be trained for much longer (on the order of thousands) to give sufficiently accurate results; the conditional VAE approach showed promise in capturing the higher-order statistics of precipitation in climate solutions; and LSTM models were not able to find patterns using five hours of data inputs per prediction.
Mentors: Felipe Penha, Neoway; Sining Chen, Data Science Institute
Team Members: Timotius Kartawijaya, Nico Winata, Charlene Luo, Fernando Troeman, Jing Yi Zhou
This team developed a tool to generate a sentiment score for individual entities in any given review. They used a random subset of 15,000 restaurant reviews from the Yelp Open Dataset to validate their model. Their methodology was able to generate sentiment scores on identified entities from an arbitrary corpus, with the help of a trained ER model. Their steps were packaged as open-source software. Next steps include comparison with other parsing methods.
Mentors: Amir Rahmani, Jihan Wei, James Krach, Capital One; Tian Zheng, Statistics
Team Members: Karan Sindwani, Lisa Sarah Thomas, Neha Saraf, Raj Biswas
Entity resolution refers to the task of finding all mentions of the same real-world entity within a knowledge base or across multiple knowledge bases. It can reduce complexity by proposing canonicalized references to entities and by deduplicating and linking entities. This team used a multi-type graph summarization method that identifies entities in an unsupervised setting on a dataset of products from Google and Amazon.
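To make the idea of entity resolution concrete, the following is a minimal sketch only. It is not the team's multi-type graph summarization method; it swaps in a much simpler technique, pairwise string similarity plus union-find clustering, using only the Python standard library. The product names, the 0.8 threshold, and the "longest mention" rule for choosing a canonical reference are all illustrative assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(name.lower().split())


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def resolve_entities(records: list[str], threshold: float = 0.8) -> list[list[str]]:
    """Group records whose pairwise similarity meets the threshold.

    Union-find makes matches transitive: if A matches B and B matches C,
    all three end up in one cluster. The threshold is an assumed value.
    """
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters: dict[int, list[str]] = {}
    for idx, record in enumerate(records):
        clusters.setdefault(find(idx), []).append(record)
    return list(clusters.values())


def canonical(cluster: list[str]) -> str:
    """Pick the longest mention as the canonical reference (arbitrary choice)."""
    return max(cluster, key=len)


if __name__ == "__main__":
    # Hypothetical product listings, invented for illustration.
    listings = [
        "Apple iPod Nano 8GB",
        "apple ipod nano 8 GB",
        "Canon PowerShot A590",
        "Canon Powershot A590 IS",
    ]
    for group in resolve_entities(listings):
        print(canonical(group), "<-", group)
```

Comparing every pair of records scales quadratically, so a sketch like this only suits small datasets; larger collections typically need some way to limit which pairs are compared.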
Jackson Loper (Ph.D. in applied mathematics, Brown University) produces analytical tools to understand datasets arising from new single-cell experimental methods. These methods yield measurements for tens of thousands of features of a single cell, and researchers can measure large numbers of cells in a single tissue. The result is a data matrix with hundreds of millions of entries. In which ways is it possible, or impossible, to use these kinds of measurements to understand the diversity within cell populations? That is the question he seeks to answer with his research, which also focuses on handling cases of missing data.
Mentors: Liam Paninski, David Blei
Gemma Moran (Ph.D. in statistics, University of Pennsylvania) develops statistical methods for analyzing high-dimensional data, particularly in the sciences. She is working on a project to identify when more complex models for data are required and when simpler models are sufficient. Her current collaborative projects are analyzing CRISPR data to identify gene interaction effects, and predicting the formation of perovskites, inexpensive materials with promising photovoltaic properties for solar cells.
Mentor: David Blei
Christian Alexander Andersson Naesseth (Ph.D. in electrical engineering, Linköping University) focuses on approximate statistical inference, causality, representation learning, and artificial intelligence. He develops new algorithms, theories, and practical tools to help solve challenging problems in the field of data science.
Mentor: David Blei
Dhanya Sridhar (Ph.D. in computer science, University of California, Santa Cruz) combines modern machine learning techniques with causal inference to study social science questions. She uses text and network data and adapts probabilistic models and deep learning methods to infer causal relationships. She also applies causal inference in various ways, such as studying how language affects persuasion or political outcomes, how influence spreads in social networks, and whether algorithmic decisions learned from historical data are fair.
Mentor: David Blei
Christopher Tosh (Ph.D. in computer science, University of California, San Diego) focuses on deriving rigorous guarantees for learning algorithms and representations. His current interests include the representational capabilities of fly olfaction, the design of automated-experimentation algorithms for cancer drug discovery, and the underlying structure of modern artificial neural-network representations.
Mentor: Daniel Hsu