One of the best descriptions of data science comes from David Blei, Professor of Computer Science and Professor of Statistics at Columbia University, and Padhraic Smyth, Professor of Statistics at the University of California, Irvine. In their Proceedings of the National Academy of Sciences (PNAS) article “Science and Data Science,” they write that “data science is the child of statistics and computer science.” They further elaborate:
Data science focuses on exploiting the modern deluge of data for prediction, exploration, understanding, and intervention. It emphasizes the value and necessity of approximation and simplification; it values effective communication of the results of a data analysis and of the understanding about the world and data that we glean from it; it prioritizes an understanding of the optimization algorithms and transparently managing the inevitable tradeoff between accuracy and speed; it promotes domain-specific analyses, where data scientists and domain experts work together to balance appropriate assumptions with computationally efficient methods [Blei and Smyth 2017].
The “child” metaphor appropriately implies that data science inherits (ideally the best) from both its parents, but eventually grows into its own entity. Its focus separates it from its parents.
How is data science different from computer science? Data science embraces uncertainty and approximation as first-class concepts. For both, it uses probability modeling for mathematical formulation and reasoning. In contrast, computer science’s foundations sit squarely on symbolic logic; much of computing rests on the abstraction from voltages to bits. In the logical framework of computer science, uncertainty is traditionally represented as non-determinism. This distinction is a gross over-simplification of computer science, since many subareas of computing use probabilistic reasoning, but often these probabilistic framings are built as scaffolding over its discrete and logic-based elements. Thinking as a computer scientist, but with the perspective of a data scientist, takes us beyond the discrete, combinatorial, and exact.
How is data science different from statistics? Statistics has always been about developing models grounded in probability theory to describe data arising from real-world phenomena. To data science, the field of statistics contributes principles and methods for the design of experiments and for statistical model building, including model evaluation and assessment, uncertainty quantification, prediction, and data generation. Data science brings to statistics modern computational infrastructures (massive data centers, including clusters of GPUs and FPGAs), large datasets, and algorithmic design and analysis. With Big Compute, Big Data, and efficient algorithms, using statistical models for prediction, exploration, and related tasks becomes tractable and scalable. Data science enables statisticians to pursue new and exciting challenges, such as developing models for highly complex systems, that were either unreachable or unfathomable before.
What will be exciting to see is how data science grows up. What new kinds of problems will data science be able to solve? What new techniques will be invented that would not have come into existence if not for the marriage of computer science and statistics? And finally, what will the field of data science look like, or be like, when the child of computer science and statistics enters its adulthood? After all, five decades ago, no one could have predicted the revolutionary effect that computer science has had on our lives. Data science possesses the same potential to revolutionize society.
Acknowledgments
I would like to thank my colleagues at Columbia University—David Blei, Richard Davis, and Tian Zheng—for their comments and extremely useful edits to this post.
Reference
[Blei and Smyth 2017] David Blei and Padhraic Smyth, “Science and Data Science,” Proceedings of the National Academy of Sciences, vol. 114, no. 33, June 2017, pp. 8689-8692.
Jeannette M. Wing is Avanessians Director of the Data Science Institute and a professor of computer science at Columbia University.