Although data science builds on knowledge from computer science, mathematics, statistics, and other disciplines, data science is a unique field with many mysteries to unlock: challenging scientific questions and pressing questions of societal importance.
Is data science a discipline?
Data science is a field of study: one can get a degree in data science, get a job as a data scientist, and get funded to do data science research. But is data science a discipline, or will it evolve to be one, distinct from other disciplines? Here are a few meta-questions about data science as a discipline.
Ten research areas
While the answers to the above meta-questions are still under lively debate, including within the pages of this journal, we can ask an easier question, one that also underlies any field of study: What are the research challenge areas that drive the study of data science? Here is a list of ten. They are not in any priority order, and some of them are related to each other. They are phrased as challenge areas, not challenge questions. They are not necessarily the “top ten,” but they are a good ten to start the community discussing what a broad research agenda for data science might look like.1
Closing remarks
As many universities and colleges are creating new data science schools, institutes, centers, etc. (Wing, Janeja, Kloefkorn, & Erickson, 2018), it is worth reflecting on data science as a field. Will data science as an area of research and education evolve into being its own discipline or be a field that cuts across all other disciplines? One could argue that computer science, mathematics, and statistics share this commonality: they are each their own discipline, but they each can be applied to (almost) every other discipline. What will data science be in 10 or 50 years?
Acknowledgements
I would like to thank Cliff Stein, Gerard Torrats-Espinosa, Max Topaz, and Richard Witten for their feedback on earlier renditions of this article. Many thanks to all Columbia Data Science faculty who have helped me formulate and discuss these ten (and other) challenges during our Fall 2019 retreat.
References
Athey, S. (2016). Susan Athey on how economists can use machine learning to improve policy. Retrieved from https://siepr.stanford.edu/news/susan-athey-how-economists-can-use-machine-learning-improve-policy
Berger, J., He, X., Madigan, D., Murphy, S., Yu, B., & Wellner, J. (2019). Statistics at a Crossroad: Who is for the Challenge? NSF workshop report. Retrieved from https://hub.ki/groups/statscrossroad
Connelly, M., Madigan, D., Jervis, R., Spirling, A., & Hicks, R. (2019). The History Lab. Retrieved from http://history-lab.org/
Floridi, L. & Taddeo, M. (2016). What is Data Ethics? Philosophical Transactions of the Royal Society A, vol. 374, issue 2083, December 2016.
Garfinkel, S. (2019). Deploying Differential Privacy for the 2020 Census of Population and Housing. Privacy Enhancing Technologies Symposium, Stockholm, Sweden. Retrieved from http://simson.net/ref/2019/2019-07-16%20Deploying%20Differential%20Privacy%20for%20the%202020%20Census.pdf
Liebman, B.L., Roberts, M., Stern, R.E., & Wang, A. (2017). Mass Digitization of Chinese Court Decisions: How to Use Text as Data in the Field of Chinese Law. UC San Diego School of Global Policy and Strategy, 21st Century China Center Research Paper No. 2017-01; Columbia Public Law Research Paper No. 14-551. Retrieved from https://scholarship.law.columbia.edu/faculty_scholarship/2039
Mueller, A. (2019). Data Analysis Baseline Library. Retrieved from https://libraries.io/github/amueller/dabl
Ratner, A., Bach, S., Ehrenberg, H., Fries, J., Wu, S., & Ré, C. (2018). Snorkel: Rapid Training Data Creation with Weak Supervision. Proceedings of the 44th International Conference on Very Large Data Bases.
Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and Policy Considerations for Deep Learning in NLP. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL).
Taddy, M. (2019). Business Data Science: Combining Machine Learning and Economics to Optimize, Automate, and Accelerate Business Decisions. McGraw-Hill.
Wang, Y. & Blei, D.M. (2018). The Blessings of Multiple Causes. Retrieved from https://arxiv.org/abs/1805.06826
Wing, J.M. (2019). The Data Life Cycle. Harvard Data Science Review, vol. 1, no. 1.
Wing, J.M., Janeja, V.P., Kloefkorn, T., & Erickson, L.C. (2018). Data Science Leadership Summit, Workshop Report, National Science Foundation. Retrieved from https://dl.acm.org/citation.cfm?id=3293458
J.M. Wing, “Ten Research Challenge Areas in Data Science,” Voices, Data Science Institute, Columbia University, January 2, 2020. arXiv:2002.05658.
Jeannette M. Wing is Avanessians Director of the Data Science Institute and professor of computer science at Columbia University.