Data Science Day 2021
Wednesday, April 21, 2021
6:00 am - 9:00 am
Data Science Day provides a forum for innovators in academia, industry, and government to connect. The April 21, 2021 virtual event featured two sessions of lightning talks from leading Columbia University faculty members; interactive demonstrations and posters; and a keynote address from Pat Bajari, a Chief Economist at Amazon and Vice President of Amazon’s Core AI team.
All speakers’ roles and titles are accurate as of the time of the event (2021).
Pat Bajari is a Chief Economist at Amazon and Vice President of Amazon’s Core AI team. His team of software engineers and scientists in machine learning, statistics, operations research, and econometrics has helped to build scalable systems for supply chain, transportation, pricing, automated marketing, robotics, forecasting, human resources, and more. Prior to joining Amazon, he was a full-time faculty member in economics at Harvard, Stanford, Duke and Minnesota.
Data science is an ever-evolving and expanding field. Here, we explore ways in which it has become an integral part of the decision-making and optimization of countless fields, including patient care, B2B, and design.
Oded Netzer
Arthur J. Samberg Professor of Business, Columbia Business School
Talk Title: Salespeople automation: A Human-Machine Hybrid Approach
Abstract: In a world advancing towards automation, we propose a human-machine hybrid approach to automating decision making in high human interaction environments and apply it in the business-to-business (B2B) retail context. Using sales transactions data from a B2B retailer, we create an automated version of each salesperson that learns and automatically reapplies the salesperson’s pricing policy. We conduct a field experiment with the B2B retailer, providing salespeople with their own model’s price recommendations in real time through the retailer’s CRM system, and allowing them to adjust their original pricing accordingly. We find that despite the loss of non-codeable information available to the salesperson but not to the model, providing the model’s price to the salesperson increases profits for treated quotes by 11% relative to a control condition. Using a counterfactual analysis, we show that while in most cases the model’s pricing leads to higher profitability by eliminating inter-temporal human biases, the salesperson generates higher profits when pricing special quotes with unique or complex characteristics. Accordingly, we propose a machine learning hybrid pricing strategy that automatically allocates quotes to the model or to the human expert and generates profits significantly higher than either the model or the salespeople.
Lydia Chilton
Assistant Professor of Computer Science, Columbia Engineering
Talk Title: AI Tools for Design and Innovation
Abstract: How can computational tools and AI help people be better at innovation and creative problem-solving? When solving a problem, people have the tendency to fixate on one problem or solution. If that one idea doesn’t work, they get stuck. To avoid getting stuck, the design process encourages people to have multiple ideas, and explore the space of possibilities before deciding on a problem or a solution. Although this works, it’s highly complex, requiring people to follow many threads at once. We show how AI and other computational tools can help simplify and speed up the most cognitively taxing aspects of the design process:
Sarah Collins Rossetti
Assistant Professor of Biomedical Informatics and Nursing, Columbia University
Talk Title: Exploiting the Signal Gain of Clinician Expertise in a Predictive Early Warning Score and CDS tool using Nursing EHR data
Abstract: Signals of clinical expertise and knowledge-driven behaviors within EHRs can be exploited to enhance predictive model performance, while increasing interpretability. The scientific premise of the CONCERN study is that while clinicians strive to provide the best care, there is a systematic problem within hospital settings of non-optimal communication between nurses and physicians leading to care delays for at-risk patients. The CONCERN model uses novel signals from nursing documentation, including natural language processing of notes, that are proxies of a nurse’s concern to predict patients at risk of deterioration. Preliminary findings include improved performance and lead time compared to leading early warning scores. Our sharable, standards-based, user-centered clinical decision support CONCERN SmartApp surfaces nurses’ concerns to the interprofessional care team and is being evaluated in a clinical trial across two large academic medical centers to decrease patient deterioration.
Courtney D. Cogburn
Associate Professor of Social Work, Columbia School of Social Work
(Moderator)
This session will highlight advances in data science: reasoning about causation as opposed to correlation, the use of transfer learning to make the most of imperfect data, parallel algorithms for large graph problems, and the interpretability of differential prediction.
Samory Kpotufe
Associate Professor, Department of Statistics, Columbia University
Talk Title: Big but Imperfect Data: Fundamental Challenges of Domain Adaptation
Abstract: In many ML applications such as healthcare, IoT, and finance, perfectly representative data is hard to obtain. However, much data from related sources is often available, although it is not adequately representative of the target application. As such, many so-called ‘domain adaptation’ approaches have been developed to harness such large but imperfect data, often with a remarkable degree of success. However, a unified understanding of how and when such imperfect data can help remains elusive, making it hard to build upon previous successes. I’ll attempt to highlight key challenges and promising directions in this problem domain.
Melanie Wall
Professor of Biostatistics (in Psychiatry), Department of Biostatistics, Mailman School of Public Health, Columbia University
Talk Title: Data Science as the Engine for a Learning Health Care Service System for First Episode Psychosis in Coordinated Specialty Care
Abstract: A key initiative in research focused on treatment for first episode psychosis (FEP) is improving the implementation of evidence-based coordinated specialty care (CSC). One area of improvement is expected to come from improved data analytics facilitated by linking different clinical sites through common data elements and a unified informatics approach for aggregating and analyzing patient-level data. Through an NIMH-funded network and partnerships with the New York Office of Mental Health and the Columbia Department of Psychiatry, data science is contributing to a learning health care model. A few examples will be presented, including the extent to which predictive modeling of patient-level outcomes, based on background variables collected at intake and throughout care, can be used to differentiate individuals in a way that is useful. Presentation of results will focus on interpretability of differential prediction across sites and usefulness for facilitating service decisions.
Elias Bareinboim
Associate Professor, Department of Computer Science, Columbia University
Talk Title: Causal Data Science
Useful links:
causalai.net/r60.pdf
pnas.org/content/113/27/7345
Abstract: Causal inference provides a set of tools and principles that allows one to combine data and structural invariances about the environment to reason about questions of counterfactual nature — i.e., what would have happened had reality been different, even when no data about this imagined reality is available. Reinforcement Learning is concerned with efficiently finding a policy that optimizes a specific function (e.g., reward, regret) in interactive and uncertain environments. These two disciplines have evolved independently and with virtually no interaction between them. In reality, however, they operate over different aspects of the same building block, i.e., counterfactual relations, which makes them umbilically tied.
Clifford Stein
Professor, Industrial Engineering, Operations Research and Computer Science, Columbia Engineering
Talk Title: Parallel Algorithms for Massive Graphs
Abstract: Large graphs model many important problems in data science. When the graph is too large to fit in the memory of one computer, standard sequential algorithms do not work, or are so slow as to be useless. We will survey some recent progress on efficient parallel algorithms whose performance scales nicely with the size of the graph for many of the well-known basic graph problems such as connectivity, spanning trees, shortest paths and matchings.
Martha Kim
Associate Professor, Computer Science, Columbia University
(Moderator)
Data Science Day included over 60 research projects from Columbia University faculty, students, and affiliated researchers. Attendees joined small breakout groups to meet and network with researchers who are building the next generation of data science methods and applications. The posters were exhibited according to their affiliation with the Data Science Institute’s research centers and working group topic areas.
DSI Industry Affiliates have access to Data Science Day posters after the event. If you are a current DSI Industry Affiliate, please contact us at datascience@columbia.edu for a link to the videos.